Putting StackOverflow tags into Graphistry

In my previous article, I described how to use my script to show relations between users and tags. This time, we will go into the technical details: crawling StackOverflow, transforming the data into a graph, and using Graphistry to draw it.

Technical side

You can download the whole application from the following repository: https://github.com/data-hunters/tech-skills-visualizer. Now, let’s go through the most interesting parts.

Crawler

The core of the script is the crawler. I use the stackapi package. To start fetching data, we need to create a client object:

from stackapi import StackAPI

site = StackAPI('stackoverflow', max_pages=100)  # the first argument is the Stack Exchange site name

Here is the tricky part. We use the official, free API, which doesn’t require registration, but it has its limits: if you reach the threshold, you will be blocked for 24 hours. I wasn’t able to crawl the whole dataset for the bigdata tag, so I had to decrease max_pages. Unfortunately, this means we don’t have all the data, but since we sort questions by the number of votes and tags by popularity, we get a fairly representative sample.

Crawling starts with the questions endpoint and runs through the following steps.

Gathering the list of questions linked to the tags provided by the user

questions = self.site.fetch('questions', tagged=tags, sort='votes', order='desc')

Building chunks of 100 question IDs

IDs need to be converted into strings and grouped into chunks (100 is the maximum number of IDs per request) to fetch the data from the StackOverflow API efficiently.

question_ids = [str(q['question_id']) for q in questions['items']]
chunked_question_ids = chunk(question_ids, 100)
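The chunk helper is not part of stackapi; a minimal sketch of such a function (assuming it simply splits a list into fixed-size sublists, which is what the call above expects) could look like this:

def chunk(elements, size):
    # Split a list into consecutive sublists of at most `size` elements each
    return [elements[i:i + size] for i in range(0, len(elements), size)]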

Crawling answers based on chunks

Each chunk needs to be converted into a semicolon-separated string of IDs, which becomes part of the endpoint path. We also want to sort by votes, descending. Here, we have two important elements:

Filtering out users who don’t exist.

Building a map (that will be used in the last step), where the key is the user ID and the value is the user name.

After the loop, we remove duplicated users to avoid crawling them more than once.

all_users = []
users_map = {}
for q_ids_chunk in chunked_question_ids:
    # Question IDs are joined with semicolons, as required by the Stack Exchange API
    answers = self.site.fetch(f'questions/{";".join(q_ids_chunk)}/answers', sort='votes', order='desc')
    user_ids = []
    for a in answers['items']:
        # Skip answers whose author account no longer exists
        if a['owner']['user_type'] != self.U_NOT_EXIST:
            user_ids.append({'id': a['owner']['user_id'], 'name': a['owner']['display_name']})
    all_users = all_users + user_ids
    # Remember each user's display name for the last step
    for u in user_ids:
        users_map[u['id']] = u['name']
# Deduplicate users and keep their IDs as strings so they can be joined into the endpoint path later
all_users = [str(u['id']) for u in get_unique_elements(all_users)]

Collecting users’ tags

In the last step, we fetch the tags of all the collected users using the same chunking approach. We need to iterate over the items field to retrieve the tag name, the number of posts belonging to a given account, and the user ID. As you can see, we also use the map built in the previous step to get the name of the user.

# Reuse the chunking approach for the collected user IDs
chunked_user_ids = chunk(all_users, 100)

tags = []
for u_ids_chunk in chunked_user_ids:
    fetched_users_tags = self.site.fetch(f'users/{";".join(u_ids_chunk)}/tags', sort='popular', order='desc')
    users_tags = []
    for t in fetched_users_tags['items']:
        # Resolve the user name via the map built in the previous step
        users_tags.append({'name': t['name'], 'count': t['count'], 'user_id': t['user_id'],
                           'user_name': users_map[t['user_id']]})
    tags.extend(users_tags)

Transforming data into a graph

To show the data as a graph, we need to build two collections:

  1. Nodes – in our case, we will have two types: users and tags. A graph with at least two different types of nodes (or edges) is called heterogeneous.
  2. Edges connecting the user and tag nodes. In our case, a single edge tells us that the user posted at least one answer to a question marked with the tag. Each edge points from the user to a given tag. A graph whose edges have a direction is called a directed graph (see the sketch below).
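To make this concrete, here is a hypothetical example (the values are made up) of a single tag node, a user node, and the edge between them, using the field names produced by the helper functions shown later:

# Hypothetical sample data, not real crawled records
tag_node = {'id': 'apache-spark', 'label': 'apache-spark'}   # tag/technology node
user_node = {'id': '12345', 'label': 'jane_doe'}             # user node
# Directed edge: user 12345 answered at least one question tagged apache-spark;
# tags_count is the number of the user's posts under that tag
edge = {'tag_id': 'apache-spark', 'user_id': '12345', 'tags_count': 42}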

Building an edge list

It’s pretty simple, as the final list produced by our crawler contains nothing but edges. Graphistry supports Pandas DataFrames, so we convert our list into a DataFrame with the right columns.

import pandas as pd

def build_final_edges(edges, edges_id_field='name', user_id_field='user_id', tags_count_field='count'):
    return pd.DataFrame.from_dict({
        'tag_id': [e[edges_id_field] for e in edges],
        'user_id': [str(e[user_id_field]) for e in edges],
        'tags_count': [e[tags_count_field] for e in edges]
    })

Building a node list

We want to do this using the edge list. First, we create two lists of unique elements: tags/technologies and users. Each element of the edge list is a dictionary, and we cannot use a set because dicts are not hashable. Instead, we build a temporary dictionary where the key is our unique identifier (the tag ID or user ID) and the value is the whole object; this removes the duplicates, and we can then simply take the values of the dictionary. The following method does exactly that:

def get_unique_elements(users, id_field='id'):
    return {u[id_field]: u for u in users}.values()
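For example, with hypothetical duplicated input:

users = [{'id': 1, 'name': 'alice'}, {'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]
print(list(get_unique_elements(users)))
# [{'id': 1, 'name': 'alice'}, {'id': 2, 'name': 'bob'}]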

Once we’ve used this method to build two lists, we can build a base node list.

def build_base_nodes(technologies, users):
    nodes = [{'id': t['name'], 'label': t['name']} for t in technologies]
    nodes += [{'id': u['user_id'], 'label': u['user_name']} for u in users]
    return nodes

It would also be nice to have two different colours for tags/technologies and users. Let’s assign blue to users and red to technologies.

def build_node_colors(technologies, users, tech_color=0xFF000000, user_color=0x0000FF00):
    # One colour per node, in the same order as in build_base_nodes: technologies first, then users
    node_colors = [tech_color] * len(technologies)
    node_colors += [user_color] * len(users)
    return node_colors

Finally, we can build our node list.

def build_final_nodes(nodes, node_colors, id_field='id', label_field='label'):
    return pd.DataFrame.from_dict({
        'id': [str(n[id_field]) for n in nodes],
        'label': [n[label_field] for n in nodes],
        'color': node_colors
    })
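Putting the helpers together might look as follows. This is only a sketch: it assumes tags is the list produced by the crawler in the “Collecting users’ tags” step, and it introduces the variable names nodes_df and edges_df, which the plotting code below reuses.

# Unique technologies (keyed by tag name) and users (keyed by user ID)
technologies = get_unique_elements(tags, id_field='name')
users = get_unique_elements(tags, id_field='user_id')

nodes = build_base_nodes(technologies, users)
node_colors = build_node_colors(technologies, users)

nodes_df = build_final_nodes(nodes, node_colors)
edges_df = build_final_edges(tags)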

Drawing a graph

Graphistry provides a library for pushing data via their API. It’s essentially a paid service, and you have four options to choose from:

  1. Create a free account on Graphistry Hub and use a free tier (we’ll choose this one). It has limited resources and functionality.
  2. Buy a subscription on Graphistry Hub.
  3. Deploy Graphistry in the cloud (AWS or Azure).
  4. Deploy anywhere including the on-premise option.

The free tier is sufficient for the purposes of this tutorial. Once you have an account and data in the form of edges and nodes, it’s pretty easy to draw a graph. Let’s start by registering the client.

graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username='<USERNAME>', password='<PASSWORD>') 

I’ve mentioned that Graphistry supports DataFrames, but it handles Arrow structures more efficiently, so before we call the plot method, let’s convert the nodes and edges to Arrow Tables.

import pyarrow as pa

edges_arr = pa.Table.from_pandas(edges_df)
nodes_arr = pa.Table.from_pandas(nodes_df)

Now, we can plot the graph.

source_col = 'user_id'
dest_col = 'tag_id'
url = (graphistry.edges(edges_arr, source_col, dest_col)
       .nodes(nodes_arr)
       .bind(source=source_col,
             destination=dest_col,
             node='id',
             point_color='color',
             point_title='label',
             edge_weight='tags_count')
       .settings(url_params={
           'edgeOpacity': 0.4,
           'edgeSize': 25,
           'pointsOfInterestMax': 100
       })
       .plot(render=False))
print(url)

First, we pass the edges together with the names of the columns that contain the source and target IDs. The most interesting methods are as follows:

  • bind – here, we specify the column names for the other elements, such as the colour and label of a node or the weight of an edge.
  • settings – these are the options I recommended at the beginning of this article; they reflect my personal preferences.
  • plot – this method triggers the actual action. With render=False, it just returns the URL to the graph instead of opening it in a web browser.

And that’s pretty much all 🙂

Running the application

If you want to run my script, you need Python 3 installed. Once you have it, follow these steps:

  • Clone the repo: git clone git@github.com:data-hunters/tech-skills-visualizer.git
  • Go to the project: cd tech-skills-visualizer
  • Install the requirements: pip3 install -r requirements.txt
  • Run the script: python3 tsvis/run.py <tags> <max_pages>, where <tags> is a list of tags separated by semicolons and <max_pages> determines how many pages will be fetched per StackExchange endpoint. Tags are joined with the AND operator, so to fetch data for the bigdata and spark tags while crawling at most 65 pages, run: python3 tsvis/run.py "bigdata;spark" 65 (the quotes prevent the shell from treating the semicolon as a command separator).

You can also use Binder and run the script in a web browser. Check out our step-by-step instructions on YouTube: https://www.youtube.com/watch?v=j-LOCUM1dp4
