Graphlab datasets

Graphlab is the workhorse behind the smarttypes Twitter clustering algorithm. This page houses an example dataset and a short description of the data it contains.

smarttypes_pmf

smarttypes_pmf is an adjacency matrix in graphlab pmf format: 7,876 users x 7,876 users, for 62,031,376 entries in total. Each entry is a 1 or 0, indicating whether or not the person on the y axis follows the person on the x axis. Here's a nice description of adjacency matrices on Wikipedia.
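
To make the layout concrete, here is a minimal sketch of how a matrix like this could be built from a list of follow pairs. The three-user sample and the build_adjacency helper are hypothetical, and the sketch assumes rows are the y axis (the follower) and columns are the x axis (the person followed). It is not the code that produced smarttypes_pmf.

```python
def build_adjacency(users, follows):
    """Return an n x n list-of-lists matrix of 0/1 follow flags.

    users   -- ordered list of user ids (hypothetical sample data)
    follows -- iterable of (follower, followed) pairs
    Assumption: row = follower (y axis), column = followed (x axis).
    """
    index = {user: i for i, user in enumerate(users)}
    n = len(users)
    matrix = [[0] * n for _ in range(n)]
    for follower, followed in follows:
        # Skip pairs that mention a user outside the sample.
        if follower in index and followed in index:
            matrix[index[follower]][index[followed]] = 1
    return matrix


users = ["alice", "bob", "carol"]
follows = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]
matrix = build_adjacency(users, follows)

# Every ordered pair of users gets a cell, so n users give n * n cells.
entry_count = len(matrix) * len(matrix[0])
```

With 7,876 users the same calculation gives 7,876 x 7,876 = 62,031,376 cells, which is the figure quoted above.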

I got the 7,876 users by looking at the people I follow on Twitter, and the people they follow. See smarttypes_to_graphlab.py for more details on going from Twitter users to graphlab.
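
The selection described above is a two-hop walk over the follow graph. A rough sketch, assuming the follow lists are already available as a plain dictionary (the collect_users helper and the sample data are made up for illustration; smarttypes_to_graphlab.py is the real reference):

```python
def collect_users(me, following):
    """Return the people `me` follows plus everyone those people follow.

    following -- dict mapping a user to the list of users they follow
                 (hypothetical stand-in for data fetched from Twitter)
    """
    first_hop = set(following.get(me, ()))
    users = set(first_hop)
    for user in first_hop:
        users.update(following.get(user, ()))
    # Sort so each user gets a stable position in the matrix.
    return sorted(users)


following = {
    "me": ["a", "b"],
    "a": ["c", "me"],
    "b": ["c", "d"],
    "c": ["e"],  # "c" is a second-hop user, so "e" is not collected
}
selected = collect_users("me", following)
```

Note that the starting user only ends up in the set if one of the people they follow follows them back.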

The factorization took 3,421.6 s (about 57 minutes) on an EC2 High-CPU Extra Large instance, with this command:

./pmf smarttypes_pmf 11 --float=true --scheduler="round_robin(max_iterations=10,block_size=1)" --ncpus=7 --binaryoutput=true --zero=true --D=100 --desired_factor_sparsity=0.8 --lambda=0.001

index_to_twitter_id.pickle

index_to_twitter_id.pickle is a pickled Python dictionary mapping a user's position in the graphlab pmf to their Twitter id. This is useful if you want to find out more about the users being clustered. See graphlab_to_smarttypes.py for more details on going from graphlab to the actual Twitter user.
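
A minimal sketch of reading such a mapping with Python's pickle module. To keep the example runnable on its own it first writes a small stand-in file with made-up ids; the key type and whether positions start at 0 or 1 are assumptions, so check the real file before relying on either. The twitter_ids_for helper is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in data so the sketch runs on its own; these ids are made up.
sample_mapping = {0: 1001, 1: 1002, 2: 1003}
path = os.path.join(tempfile.mkdtemp(), "index_to_twitter_id.pickle")
with open(path, "wb") as f:
    pickle.dump(sample_mapping, f)

# Loading the real file would look the same, with its own path.
with open(path, "rb") as f:
    index_to_twitter_id = pickle.load(f)


def twitter_ids_for(indices, mapping):
    """Translate matrix positions into Twitter ids, preserving order."""
    return [mapping[i] for i in indices]


ids = twitter_ids_for([2, 0], index_to_twitter_id)
```

If the real file was written by Python 2 and fails to load under Python 3, passing an encoding argument to pickle.load (for example encoding="latin1") may help.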

Shoot me an email at hello@smarttypes.org with any questions.