Graphlab datasets

Graphlab is the workhorse behind the smarttypes Twitter clustering algorithm. This page houses an example dataset and a short description of the data in it.


smarttypes_pmf is an adjacency matrix in graphlab pmf format. It is 7,876 users x 7,876 users, which gives 62,031,376 edges (one per ordered pair of users). Each edge is a 1 or 0, indicating whether or not the person on the y axis follows the person on the x axis. Here's a nice description of adjacency matrices on Wikipedia.
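
To make the layout concrete, here's a minimal sketch of a follow-graph adjacency matrix in plain Python. The names (follows, build_adjacency) and the three toy users are made up for illustration, and this builds an in-memory list of lists, not the graphlab pmf file itself.

```python
# Hypothetical toy data: each user maps to the set of accounts they follow.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": set(),
}


def build_adjacency(follows):
    """Return (users, matrix) where matrix[y][x] is 1 if users[y] follows users[x]."""
    users = sorted(follows)
    position = {user: i for i, user in enumerate(users)}
    n = len(users)
    matrix = [[0] * n for _ in range(n)]
    for follower, followed_set in follows.items():
        for followed in followed_set:
            if followed in position:  # skip accounts outside the user set
                matrix[position[follower]][position[followed]] = 1
    return users, matrix


users, matrix = build_adjacency(follows)
for user, row in zip(users, matrix):
    print(user, row)
```

Rows are the y axis (the follower) and columns are the x axis (the person being followed), so with n users the matrix always has n x n cells, whether or not a given cell is a 1.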

I got the 7,876 users by looking at the people I follow on Twitter, and the people they follow. See for more details on going from twitter users to graphlab.
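
The selection described above is a two-hop walk over the follow graph. Here's a rough sketch, assuming the follow lists have already been fetched into a dictionary; the following dict, the two_hop_users function and the toy names are hypothetical, and no Twitter API calls are shown.

```python
def two_hop_users(seed, following):
    """Return the users the seed follows, plus everyone those users follow."""
    first_hop = set(following.get(seed, ()))
    users = set(first_hop)
    for user in first_hop:
        users.update(following.get(user, ()))
    return users


# Hypothetical pre-fetched follow lists: user -> accounts they follow.
following = {
    "me": ["a", "b"],
    "a": ["c", "me"],
    "b": ["c", "d"],
}

selected = two_hop_users("me", following)
print(sorted(selected))
```

Note that the seed account only ends up in the set if someone it follows happens to follow it back. Whether the real dataset includes the seed isn't stated here.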

The factorization took 3421.6s on an EC2 High-CPU Extra Large Instance.

./pmf smarttypes_pmf 11 --float=true --scheduler="round_robin(max_iterations=10,block_size=1)" --ncpus=7 --binaryoutput=true --zero=true --D=100 --desired_factor_sparsity=0.8 --lambda=0.001


index_to_twitter_id.pickle is a pickled Python dictionary mapping a user's position in the graphlab pmf to their Twitter id. This is useful if you want to find out more about the users being clustered. See for more details on going from graphlab to the actual twitter user.
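
Here's a small sketch of loading the mapping and looking a user up. To keep it runnable without the download, it writes a stand-in pickle with made-up ids first. It assumes the keys are integer positions; whether the real file counts from 0 or 1 is something to check against the data. The load_index_map helper is just for illustration.

```python
import os
import pickle
import tempfile


def load_index_map(path):
    """Load the pickled dict mapping matrix position -> twitter id."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in file with made-up ids; point demo_path at the real pickle instead.
demo_path = os.path.join(tempfile.mkdtemp(), "index_to_twitter_id.pickle")
with open(demo_path, "wb") as f:
    pickle.dump({0: 1001, 1: 1002, 2: 1003}, f)

index_to_twitter_id = load_index_map(demo_path)

# Invert the mapping to go from a twitter id back to a matrix position.
twitter_id_to_index = {tid: idx for idx, tid in index_to_twitter_id.items()}

print(index_to_twitter_id[1])
```

If the real file was written by Python 2 and fails to load under Python 3, passing encoding="latin1" to pickle.load may help. As with any pickle, only load files you trust.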

Shoot me an email @ w/ any questions.