Modeling complexity w/ python (part 2)

Part 1 of our journey takes an abstract look @ complexity, determinism, and probabilistic graphical models. We then turn our attention to human social networks, and the growing inertia around online social network analysis.

"The Quality which creates the world emerges as a relationship between man and his experience. He is a participant in the creation of all things. The measure of all things." -- Robert Pirsig

The point of this post is to move from crazy theory to actual code. The final piece of the puzzle (an actual working implementation and source code) will be available @ www.smarttypes.org. We will update the github repo along the way.

Finding expert twitter users

We'll focus on twitter, but the design and methods outlined here should work for any online network where people link to people, and people link to content.

We have a simple, straightforward goal: Given a search phrase, we want to return the most relevant/influential twitter users for that phrase. For example, if we enter the search phrase 'Cleveland Machine Learning' we should get back a list of bad-ass Cleveland geeks.

There are a lot of ways to approach this problem. We could simply index the content each user posts, and return the users with the highest keyword frequency. We could get more sophisticated, and use term frequency–inverse document frequency (tf–idf) to give more weight to uncommon words. More sophisticated still, we can use a topic model to discover latent features (or topics) that occur across users.

In fact, we will use tf–idf and topic models, but we're missing something. We're neglecting our most interesting asset, the people!

Not long ago, social scientists spent lots to document social connections, and speculate on these connections influenced behavior. Now any schlub with a computer and an imagination can immerse himself in this brand of study.

Whether from homophily (birds of a feather flock together) or social contagion, we know like minded people tend to be connected.

high_level_info_design.jpg

Getting the data

The first step is getting some data to play w/. This is trivial if you have the cash. Hundreds of social media, data-mining and financial-services companies pay a base rate of up to $360,000 a year for Twitter's information. There are currently two companies licensed to sell twitter data -- Gnip, in Boulder, Colo. and Datasift, in Reading, U.K

If you don't have the cash, you'll have to work to get the data. Or you can just leverage my hard work. Here's a little script that uses tweepy to pull social graph data (who follows who). OAuth calls are permitted 350 requests per hour and are measured against the oauth_token used in the request. The more oauth_tokens you have (signups to use your app) the more data you can pull. Learn more about twitter rest-api rate limits here.

Getting actual tweets is a little different. You open up a connection, and drink from the stream. The default access level allows up to 400 track keywords, 5,000 follow userids, and 25 0.1-360 degree location boxes. Learn more about twitter streaming-api rate limits here. Here's an incomplete script to get tweets.

Probabilistic Models and LDA

The following is from Peter Norvig's 'On Chomsky and the Two Cultures of Statistical Learning', referring to Leo Breiman's 2001 paper Statistical Modeling: The Two Cultures

"Breiman is inviting us to give up on the idea that we can uniquely model the true underlying form of nature's function from inputs to outputs. Instead he asks us to be satisfied with a function that accounts for the observed data well, and generalizes to new, previously unseen data well, but may be expressed in a complex mathematical form that may bear no relation to the "true" function's form (if such a true function even exists)."

Latent Dirichlet Allocation (LDA) is an example of a graphical model. Learning the various distributions (the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) is a problem of bayesian inference.

Gensim - Vector Space Modeling for Humans

Looking for LDA implementations in python, i was pleasantly surprised to find Gensim. Gensim is memory independent, distributed implementation made to efficiently process large, web-scale corpora.

" The unsupervised algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover hidden (latent) semantic structure, based on word co-occurrence patterns within a corpus of training documents. Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents and so on. "

Back to our original problem

We will use LDA to cluster our social network. (For brevity's sake i'll simply point to this nice tutorial on using LDA on a social graph.) This gives use a list of communities. Each community is composed of a list of (user_participation_score, user_index) tuples.

#communities
num_communities = 2
community_lda = gensim.models.ldamodel.LdaModel(social_network, 
                                                num_topics=num_communities, 
                                                update_every=0, passes=20)
communities = community_lda.show_topics(topics=-1, formatted=False)

We then iterate through all the content posted by the users in each community, and weight the content w/ the user_participation_score score. This gives us community_text for each community weighted by user participation in that community.

#map user_corpus to community_text
community_text = defaultdict(lambda: defaultdict(int))
for i in range(len(communities)):
    for user_participation_score, user_index in communities[i]:
        for word_id, _ in user_corpus[int(user_index)]:
            community_text[i][word_id] += 1 * user_participation_score

We then use gensim's tfidf (term frequency–inverse document frequency) transformation model to give uncommon words a higher value.

Finally, we use LDA to find latent topics within the community content.

#community_corpus_topics
num_topics = 3
topic_lda = gensim.models.ldamodel.LdaModel(
    community_corpus, id2word=user_corpus_dictionary, 
    num_topics=num_topics, update_every=0, passes=20)
topics = topic_lda.show_topics(topics=-1, formatted=False)

lda_design.jpg

That's it, we're done w/ our model, now all that's left is to take a search_phrase, convert it to a topic_vector, and compare the search_topic_vector w/ all the community_topic_vectors using gensim cosine similarity.

Actually, there's a little more. We don't want to pigeonhole ourselves to one community. Imagine if our search returns two closely related communities. We don't want to just pick one community, and show the members in that community. There may be members in the other community that are more relevant than lower ranking members of our first community. To handle this we need to multiply our search-vector/topic-vector similarity score w/ the user_participation_score. This is the final step.

lda_design.jpg

Conclusion

We can do more w/ our model. The first thing that comes to mind is incorporating time to see moving trends and predict the future. We also need to think about persistence, concurrency, and scaling to support big data. We'll save that for another day. The final piece of the puzzle (the actual working implementation and source code) will be available @ www.smarttypes.org. We will update the github repo along the way.

I hope these posts were an inspiring introduction to complexity, social network analysis, and graphical models. If you're interested in this stuff i recommend: