Extracting Skills from Personal Communication Data using StackExchange Dataset

This blog post summarizes our work published at ACM CIKM. The project automatically profiles the skills of users by analyzing their personal communication data. We treated this as a prediction problem: given a user's messages, predict that user's skills. As a training set we used the Stack Exchange dataset, which is freely available here. There are many Stack Exchange websites, such as stackoverflow, cs, datascience, physics, history and so on. The dataset therefore covers a diverse set of skills, and it is updated automatically as new technologies come to the fore.

Building the Knowledge Base:

A post on Stack Exchange is either a question, an answer or a comment. Each post is associated with a set of tags, and we treat these tags as skills. We use the Stack Exchange knowledge base as a training set for predicting tags. For this project we implemented a k-NN multi-label classification model using Lucene, a text search engine library written in Java, which we accessed through PyLucene, its Python wrapper. To build the search engine, we first index every post with two fields, ‘text’ and ‘tags’. The ‘text’ field holds the body of the post after some preprocessing. This is done for all Stack Exchange websites, and everything is written to a single index on the file system.
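To make the two-field layout concrete, here is a minimal sketch of such an index in plain Python. It is not the PyLucene code from the project: the `PostIndex` class, the `preprocess` function and the preprocessing steps themselves (HTML stripping, lowercasing, tokenizing) are assumptions made for illustration, and a real Lucene index would also handle analysis, scoring and on-disk storage.

```python
import re
from collections import Counter, defaultdict


def preprocess(body):
    """Assumed preprocessing: strip HTML tags, lowercase, split into tokens."""
    no_html = re.sub(r"<[^>]+>", " ", body)
    # Keep '+' and '#' so that tokens such as "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", no_html.lower())


class PostIndex:
    """Toy in-memory stand-in for the Lucene index: one document per post."""

    def __init__(self):
        self.docs = []                    # each entry holds the 'text' and 'tags' fields
        self.postings = defaultdict(set)  # term -> ids of the posts containing it

    def add_post(self, body, tags):
        tokens = preprocess(body)
        doc_id = len(self.docs)
        self.docs.append({"text": Counter(tokens), "tags": list(tags)})
        for term in set(tokens):
            self.postings[term].add(doc_id)
        return doc_id


# Hypothetical example posts; posts from every site would go into the same index.
index = PostIndex()
index.add_post("<p>How do I join two RDDs in Spark?</p>", ["apache-spark", "scala"])
index.add_post("Gradient descent does not converge", ["machine-learning"])
```

Only the ‘text’ field is searchable here (through `postings`), while ‘tags’ is simply stored with each document so it can be returned with a hit.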

Extracting the skills:

Given a set of an individual's messages (from an instant messaging platform), the task is to predict the tags for each message. Each message is used as a query to the search engine, and the search runs on the ‘text’ field. Every tag has a score, initialized to zero. For each message we retrieve the top k most similar posts, along with their similarity values and tags, and add each post's similarity value to the scores of its tags. In the end, we found that dividing each tag's score by the number of times that tag occurred gave good results. Finally, the tags with the largest scores are declared the skills of the individual.
The pseudocode is given below:

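for each user in set_of_users:
    set score[tag] = 0 and count[tag] = 0 for all tags
    for each message generated by user:
        for each (sim_score, tags) in find_similar_posts(message, k):
            for each tag in tags:
                score[tag] += sim_score
                count[tag] += 1
    score[tag] = score[tag] / count[tag] for every tag with count[tag] > 0
    skills of user = tags with the largest score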
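The scoring loop can be sketched as runnable Python. This is an illustration under stated assumptions, not the project's code: `POSTS` is a made-up three-post knowledge base, `find_similar_posts` uses cosine similarity over raw term counts in place of Lucene's ranking, and `extract_skills`, `k` and `top_n` are hypothetical names. It also assumes that "count of occurrences" means how often a tag appeared among the retrieved posts.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy knowledge base of (text, tags) pairs standing in for the index.
POSTS = [
    ("how to join two rdd in spark", ["apache-spark", "scala"]),
    ("spark streaming with kafka", ["apache-spark", "kafka"]),
    ("gradient descent does not converge", ["machine-learning"]),
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar_posts(message, k=2):
    """Return up to k (similarity, tags) pairs, most similar first."""
    query = Counter(tokenize(message))
    scored = [(cosine(query, Counter(tokenize(text))), tags) for text, tags in POSTS]
    scored = [hit for hit in scored if hit[0] > 0]  # drop posts with no overlap
    return sorted(scored, key=lambda hit: hit[0], reverse=True)[:k]


def extract_skills(messages, k=2, top_n=3):
    """Accumulate similarity per tag, average by tag count, keep the top tags."""
    score, count = defaultdict(float), defaultdict(int)
    for message in messages:
        for sim_score, tags in find_similar_posts(message, k):
            for tag in tags:
                score[tag] += sim_score
                count[tag] += 1
    averaged = {tag: score[tag] / count[tag] for tag in score}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_n]


skills = extract_skills(["my spark job fails on join", "kafka offsets in spark streaming"])
```

Dividing by the count keeps a tag that shows up in many weak matches from outranking one that shows up in a few strong matches.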

We tested this application on the Apache Spark mailing lists (link) to extract the skills of the users taking part in the discussions. The results are shown below:

User Email                  Skills
sowen at cloudera.com       apache spark, hadoop, java, scala, spark
joseph at databricks.com    machine learning, apache spark, gradient descent, flex, random forest
git at git.apache.org       github, git, jira, infrastructure, bugzilla

The code can be found on GitHub here. The repo also contains the code to extract the files from the Apache Spark mailing list.
