This blog post summarizes our work published at ACM CIKM. The project automatically profiles the skills of users by analyzing their personal communication data. We treated this as a prediction problem: given a user's messages, predict that user's skills. As a training set we used the Stack Exchange dataset, which is freely available here. There are many Stack Exchange websites, such as stackoverflow, cs, datascience, physics, history and so on. The dataset therefore covers a diverse set of skills, and it is updated automatically as new technologies come to the fore.
Building the Knowledge Base:
A post on Stack Exchange is either a question, an answer or a comment. Each post is associated with a set of tags, and we treat these tags as skills. We use the Stack Exchange knowledge base as a training set to predict the tags. In this project, we decided to implement a K-NN multi-label classification model using Lucene. Lucene is a text search engine written in Java; we used PyLucene, a Python wrapper around Lucene. To build the search engine, we first index all the posts with two fields, ‘text’ and ‘tags’. The ‘text’ field is the body of the post after some preprocessing. This is done for all Stack Exchange websites, and the posts are all indexed together on one file system.
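The indexing step can be illustrated with a minimal pure-Python sketch. It is not the PyLucene code from the project: the plain inverted index stands in for Lucene, the record layout (`body`, `tags`) is hypothetical, and the preprocessing shown (stripping HTML, unescaping entities, lowercasing) is only an assumption about what "some preprocessing" might involve.

```python
import html
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")          # matches HTML tags such as <p> or </code>
TOKEN_RE = re.compile(r"[a-z0-9+#]+")    # keeps tokens like "c++" or "c#" together


def preprocess(body):
    """Assumed preprocessing: drop HTML markup, unescape entities, lowercase."""
    text = TAG_RE.sub(" ", body)
    text = html.unescape(text)
    return " ".join(text.lower().split())


def build_index(posts):
    """Store each post with 'text' and 'tags' fields and index its tokens."""
    documents = []
    inverted = defaultdict(set)          # token -> ids of documents containing it
    for post in posts:
        doc = {"text": preprocess(post["body"]), "tags": list(post["tags"])}
        doc_id = len(documents)
        documents.append(doc)
        for token in TOKEN_RE.findall(doc["text"]):
            inverted[token].add(doc_id)
    return documents, inverted


# Hypothetical posts, made up for illustration only.
posts = [
    {"body": "<p>How do I cache an <code>RDD</code> in Spark?</p>",
     "tags": ["apache-spark", "scala"]},
    {"body": "<p>Random forest vs gradient boosting?</p>",
     "tags": ["machine-learning", "random-forest"]},
]
documents, inverted = build_index(posts)
print(documents[0])
print(sorted(inverted["spark"]))
```

Only the ‘text’ field is tokenized and searched here; the ‘tags’ field is stored alongside it so that the tags can be read back from whichever posts a query retrieves.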
Extracting the skills:
Given a set of messages written by an individual (from an instant messaging platform), the task is to predict the tags for each message. Each message is used as a query to the search engine, and the search is run on the ‘text’ field. Every tag has a score, which is initialized to zero. For each message, we retrieve the top k most similar posts, along with their similarity values and tags, and add each post's similarity value to the score of each of its tags. We found that dividing each tag's accumulated score by the number of times the tag occurred gave good results. Finally, the tags with the largest values are declared to be the skills of the individual.
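The pseudocode is given below:

    for each user in set_of_users:
        score[tag] = 0 and count[tag] = 0 for all tags
        for each message generated by user:
            for each (sim_score, tags) in find_similar_posts(message):   # top k posts
                for each tag in tags:
                    score[tag] += sim_score
                    count[tag] += 1
        skills[user] = tags with the largest score[tag] / count[tag]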
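The scoring loop above can be sketched in runnable Python. This is a toy illustration under stated assumptions, not the project's code: `find_similar_posts` here uses a simple bag-of-words cosine similarity as a stand-in for Lucene's ranking, the posts are made up, and the parameters `k` and `top_n` are hypothetical. The division by the tag count follows the normalization described in the text.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase and split text into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_similar_posts(message, posts, k=2):
    """Return up to k (similarity, tags) pairs; a stand-in for the Lucene query."""
    query = Counter(tokens(message))
    scored = [(cosine(query, Counter(tokens(p["text"]))), p["tags"]) for p in posts]
    scored = [pair for pair in scored if pair[0] > 0]   # drop posts with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def profile_user(messages, posts, k=2, top_n=3):
    """Accumulate similarity per tag, normalize by tag count, return the top tags."""
    score = defaultdict(float)
    count = defaultdict(int)
    for message in messages:
        for sim, tags in find_similar_posts(message, posts, k):
            for tag in tags:
                score[tag] += sim
                count[tag] += 1
    averaged = {tag: score[tag] / count[tag] for tag in score}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_n]


# Hypothetical indexed posts and user messages, for illustration only.
posts = [
    {"text": "how do i cache an rdd in spark", "tags": ["apache-spark", "scala"]},
    {"text": "random forest versus gradient boosting", "tags": ["machine-learning"]},
]
messages = ["spark rdd cache is slow", "should we cache the rdd"]
skills = profile_user(messages, posts)
print(skills)
```

Dividing by the count keeps a tag that merely appears in many weakly matching posts from outranking a tag that appears in fewer but closer matches.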
We tested this application on the Apache Spark mailing lists (link) to extract the skills of the users taking part in the discussions. The results are shown below:
User Email | Skills |
---|---|
sowen at cloudera.com | apache spark, hadoop, java, scala, spark |
joseph at databricks.com | machine learning, apache spark, gradient descent, flex, random forest |
git at git.apache.org | github, git, jira, infrastructure, bugzilla |
The code can be found on GitHub here. The repo also contains the code used to extract files from the Apache Spark mailing list.