Overlapping community detection in labeled graphs

In a recent paper with Esther Galbrun and Nikolaj Tatti, presented in the journal of Data Mining and Knowledge Discovery, we worked on the problem of discovering overlapping communities in networks with labeled vertices. The model is motivated by social networks, where vertex labels are used to represent information about individuals, such as occupation, hobbies, preferences, etc. The hypothesis is that the vertex labels can be used to derive and explain the community structure in the network. For example, if a set of individuals in a social network are friends with each other and if they all mention that they are Aalto University graduates, it is reasonable to assume that this is a community of classmates in Aalto.

Motivated by the above example, we formulate the problem of finding dense communities in a network so that each community can be described by a small set of labels. An illustration of the idea is shown in the toy network below.  In this case we have discovered two communities, one specified by the labels {‘a’,’c’}, and one specified by the label {‘b’}. Note that we allow vertices to belong to multiple communities, but we require that each edge is assigned to a single community. Note also that not all vertices and all edges need to be assigned to a community.

 

toy

A real-world example is shown below. In this case the graph corresponds to the ego-network of Christos Papadimitriou in the DBLP co-authorship graph, and the labels are keywords extracted from paper titles. We see co-authors of Christos grouped in different communities, each one specified by a small set of keywords and corresponding to a different research area.

pap

One of the interesting aspects of our work was comparing with other state-of-the-art methods. As it often happens in the area of data mining, a new paper addresses a new problem that opts to optimize an objective that does not already exist in the literature. In this context, comparison with existing methods can be quite tricky. A reason that so many papers in data mining address new problems, is that as new interesting datasets become available and as novel applications emerge, defining the right problem and selecting the right objective to work is part of the challenge. The situation here is somewhat in contrast with other areas in computer science, such as theoretical computer science, where problems are typically more crisply defined and progress with respect to state-of-the-art is easier to quantify.

In our work, to obtain a fair comparison we tried to be as comprehensive as possible: we tested many different methods and measured the performance of those methods on a variety of objectives, even objectives that none of the methods optimizes directly. To visualize the results we used polar charts, where each objective is normalized between values 0 and 1.  We redefined scores, when needed, so that values closer to 1 indicate good performance.

Using this approach, if a method produces results that cover a large area in the polar chart, this tends to indicate good performance. However, this cannot be taken to an extreme, as one would not know how to weight the importance of the different axes in the polar chart, and as some measures are neutral. Nevertheless, we found that this visualization approach allows us to obtain a better sense for the behavior of the different methods and a better understand their trade offs.

The paper can be found here, and the code that implements our methods here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s