In our recent KDD paper with Polina Rozenshtein, Aris Anagnostopoulos, and Nikolaj Tatti, we worked on the problem of finding events in graphs. We abstracted the event-finding problem with the following simple formulation: given a graph with node weights and edge distances, find a subset of nodes (the event) that has a large total weight and is well connected. In the paper we showed how to use this formulation to find interesting events in real-world datasets. For example, using data crawled from the Barcelona public bike-sharing system, we were able to find the time and place of various city festivals, such as the Primavera Sound music festival on June 1, 2012 (left figure below), and the neighborhood festival of Poblenou on September 18, 2012 (right figure).
Besides producing nice figures like these, the paper makes some interesting algorithmic connections, which I would like to discuss here. First, let’s go back to the objective function and see what it means for the nodes of the event subgraph to be well connected. There are many ways to express this. One way that we consider in the paper is to minimize the sum of distances over all pairs of nodes in the subgraph. Since we also want to find nodes with large weights, we can combine the two objectives by maximizing the sum of node weights minus the sum of all pairwise distances (plus a constant term to ensure that the objective is non-negative).
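To make the objective concrete, here is a small sketch that evaluates it by brute force on a toy instance. The node weights and pairwise distances below are made-up illustration data, not from the paper; brute force is of course only feasible on tiny examples.

```python
from itertools import combinations

# Toy instance (hypothetical data): node weights and pairwise distances.
weights = {"a": 3.0, "b": 2.0, "c": 2.5, "d": 0.5}
dist = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("a", "d"): 4.0,
    ("b", "c"): 0.5, ("b", "d"): 3.5, ("c", "d"): 4.5,
}

def objective(subset):
    """Sum of node weights minus sum of pairwise distances within the subset."""
    w = sum(weights[v] for v in subset)
    d = sum(dist[tuple(sorted(p))] for p in combinations(subset, 2))
    return w - d

# Brute force over all non-empty subsets: only feasible for tiny instances.
nodes = list(weights)
best = max(
    (frozenset(s) for r in range(1, len(nodes) + 1)
     for s in combinations(nodes, r)),
    key=objective,
)
print(sorted(best), objective(best))
```

On this toy instance the winner is the cluster of heavy, mutually close nodes, while the far-away node "d" is left out, which is exactly the intended behavior of the objective.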
This resembles the densest-subgraph problem. While in general finding dense subgraphs is a very difficult computational task, there is one formulation that is “easy.” In particular, it is possible to find in polynomial time a subgraph with maximum average degree. This was shown in 1984 by Andrew Goldberg, in a technical report, and the proof involves a very pretty mapping of the density problem to the min-cut problem.
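To see what “maximum average degree” means, here is a brute-force sketch on a toy graph (a clique with a pendant node; the graph is my own made-up example). Goldberg’s min-cut construction finds the same answer in polynomial time; brute force is shown only to illustrate the objective.

```python
from itertools import combinations

# Toy graph (hypothetical example): a 4-clique {a,b,c,d} plus a pendant node e.
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
nodes = sorted({v for e in edges for v in e})

def density(subset):
    """Average degree of the induced subgraph: 2|E(S)| / |S|."""
    s = set(subset)
    internal = sum(1 for u, v in edges if u in s and v in s)
    return 2 * internal / len(s)

# Exponential brute force over all non-empty subsets.
best = max(
    (s for r in range(1, len(nodes) + 1)
     for s in combinations(nodes, r)),
    key=density,
)
print(best, density(best))
```

The 4-clique has average degree 3, while adding the pendant node dilutes it to 2.8, so the clique is the densest subgraph here.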
Inspired by Goldberg’s technique, we can map the event-finding problem to a graph-cut problem, but this time to max-cut. This is an important difference: while the min-cut problem is polynomial-time solvable, the max-cut problem is NP-hard. However, an approximation algorithm due to Goemans and Williamson, based on semidefinite programming (SDP), provides an approximation guarantee of 0.878 for the max-cut problem. So by applying Goldberg’s technique we can map our problem to max-cut, and then use the SDP method to obtain the same approximation factor of 0.878. Pretty neat.
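Running Goemans–Williamson requires an SDP solver, so as a lightweight stand-in here is a sketch of the classic local-search heuristic for max-cut (flip any node that improves the cut), which guarantees a cut of at least half the edges — much weaker than the 0.878 of the SDP method. The 5-cycle below is a made-up toy graph; brute force gives the exact optimum for comparison.

```python
from itertools import combinations

# Toy graph (hypothetical): a 5-cycle, whose max cut is 4 of its 5 edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]
nodes = sorted({v for e in edges for v in e})

def cut_value(side):
    """Number of edges crossing the cut (side, complement)."""
    return sum(1 for u, v in edges if (u in side) != (v in side))

# Exact optimum by brute force (exponential; only for tiny graphs).
best = max(
    (set(s) for r in range(len(nodes) + 1) for s in combinations(nodes, r)),
    key=cut_value,
)

# Local search: flip any node that strictly improves the cut.
# Guarantees a cut of at least half the edges -- far weaker than
# Goemans-Williamson's 0.878 factor, but it needs no SDP solver.
side = set()
improved = True
while improved:
    improved = False
    for v in nodes:
        flipped = side ^ {v}
        if cut_value(flipped) > cut_value(side):
            side, improved = flipped, True
print(cut_value(best), cut_value(side))
```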
What about practical considerations? Solving SDPs does not scale to very large graphs. Fortunately, one can show that our objective function is submodular, so we can apply a randomized greedy algorithm and obtain an approximation guarantee of 1/2. And while the SDP method is theoretically superior, in practice the greedy is just as good.
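As a sketch of the randomized greedy, here is the double-greedy algorithm of Buchbinder et al. for unconstrained non-negative submodular maximization, which achieves the 1/2 guarantee in expectation. The toy weights and distances are hypothetical, and the constant C is a crude choice that keeps the objective non-negative on every subset.

```python
import random
from itertools import combinations

random.seed(0)  # reproducible run

# Toy instance (hypothetical): same style of objective as above, with a
# constant C making f(S) >= 0 for every subset S.
weights = {"a": 3.0, "b": 2.0, "c": 2.5, "d": 0.5}
dist = {("a", "b"): 1.0, ("a", "c"): 1.0, ("a", "d"): 4.0,
        ("b", "c"): 0.5, ("b", "d"): 3.5, ("c", "d"): 4.5}
C = sum(dist.values())  # crude non-negativity constant

def f(S):
    """Submodular objective: constant + node weights - pairwise distances."""
    w = sum(weights[v] for v in S)
    d = sum(dist[tuple(sorted(p))] for p in combinations(S, 2))
    return C + w - d

# Randomized double greedy (Buchbinder et al.): scan the elements once,
# maintaining a lower set X and an upper set Y, and decide each element
# randomly in proportion to its two marginal gains.
X, Y = set(), set(weights)
for v in sorted(weights):
    a = max(f(X | {v}) - f(X), 0.0)   # gain of adding v to X
    b = max(f(Y - {v}) - f(Y), 0.0)   # gain of removing v from Y
    if a + b == 0 or random.random() < a / (a + b):
        X = X | {v}
    else:
        Y = Y - {v}
print(sorted(X), f(X))
```

Each accepted step never decreases f, so a single run is always at least as good as the empty solution, and in expectation it is within a factor 1/2 of the optimum.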