By Geraud Le Falher, Michael Mathioudakis, and Aris Gionis
Our friend Oliver lives in London, where he works as a consultant for a big financial company. Occasionally, he takes a business trip to another major city to seal a major deal and make major bucks for his boss. Of course, Oliver being Oliver, he always finds the time to enjoy whatever city he happens to be in. When in London, he likes to suit up and hang out in Soho, a “predominantly fashionable district of upmarket restaurants.” He would like to do the same in Rome, where he is flying next week, but he doesn’t know much about that city. Where is the Soho of Rome? What neighborhood of Rome is most similar to Soho?
To answer questions like Oliver’s, we analyze data from Foursquare, a popular location-based social network which, at the time of writing, claims more than 50 million users. Foursquare enables users to share their current location with friends, rate and comment on the venues they visit (places such as restaurants, hotels, cafeterias, bookshops, and museums), and read reviews that other users have left.
Foursquare users share their activity by generating “check-ins” with Swarm, the service’s mobile application for location-sharing. Each check-in is associated with a web page that contains information about the user and the venue they visit. Each venue is also associated with a public web page on Foursquare that contains information about the venue and aggregate statistics from user check-ins. For example, you can click here to see where one of the authors of this post checked in on Sunday and here for the public web page of that venue.
For our study, we compiled a dataset of about 5 million check-ins and 340 thousand venues listed on Foursquare, from 20 cities in North America and Europe. The North American city with the most data in our dataset is New York, with 1.1 million check-ins and 68 thousand venues, while the European city with the most data is Moscow, with 426 thousand check-ins and 48 thousand venues. Helsinki contributes 43 thousand check-ins and 3 thousand venues.
We take neighborhoods to be areas of closed shape on a city’s geographic map. Each area is associated with the set of venues within its boundaries. In turn, each venue is associated with a category (e.g., restaurant, museum, or cafeteria), a geographic location, and a number of features that describe user activity at that venue, such as the number of users who have checked in at that venue, its average rating, and the times when it is open and when it is busiest.
These features allow us to distinguish different types of venues, for example venues that are busy early in the morning from ones that are busy late at night. The clearest case is time of day: venues cluster well according to the time distribution of their check-ins (see figure below). Furthermore, since each neighborhood is directly associated with a set of venues, we can use the same features to describe neighborhoods and measure their similarity. To revisit Oliver’s question: given the set of venues in Soho and their features, we wish to identify a neighborhood in Rome that contains venues with similar features.
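To make the idea of clustering venues by check-in times concrete, here is a minimal sketch. It is an illustration under our own assumptions, not the code used in the study: the venue data is made up, the helper names (`hourly_profile`, `kmeans`) are hypothetical, and plain k-means is simply a convenient choice, since the post does not say which clustering method was used.

```python
from collections import Counter


def hourly_profile(checkin_hours):
    """Turn a list of check-in hours (0-23) into a 24-bin distribution that sums to 1."""
    if not checkin_hours:
        raise ValueError("need at least one check-in")
    counts = Counter(checkin_hours)
    total = len(checkin_hours)
    return [counts.get(hour, 0) / total for hour in range(24)]


def squared_distance(p, q):
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(profiles, initial_centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    centroids = [list(c) for c in initial_centroids]  # copy, so inputs stay untouched
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(profiles, labels) if label == k]
            if members:
                centroids[k] = [sum(column) / len(members) for column in zip(*members)]
    return labels


# Made-up check-in hours for four hypothetical venues.
cafe = hourly_profile([7, 8, 8, 9, 9, 10])
bakery = hourly_profile([6, 7, 7, 8, 9])
bar = hourly_profile([21, 22, 22, 23, 23, 0])
club = hourly_profile([22, 23, 23, 0, 1])

labels = kmeans([cafe, bakery, bar, club], initial_centroids=[cafe, bar])
print(labels)  # [0, 0, 1, 1]: morning venues in one cluster, night venues in the other
```

Seeding the centroids with one morning venue and one night venue keeps the toy example deterministic; on real data one would normally use several random initializations.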
To make our search faster, we limited it to circular neighborhoods of a predefined radius. Moreover, we experimented with various similarity measures, which we evaluated against manually collected ground-truth data. The measure that performed best is based on the Earth mover’s distance.
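The two ingredients mentioned above, circular neighborhoods and the Earth mover’s distance, can be sketched as follows. This is a deliberately simplified illustration, not the measure from the study: it compares neighborhoods through a single one-dimensional histogram (for instance, the share of venues falling into a few rating bins), whereas the actual work compares sets of venues described by several features. The function names, coordinates, and histograms are all hypothetical. For histograms over the same equally spaced bins, the Earth mover’s distance reduces to the sum of absolute differences between the two cumulative distributions.

```python
import math
from itertools import accumulate

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def venues_in_circle(venues, center, radius_km):
    """Return the venues that lie within radius_km of center, a (lat, lon) pair."""
    lat0, lon0 = center
    return [v for v in venues if haversine_km(lat0, lon0, v["lat"], v["lon"]) <= radius_km]


def emd_1d(p, q):
    """Earth mover's distance between two histograms over the same bins.

    Both histograms are assumed to sum to 1 and to have unit distance
    between adjacent bins.
    """
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Made-up venues around a hypothetical neighborhood center.
venues = [
    {"name": "near", "lat": 51.5140, "lon": -0.1360},
    {"name": "far", "lat": 51.5500, "lon": -0.1365},
]
inside = venues_in_circle(venues, center=(51.5136, -0.1365), radius_km=0.5)
print([v["name"] for v in inside])  # ['near']

# Made-up histograms: share of venues in three rating bins (low, medium, high).
query = [0.1, 0.2, 0.7]
candidates = {"A": [0.1, 0.3, 0.6], "B": [0.7, 0.2, 0.1]}
best = min(candidates, key=lambda name: emd_1d(query, candidates[name]))
print(best)  # A
```

Unlike a bin-by-bin comparison, the Earth mover’s distance takes into account how far mass has to move: candidate B, whose venues sit mostly in the lowest bin, ends up much farther from the query than candidate A, which differs only between adjacent bins.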
In the figures below, you can see one example of matching neighborhoods we find using the best-performing similarity measure.
For more details on this work, you can check Geraud’s master’s thesis. Finally, Oliver, if you’re reading this, our results say that the Soho of Rome is Trastevere.
- Swarm, Foursquare’s check-in mobile application, appeared after we completed our study. Until then, users checked in on Foursquare using the service’s standard mobile application.
- Note that Foursquare check-ins are not publicly available by default, unless the user who generated them publicly shares their associated URLs. Our dataset consists of Foursquare check-ins that were shared publicly on Twitter, as well as Foursquare check-ins made available by Cheng et al. (pdf from aaai.org).