Hadoop @ foursquare

Talk given by Blake Shaw (data scientist) and Joe Crobak (engineer) from foursquare.

foursquare has 20M+ users and records about 1,500 actions per second.

Blake talked about mining signals from check-ins and building a social recommendation engine on top of them (called Explore).

He talked about the notion of a place: a place is not only a physical entity but also somewhere people meet and interact within a certain timeframe.

Blake showed a map created from millions of check-ins at one location: a map of Central Park drawn entirely from check-in data!

They collect, analyze, and report on time signatures for places, i.e., how check-in activity at a venue varies over the hours of the day and days of the week.
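To make the "time signature" idea concrete, here is a minimal sketch (my own illustration, not foursquare's code) that buckets check-ins into per-venue hour-of-week counts; the Checkin model and the choice of UTC buckets are assumptions.

```scala
import java.time.{Instant, ZoneOffset}

// Hypothetical, minimal model of a check-in: just a venue id and a timestamp.
case class Checkin(venueId: String, time: Instant)

object TimeSignature {
  // Bucket a check-in into one of 168 hour-of-week slots (0 = Monday 00:00 UTC).
  def hourOfWeek(t: Instant): Int = {
    val dt = t.atZone(ZoneOffset.UTC)
    (dt.getDayOfWeek.getValue - 1) * 24 + dt.getHour
  }

  // "Time signature" of each venue: check-in counts per hour-of-week bucket.
  def signature(checkins: Seq[Checkin]): Map[String, Map[Int, Int]] =
    checkins
      .groupBy(_.venueId)
      .map { case (venue, cs) =>
        venue -> cs.groupBy(c => hourOfWeek(c.time)).map { case (h, xs) => h -> xs.size }
      }
}
```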

By mining the check-in data, they can discover correlations between events and sales. For example, hot weather is correlated with a high number of check-ins near ice cream shops, so they can recommend places to go based on weather patterns.
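A rough sketch of the kind of correlation involved, assuming we already have two aligned daily series (temperature and check-in counts near ice cream shops); this is plain Pearson correlation, not foursquare's actual pipeline.

```scala
object WeatherCorrelation {
  // Pearson correlation between two equal-length series, e.g. daily high
  // temperature and daily check-in counts at ice cream shops (both hypothetical).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty)
    val n  = xs.size.toDouble
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }
}
```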

They also analyze sentiment data, which they call happiness on foursquare. They analyze the comments people leave with their check-ins and detect negative or positive sentiment.
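The talk did not describe how the happiness score is actually computed; as a purely illustrative sketch, a minimal lexicon-based scorer might look like this (the word lists are invented).

```scala
object Happiness {
  // Tiny, hypothetical sentiment lexicons; the real system's word lists and
  // scoring method were not described in the talk.
  private val positive = Set("great", "awesome", "delicious", "love", "amazing")
  private val negative = Set("bad", "terrible", "awful", "slow", "dirty")

  // Score a check-in comment: +1 per positive word, -1 per negative word.
  def score(comment: String): Int =
    comment.toLowerCase.split("\\W+").map {
      case w if positive(w) => 1
      case w if negative(w) => -1
      case _                => 0
    }.sum
}
```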

Blake uses algorithms for finding similar items, which are critical for their recommendation engine.

Large-scale sparse k-nearest neighbors is one of the main algorithms.

He is also computing venue similarity for recommendation purposes.
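As an illustration of the venue-similarity idea (not Blake's actual implementation), here is a brute-force sketch that represents each venue as a sparse vector of per-user check-in counts and ranks neighbors by cosine similarity.

```scala
object VenueSimilarity {
  // Sparse vector: userId -> number of check-ins that user made at the venue.
  type SparseVec = Map[String, Double]

  def dot(a: SparseVec, b: SparseVec): Double =
    a.foldLeft(0.0) { case (acc, (k, v)) => acc + v * b.getOrElse(k, 0.0) }

  def norm(a: SparseVec): Double = math.sqrt(dot(a, a))

  def cosine(a: SparseVec, b: SparseVec): Double = {
    val d = norm(a) * norm(b)
    if (d == 0.0) 0.0 else dot(a, b) / d
  }

  // Brute-force k-nearest neighbors by cosine similarity; at foursquare's scale
  // this would presumably be an approximate or distributed variant.
  def kNearest(query: SparseVec, venues: Map[String, SparseVec], k: Int): Seq[(String, Double)] =
    venues.toSeq
      .map { case (id, vec) => id -> cosine(query, vec) }
      .sortBy(-_._2)
      .take(k)
}
```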

Blake also showed an excellent visualization of historical check-ins on a world map at different times, which he created in MATLAB.

Joe Crobak talked about the Hadoop infrastructure in detail.

The infrastructure handles ~1.5B log events per week.

Infrastructure components:

Cloudera CDH3u3
12-node Hadoop cluster on EC2
Hive
Pig
Solr
MongoDB
Oozie
Scala
Google Spreadsheets – automated reporting
Scoobi – Hadoop MapReduce connector for Scala

They started with Elastic MapReduce (EMR) but hit a few limitations and moved to their own Hadoop cluster on AWS. One limitation was that people run Hive queries very frequently, so they needed a Hadoop cluster that is running all the time.

They use Hive extensively for analytics and reporting purposes.

A few comments on Oozie: the coordinator is neat, since it defines workflows and the dependencies between jobs. The Oozie XML is bad, though: you have to write a lot of it just to express a simple workflow.

foursquare has been a Scala shop, and they are able to reuse that Scala code for MapReduce via the Scoobi connector (a small example follows).
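For flavor, this is roughly what a Scoobi job looks like, based on Scoobi's published word-count example; the exact method names (flatMap vs. mapFlatten, the persist forms) differ between Scoobi versions, so treat it as a sketch rather than something tied to foursquare's cluster.

```scala
import com.nicta.scoobi.Scoobi._

// Sketch of Scoobi's canonical word-count example; method names may differ
// slightly between Scoobi releases.
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines.flatMap(_.split(" "))
           .map(word => (word, 1))
           .groupByKey
           .combine(_ + _)

    persist(toTextFile(counts, args(1)))
  }
}
```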

They converted some Ruby streaming jobs to Pig plus Scala UDFs.
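A hypothetical example of the Pig + Scala UDF pattern (not one of their actual UDFs): Pig UDFs extend EvalFunc and implement exec(Tuple), and since Scala compiles to JVM bytecode the class can be registered from Pig like any Java UDF.

```scala
import org.apache.pig.EvalFunc
import org.apache.pig.data.Tuple

// Hypothetical UDF: normalize a venue name to a trimmed, lower-case string.
class NormalizeVenueName extends EvalFunc[String] {
  override def exec(input: Tuple): String = {
    if (input == null || input.size() == 0) return null
    val raw = input.get(0)
    if (raw == null) null else raw.toString.trim.toLowerCase
  }
}
```

From Pig, the jar would be REGISTERed and the class referenced (e.g. via DEFINE) before being used inside a FOREACH ... GENERATE.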

They keep the production reference data in MongoDB, which they have been using since the beginning of the project.

MongoDB uses BSON, a binary, schema-less representation of JSON. They wrote code to convert the raw BSON so it can be read through a Hive SerDe and through Scoobi for Scala.

Here is the pipeline:
AWS EBS → BSON → BSON objects → Thrift (Scala codegen) → an input format for Thrift objects for use in MR → Hive SerDe and Scoobi for Scala
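A simplified, hypothetical version of the BSON-decoding stage (the Venue struct and its field names are made up, and the real code targets Thrift-generated classes rather than a plain case class):

```scala
import org.bson.{BSONObject, BasicBSONDecoder}

// Stand-in for a Thrift-generated Scala class; field names are hypothetical.
case class Venue(id: String, name: String, lat: Double, lng: Double)

object BsonToThrift {
  private val decoder = new BasicBSONDecoder

  // Decode one raw BSON record (as dumped from MongoDB via EBS snapshots)
  // into the typed struct used downstream by MR, Hive, and Scoobi.
  def decode(bytes: Array[Byte]): Venue = {
    val doc: BSONObject = decoder.readObject(bytes)
    Venue(
      id   = doc.get("_id").toString,
      name = doc.get("name").toString,
      lat  = doc.get("lat").toString.toDouble,
      lng  = doc.get("lng").toString.toDouble
    )
  }
}
```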

Data joins in MapReduce are cumbersome.
Their solution: do all the joins at once, up front.

They join the internal data tables (venues, check-ins, tips, and likes) into one denormalized table and reuse it across multiple MR jobs.
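In plain Scala collections, the "join once, reuse everywhere" idea looks roughly like this; the field names are hypothetical, only a subset of the tables is shown, and in production this runs as a MapReduce job rather than in memory.

```scala
// Hypothetical row types for a subset of the joined tables.
case class VenueRow(venueId: String, name: String)
case class CheckinRow(venueId: String, userId: String)
case class TipRow(venueId: String, text: String)

// One denormalized record per venue, consumed by many downstream jobs.
case class VenueDenormalized(venueId: String, name: String,
                             checkins: Seq[CheckinRow], tips: Seq[TipRow])

object JoinOnce {
  def join(venues: Seq[VenueRow],
           checkins: Seq[CheckinRow],
           tips: Seq[TipRow]): Seq[VenueDenormalized] = {
    val checkinsByVenue = checkins.groupBy(_.venueId)
    val tipsByVenue     = tips.groupBy(_.venueId)
    venues.map { v =>
      VenueDenormalized(v.venueId, v.name,
        checkinsByVenue.getOrElse(v.venueId, Nil),
        tipsByVenue.getOrElse(v.venueId, Nil))
    }
  }
}
```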

Future plans:

HCatalog – makes Hive tables available to Pig and MapReduce jobs.

Indexing / Hive indexing.

Replacing Google Spreadsheets to provide a much richer user interface.

Flume upgrade to Flume NG.

(All content here is my own summary of the talk, not endorsed or approved by any other party.)
