Building and Improving Products with Hadoop

The terms "big data" and "Hadoop" are often reserved for conversations about business analytics. I posit instead that these technologies are most powerful when they are deployed to both build new products and improve existing ones. Measurement is a fundamental part of the process, but more importantly I will walk through an effective tool chain that can be used to: a) build unique new products based on data; b) test improvements to a product.

At Foursquare, we've used a Hadoop-based tool chain to build new products (like social recommendations) and to improve existing features through initiatives such as experimentation and offline data generation. These products and improvements are fundamental to our core business, yet they would not be possible without Hadoop. I will pull examples from Foursquare and other companies to demonstrate these points, and outline the infrastructure components needed to accomplish them.

Published in: Technology, Sports
  • Friend in finance: spent 2 years building a management platform. Scrapped the project. Fund manager hired a kid to build Excel macros. Right tool for the job.
  • Great for analytics. Great for your products too.
  • TF-IDF: counting globally, counting locally.
  • Use lots of data sources without fear. Each MR step outputs data to HDFS that can be used in other workflows. This makes the workflow naturally modular, and it is easy to test isolated parts of the workflow.
  • Once you’ve solved the MR -> datastore problem once, you’ve solved it for good.
  • Every task has requirements: other tasks, and directories with _SUCCESS flags. Run on cron.
  • Hadoop-friendly datastore: we built our own (HFile Service). It is immutable, downloads data from S3, and reads everything into memory (but doesn't need to). We create X shards using MapReduce and swap these into X servers, which memory-map the files.
  • When in production, Hadoop lets you iterate quickly. Right now, it slows you down. Still work offline, but do it without any of the important components I just told you about.
  • Build an MVP in a spreadsheet, webview, whatever. Even if you deploy it, you can manually load data into a DB to start with. If you’re testing a v1 for a limited subset (employees), you probably don’t have much data anyway.
  • This didn’t need any of the key infrastructure components
  • This needed database dumps. Ran on a cron. Loaded manually.
  • Needed database dumps. Run with our dependency management engine. Loads to our production datastore.
Transcript of "Building and Improving Products with Hadoop"

    1. Building and Improving Products with Hadoop. Matthew Rathbone, 2013.
    2. What is Foursquare
         Foursquare helps you explore the world around you. Meet up with friends, discover new places, and save money using your phone.
         • 4 billion check-ins
         • 35 million users
         • 50 million POI
         • 150 employees
         • 1 TB+ of data a day
    3. FIRST, A STORY (photo: http://www.flickr.com/photos/shannonpatrick17)
    4. The Right Tool for the Job
         • Nginx – Serving static files
         • Perl – Regular expressions
         • XML – Frustrating people
         • Hadoop (MapReduce) – Counting
    5. COUNTING – WHAT IS IT GOOD FOR (photo: http://www.flickr.com/photos/blaahhi/)
    11. Statistically Improbable Phrases
    12. SIPS use cases
         • menu extraction
         • sentiment analysis
         • venue ratings
         • specific recommendations
         • search indexing
         • pricing data
         • facility information
    13. How is SIPS built? Basically lots of counting.
    14. SIPS
         • Tokenize data with a language model (into N-grams)
         • Built using tips, shouts, menu items, likes, etc.
         • Apply a TF-IDF algorithm (term frequency, inverse document frequency)
         • Global phrase count
         • Local phrase count (in a venue)
         • Some filtering and ranking
         • Re-compute & deploy nightly
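
The counting steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not Foursquare's pipeline: whitespace splitting stands in for the language model, the "global" count is taken to be the number of venues a phrase appears in, and the names `ngrams` and `improbable_phrases` are hypothetical.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Split text into lowercase word n-grams of length 1..max_n.
    (Assumption: whitespace tokenization stands in for a real language model.)"""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def improbable_phrases(tips_by_venue, top_k=3):
    """Rank each venue's phrases by local count times log inverse venue frequency."""
    # Local phrase count: how often each phrase occurs in one venue's tips.
    local = {venue: Counter(p for tip in tips for p in ngrams(tip))
             for venue, tips in tips_by_venue.items()}
    # Global phrase count (assumed here to be the number of venues using the phrase).
    venue_freq = Counter(p for counts in local.values() for p in counts)
    n_venues = len(tips_by_venue)
    ranked = {}
    for venue, counts in local.items():
        scores = {p: c * math.log(n_venues / venue_freq[p])
                  for p, c in counts.items()}
        # Highest score first; ties are broken alphabetically for stable output.
        ranked[venue] = sorted(scores, key=lambda p: (-scores[p], p))[:top_k]
    return ranked
```

A phrase that shows up at every venue scores zero, while a phrase concentrated in one venue rises to the top. In a MapReduce setting the two `Counter` passes would be the counting jobs.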
    15. WHY USE HADOOP? (photo: http://www.flickr.com/photos/dbrekke/)
    16. SIPS – Without Hadoop: Potential Problems
         • Database query throttling
         • Venues are out of sync
         • Altering the algorithm could take forever to populate for all venues
         • Where would you store the results?
         • What about debug data?
         • Does it scale to 10x, 100x?
         • What about other, similar workflows?
    17. SIPS – Hadoop Benefits
         • Quick deployment
         • Modular & reusable
         • Arbitrarily complex combination of many datasets
         • Every step of the workflow creates value
    18. Apple Store - Downtown San Francisco
         • 1 tip mentions "haircuts"
         • Search for "haircuts" in "san francisco" → Apple store???
         • Fixed by looking at % of tips and overall frequency
         • The tip: “Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac”
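
One possible reading of the fix on this slide is a filter that keeps a phrase for a venue only when enough of that venue's tips mention it. The sketch below assumes that reading; the function name `phrase_is_relevant` and both thresholds are hypothetical and not taken from the talk.

```python
def phrase_is_relevant(tips, phrase, min_share=0.05, min_mentions=2):
    """Keep a phrase for a venue only if enough of the venue's tips mention it.

    min_share and min_mentions are illustrative thresholds, not values from the talk.
    """
    # Count the tips that contain the phrase (case-insensitive substring match).
    mentions = sum(phrase in tip.lower() for tip in tips)
    share = mentions / len(tips) if tips else 0.0
    return mentions >= min_mentions and share >= min_share
```

With a filter like this, a single stray tip is not enough to attach "haircuts" to a venue with hundreds of unrelated tips, while a barber shop where most tips mention haircuts still qualifies.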
    19. Data & Modularity
    23. ACTUALLY, IT’S A BIT MORE COMPLICATED (photo: http://www.flickr.com/photos/bfishadow)
    24. These benefits require infrastructure
    25. Dependency Management: many options
         • Oozie (Apache)
         • Azkaban (LinkedIn)
         • Luigi (Spotify, we <3 this)
         • Hamake (Codeminders)
         • Chronos (Airbnb)
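
The speaker notes describe task requirements as other tasks and directories with _SUCCESS flags, run on cron. A minimal sketch of that check, independent of any of the tools listed on this slide, might look like the following; `inputs_ready` and `run_if_ready` are hypothetical helper names, and local paths stand in for HDFS directories.

```python
from pathlib import Path

def inputs_ready(input_dirs):
    """True when every input directory contains a _SUCCESS marker file."""
    return all((Path(d) / "_SUCCESS").is_file() for d in input_dirs)

def run_if_ready(task, input_dirs):
    """Run task() only when all inputs are complete; report whether it ran.

    Intended to be called repeatedly (for example from cron) until it succeeds.
    """
    if not inputs_ready(input_dirs):
        return False
    task()
    return True
```

A dependency manager adds scheduling, retries, and a task graph on top of this kind of check, which is why the slide treats it as infrastructure rather than a script.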
    27. Database / Log Ingestion
         • Sqoop
         • Mongo-Hadoop
         • Kafka
         • Flume
         • Scribe
         • etc.
    29. MapReduce Friendly Datastore: a few obvious ones
         • HBase
         • Cassandra
         • Voldemort
         We built our own; it’s very similar to Voldemort and uses the HFile API.
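
The speaker notes say the in-house datastore is built by creating X shards with MapReduce and swapping them into X servers. The talk does not say how keys are assigned to shards, so the sketch below assumes plain hash partitioning with a checksum that is stable across processes; `shard_for` and `build_shards` are hypothetical names, and in-memory dicts stand in for the HFile-backed shards.

```python
import zlib

def shard_for(key, num_shards):
    """Map a key to a shard index using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def build_shards(records, num_shards):
    """Group (key, value) pairs into num_shards dictionaries, one per server."""
    shards = [{} for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)][key] = value
    return shards
```

Because the batch job and the serving layer use the same function, a lookup only has to ask the one server that holds `shard_for(key, num_shards)`.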
    31. Getting started without all that stuff
    32. Components you likely don’t have
    33. The best way to start: Don’t use Hadoop.* (*but pretend you do)
    34. Other reasons to not use Hadoop
         • Your idea might not be very good
         • Hadoop will slow you down to start with
         • You don’t have enough infrastructure yet
         • Build it when you need it
         • V1 might not be that complex
         • V1 could be a spreadsheet
    37. SIPS Version 1
         • Off-the-shelf language model
         • A subset of venues & tips
         • Did not use MapReduce
         • Did not push to production at all
    38. SIPS Version 2
         • Started building our own language model
         • Rewritten as a MapReduce job
         • Manually loaded data to production
         • Filters for English data only. Tweak, improve, etc.
    39. SIPS Version 3
         • Incorporated more data sources into our language model
         • Deployment to KV store (auto)
         • Incorporated lots of debug output
         • Language pipeline also feeds sentiment analysis
         Now we’re in the perfect place to iterate & improve
    40. …to explore data
    41. In Summary
         • Hadoop is good for counting, so use it for counting
         • Move quickly whenever possible and don’t worry about automation
         • Bring in new production services as you need them
         • Freedom!
    42. Thanks! matthew@foursquare.com, @rathboma. Bonus: http://hadoopweekly.com from my colleague, Joe Crobak (presenting later!)