Hadoop and Pig at Twitter (OSCON 2010)

A look into how Twitter uses Hadoop and Pig to build products and do analytics.

Transcript

  • 1. Hadoop and Pig @Twitter Kevin Weil -- @kevinweil Analytics Lead, Twitter Friday, July 23, 2010
  • 2. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs of data Friday, July 23, 2010
  • 4. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 5. Data is Getting Big ‣ NYSE: 1 TB/day ‣ Facebook: 20+ TB compressed/day ‣ CERN/LHC: 40 TB/day (15 PB/year) ‣ And growth is accelerating ‣ Need multiple machines, horizontal scalability Friday, July 23, 2010
  • 6. Hadoop ‣ Distributed file system (hard to store a PB) ‣ Fault-tolerant, handles replication, node failure, etc ‣ MapReduce-based parallel computation (even harder to process a PB) ‣ Generic key-value based computation interface allows for wide applicability Friday, July 23, 2010
  • 7. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000-node cluster ‣ Powerful: sorted a TB of random integers in 62 seconds ‣ Easy Packaging: Cloudera RPMs, DEBs Friday, July 23, 2010
  • 8. MapReduce Workflow [diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs] ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster Friday, July 23, 2010
  • 9-14. MapReduce Workflow (slides 9 through 14 repeat the content of slide 8) Friday, July 23, 2010
  • 15. But... ‣ Analysis typically in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins are lengthy, error-prone ‣ Hard to manage n-stage jobs ‣ Exploration requires compilation! Friday, July 23, 2010
  • 16. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 17. Enter Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL? ‣ Top-level Apache project Friday, July 23, 2010
  • 18. Why Pig? ‣ Because I bet you can read the following script. Friday, July 23, 2010
  • 19. A Real Pig Script Friday, July 23, 2010
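The script on this slide was shown as an image and is not preserved in the transcript. As a stand-in only, here is a minimal Pig Latin sketch of the tweets-per-user count from the MapReduce workflow slides; the input path and schema are assumptions for illustration, not the script from the talk.

    -- Hedged sketch (not the original slide's script): count tweets per user.
    -- The path and schema below are assumed for illustration.
    tweets = LOAD '/tables/tweets' USING PigStorage('\t')
             AS (user_id:long, tweet_id:long, text:chararray, created_at:chararray);

    -- Group tweets by user; Pig compiles this into the map/shuffle/reduce
    -- steps described on the MapReduce workflow slide.
    by_user = GROUP tweets BY user_id;

    -- Emit one record per user: (user_id, tweet count).
    counts = FOREACH by_user GENERATE group AS user_id, COUNT(tweets) AS num_tweets;

    STORE counts INTO '/tmp/tweet_counts_per_user';

A handful of lines like these replace the mapper, reducer, and job-wiring code that the "vanilla MapReduce" version on the next slide requires.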
  • 20. Now, just for fun... ‣ The same calculation in vanilla MapReduce Friday, July 23, 2010
  • 21. No, seriously. Friday, July 23, 2010
  • 22. Pig Democratizes Large-scale Data Analysis ‣ The Pig version is: ‣ 5% of the code ‣ 5% of the development time ‣ Within 25% of the execution time ‣ Readable, reusable Friday, July 23, 2010
  • 23. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration Friday, July 23, 2010
  • 24. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 25. MySQL, MySQL, MySQL ‣ We all start there. ‣ But MySQL is not built for analysis. ‣ select count(*) from users? Maybe. ‣ select count(*) from tweets? Uh... ‣ Imagine joining them. ‣ And grouping. ‣ Then sorting. Friday, July 23, 2010
  • 26. Non-Pig Hadoop at Twitter ‣ Data Sink via Scribe ‣ Distributed Grep ‣ A few performance-critical, simple jobs ‣ People Search Friday, July 23, 2010
  • 27. People Search? ‣ First real product built with Hadoop ‣ “Find People” ‣ Old version: offline process on a single node ‣ New version: complex graph calculations, hit internal network services, custom indexing ‣ Faster, more reliable, more observable Friday, July 23, 2010
  • 28. People Search ‣ Import user data into HBase ‣ Periodic MapReduce job reading from HBase ‣ Hits FlockDB, other internal services in mapper ‣ Custom partitioning ‣ Data sucked across to sharded, replicated, horizontally scalable, in-memory, low-latency Scala service ‣ Build a trie, do case folding/normalization, suggestions, etc Friday, July 23, 2010
  • 29. Agenda ‣ Hadoop Overview ‣ Pig: Rapid Learning Over Big Data ‣ Data-Driven Products ‣ Hadoop/Pig and Analytics Friday, July 23, 2010
  • 30. Order of Operations ‣ Counting ‣ Correlating ‣ Research / Algorithmic Learning Friday, July 23, 2010
  • 31. Counting ‣ How many requests per day? ‣ What’s the average latency? 95% latency? ‣ What’s the response code distribution? ‣ How many searches per day? Unique users? ‣ What’s the geographic breakdown of requests? ‣ How many tweets? From what clients? ‣ How many signups? Profile completeness? ‣ How many SMS notifications did we send? Friday, July 23, 2010
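None of these queries appear in the transcript, but as a hedged example of the counting category, a requests-per-day count could look like the following Pig sketch; the log path, delimiter, and field names are assumptions.

    -- Hedged sketch: requests per day from a web log (assumed path and schema).
    logs = LOAD '/logs/web' USING PigStorage('\t')
           AS (timestamp:chararray, user_id:long, uri:chararray, status:int, latency_ms:int);

    -- Assume ISO-style timestamps such as '2010-07-23T18:04:12';
    -- the first 10 characters are the date.
    with_day = FOREACH logs GENERATE SUBSTRING(timestamp, 0, 10) AS day;

    by_day = GROUP with_day BY day;
    daily  = FOREACH by_day GENERATE group AS day, COUNT(with_day) AS num_requests;

    STORE daily INTO '/tmp/requests_per_day';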
  • 32. Correlating ‣ How does usage differ for mobile users? ‣ ... for desktop client users (Tweetdeck, etc)? ‣ Cohort analyses ‣ What services fail at the same time? ‣ What features get users hooked? ‣ What do successful users do often? ‣ How does tweet volume change over time? Friday, July 23, 2010
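Correlation questions like these typically become a join followed by a group. As a hedged sketch only (the tables, paths, and fields are assumptions), comparing tweet volume across clients and countries might look like this:

    -- Hedged sketch: how does usage differ by client and geography?
    tweets = LOAD '/tables/tweets' USING PigStorage('\t')
             AS (user_id:long, tweet_id:long, client:chararray);
    users  = LOAD '/tables/users' USING PigStorage('\t')
             AS (user_id:long, country:chararray, signup_date:chararray);

    -- Join the two tables on user_id, then group by (client, country).
    joined    = JOIN tweets BY user_id, users BY user_id;
    by_client = GROUP joined BY (tweets::client, users::country);

    usage = FOREACH by_client GENERATE
                FLATTEN(group) AS (client, country),
                COUNT(joined) AS num_tweets;

    STORE usage INTO '/tmp/usage_by_client_and_country';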
  • 33. Research ‣ What can we infer from a user’s tweets? ‣ ... from the tweets of their followers? followees? ‣ What features tend to get a tweet retweeted? ‣ ... and what influences the retweet tree depth? ‣ Duplicate detection, language detection ‣ What graph structures lead to increased usage? ‣ Sentiment analysis, entity extraction ‣ User reputation Friday, July 23, 2010
  • 34. If We Had More Time... ‣ HBase ‣ LZO compression and Hadoop ‣ Protocol buffers ‣ Our open source: hadoop-lzo, elephant-bird ‣ Analytics and Cassandra Friday, July 23, 2010
  • 35. Questions? Follow me at twitter.com/kevinweil Friday, July 23, 2010
