Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
Published in: Technology

Transcript

  • 1. Hadoop, Pig, HBase at Twitter
      • Dmitriy Ryaboy
      • Twitter Analytics
      • @squarecog
  • 2. Who is this guy, anyway
      • LBNL: Genome alignment & analysis
      • Ask.com: Click log data warehousing
      • CMU: MS in "Very Large Information Systems"
      • Cloudera: graduate student intern
      • Twitter: Hadoop, Pig, Big Data, ...
      • Pig committer.
  • 3. In This Talk
      • Focus on Hadoop parts of data pipeline
      • Data movement
      • HBase
      • Pig
      • A few tips
  • 4. Not In This Talk
      • Cassandra
      • FlockDB
      • Gizzard
      • Memcached
      • Rest of Twitter's NoSQL Bingo card
  • 5. Daily workload
      • 1000s of Front End machines
      • 3 Billion API requests
      • 7 TB of ingested data
      • 20,000 Hadoop jobs
      • 55 Million tweets
      • Tweets are only 0.5% of the data
  • 6. Twitter data pipeline (simplified)
      • Front Ends update DB cluster. Scheduled DB exports to HDFS
      • Front Ends, Middleware, Backend services write logs
      • Scribe pipes logs straight into HDFS
      • Various other data source exports into HDFS
      • Daemons populate work queues as new data shows up
      • Daemons (and cron) pull work off queues, schedule MR and Pig jobs
      • Pig wrapper pushes results into MySQL for reports and dashboards
  • 7. Logs
      • Apache HTTP, W3C, JSON and Protocol Buffers
      • Each category goes into its own directory on HDFS
      • Everything is LZO compressed.
      • You need to index LZO files to make them splittable.
      • We use a patched version of Hadoop LZO libraries
      • See http://github.com/kevinweil/hadoop-lzo
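Why an index makes a compressed file splittable can be sketched in a few lines. The idea, as an assumption about the general technique rather than code taken from hadoop-lzo, is that the index lists the byte offset where each compressed block starts, and a reader snaps an arbitrary byte-range split to those boundaries so that every block is read by exactly one task. The function name `align_split` and its parameters are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def align_split(block_offsets: List[int], file_length: int,
                start: int, end: int) -> Optional[Tuple[int, int]]:
    """Snap the byte range [start, end) to compressed-block boundaries.

    block_offsets is the sorted list of offsets at which blocks begin
    (the content of a hypothetical index file). A split owns every block
    that starts inside it. Returns None when no block starts in the range.
    """
    first = bisect_left(block_offsets, start)
    if first == len(block_offsets) or block_offsets[first] >= end:
        return None  # nothing to read: the range falls inside one block
    last = bisect_left(block_offsets, end)
    # Read up to the start of the first block owned by the next split,
    # or to the end of the file if this split owns the final block.
    aligned_end = block_offsets[last] if last < len(block_offsets) else file_length
    return block_offsets[first], aligned_end
```

Without the index a reader cannot tell where a block begins, so the whole file has to go to a single task.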
  • 8. Tables
      • Users, tweets, geotags, trends, registered devices, etc.
      • Automatic generation of protocol buffer definitions from SQL tables
      • Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers
      • See Elephant-Bird: http://github.com/kevinweil/elephant-bird
  • 9. ETL
      • "Crane", config driven, protocol buffer powered.
      • Sources/Sinks: HDFS, HBase, MySQL tables, web services
      • Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
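The chaining of <input proto, output proto, transformation class> triples can be illustrated with a small sketch. Crane itself is not shown in the slides, so everything here is hypothetical: dataclasses stand in for protocol buffer messages, and `Step`, `validate_chain`, `run_chain` and the three record types are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


# Stand-ins for protocol buffer message types (assumption).
@dataclass
class RawTweet:
    user_id: str
    text: str


@dataclass
class CleanTweet:
    user_id: int
    text: str


@dataclass
class TweetLength:
    user_id: int
    length: int


@dataclass
class Step:
    """One <input type, output type, transformation> triple."""
    input_type: type
    output_type: type
    transform: Callable[[Any], Any]


def validate_chain(steps: List[Step]) -> None:
    """Check that each step's output type is the next step's input type."""
    for previous, following in zip(steps, steps[1:]):
        if previous.output_type is not following.input_type:
            raise TypeError(
                f"{previous.output_type.__name__} does not feed "
                f"{following.input_type.__name__}")


def run_chain(steps: List[Step], record: Any) -> Any:
    """Validate the chain, then push one record through every step."""
    validate_chain(steps)
    for step in steps:
        if not isinstance(record, step.input_type):
            raise TypeError(f"expected {step.input_type.__name__}")
        record = step.transform(record)
    return record


PIPELINE = [
    Step(RawTweet, CleanTweet,
         lambda t: CleanTweet(int(t.user_id), t.text.strip())),
    Step(CleanTweet, TweetLength,
         lambda t: TweetLength(t.user_id, len(t.text))),
]
```

Declaring the types per step lets a config-driven tool reject a mis-ordered chain before any data moves.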
  • 10. HBase
  • 11. Mutability
      • Logs are immutable; HDFS is great.
      • Tables have mutable data.
      • Ignore updates? Bad data.
      • Pull updates, resolve at read time? Pain, time.
      • Pull updates, resolve in batches? Pain, time.
      • Let someone else do the resolving? Helloooo, HBase!
      • Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.
      • Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
      • That being said, several services rely on HBase already.
  • 12. Aren't you guys Cassandra poster boys?
      • YES, but
      • Rough analogy: Cassandra is OLTP and HBase is OLAP
      • Cassandra is used when we need low-latency, single-key reads and writes
      • HBase scans are much more powerful
      • HBase co-locates data on the Hadoop cluster.
  • 13. HBase schema for MySQL exports, v1.
      • Want to query by created_at range, by updated_at range, and / or by user_id.
      • Key: [created_at, id]
      • CF: "columns"
      • Configs specify which columns to pull out and store explicitly.
      • Useful for indexing, cheap (HBase-side) filtering
      • CF: "protobuf"
      • A single column, contains serialized protocol buffer.
  • 14. HBase schema v1, cont.
      • Pro: easy to query by created_at range
      • Con: hard to pull out specific users (requires a full scan)
      • Con: hot spot at the last region for writes
      • Idea: put created_at into 'columns' CF, make user_id key
      • BUT ids mostly sequential; still a hot spot at the end of the table
      • Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.
  • 15. HBase schema, v2.
      • Key: inverted id. Bottom bits are random. Ahh, finally, distribution.
      • Date range queries: new CF, 'time'
      • Keep all versions of this CF
      • When specific time range needed, use index on the time column
      • Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
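A minimal sketch of the inverted-id key, assuming "inverted" means reversing the bit order of an unsigned 64-bit id (the slides do not spell this out). Reversal moves the random low bits to the front of the key, so ids created at nearly the same time no longer sort next to each other. The names `reverse_bits` and `row_key` are hypothetical.

```python
import struct

ID_BITS = 64  # assumption: ids fit in an unsigned 64-bit integer


def reverse_bits(value: int, width: int = ID_BITS) -> int:
    """Return value with its lowest `width` bits in reversed order."""
    if not 0 <= value < (1 << width):
        raise ValueError("id out of range")
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)  # append the current low bit
        value >>= 1
    return result


def row_key(record_id: int) -> bytes:
    """Big-endian bytes of the bit-reversed id.

    HBase orders row keys bytewise, so the first byte decides which part
    of the key space (and hence which region) a write lands in.
    """
    return struct.pack(">Q", reverse_bits(record_id))
```

For example, ids 2 and 3 differ only in their lowest bit, yet their keys start with bytes 0x40 and 0xC0, which places them far apart in the table. The price is that range scans by id are no longer meaningful, which is why the date-range queries go through the separate 'time' column family.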
  • 16. Pig
  • 17. Why Pig?
      • Much faster to write than vanilla MR
      • Step-by-step iterative expression of data flows is intuitive to programmers
      • SQL support coming for those who prefer SQL (PIG-824)
      • Trivial to write UDFs
      • Easy to write Loaders (Even better with 0.7!)
      • For example, we can write Protobuf and HBase loaders...
      • Both in Elephant-Bird
  • 18. HBase Loader enhancements
      • Data expected to be binary, not String representations
      • Push down key range filters
      • Specify row caching (memory / speed tradeoff)
      • Optionally load the key
      • Optionally limit rows per region
      • Report progress
      • Haven't observed significant overhead vs. HBase scanning
  • 19. HBase Loader TODOs
      • Expose better control of filters
      • Expose timestamp controls
      • Expose Index hints (IHBase)
      • Automated filter and projection push-down (once on 0.7)
      • HBase Storage
  • 20. Elephant Bird
      • Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers
      • Starting to work on same for Thrift
      • HBase Loader
      • Assorted UDFs
      • http://www.github.com/kevinweil/elephant-bird
  • 21. Assorted Tips
  • 22. Bad records kill jobs
      • Big data is messy.
      • Catch exceptions, increment counter, return null
      • Deal with potential nulls
      • Far preferable to a single bad record bringing down the whole job
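The catch, count, return-null pattern can be sketched as follows. This is an illustration in plain Python, not Pig or Hadoop code: a `collections.Counter` stands in for job counters, JSON stands in for the log format, and the function and counter names are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Optional


def safe_parse(line: str, counters: Counter) -> Optional[dict]:
    """Parse one JSON log line; count failures and return None instead of raising."""
    try:
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        counters["unparsable_records"] += 1
        return None
    counters["parsed_records"] += 1
    return record


def extract_user_ids(lines: Iterable[str], counters: Counter) -> Iterator[int]:
    """Downstream code deals with the potential nulls explicitly."""
    for line in lines:
        record = safe_parse(line, counters)
        if record is None:
            continue  # skip the bad record; the counter keeps it visible
        user_id = record.get("user_id")
        if user_id is not None:
            yield user_id
```

The counter matters as much as the `except` clause: it is what distinguishes a handful of skipped records from a silently broken input.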
  • 23. Runaway UDFs kill jobs
      • Regex over a few billion tweets, most return in milliseconds.
      • 8 cause the regex to take more than 5 minutes; the task gets reaped.
      • You clever twitterers, you.
      • MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time.
      • Plan to contribute to Pig, add to ElephantBird. May build into Pig internals.
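The timeout-with-default idea behind MonitoredUDF can be approximated in a few lines. This is not the MonitoredUDF implementation, only a sketch of the same pattern using Python's `concurrent.futures`; `run_with_timeout` and its parameters are hypothetical. One important difference: Python cannot kill a running thread, so the overrunning call is abandoned rather than stopped.

```python
import concurrent.futures
from typing import Any, Callable, Tuple


def run_with_timeout(func: Callable[..., Any], args: Tuple = (),
                     timeout_s: float = 1.0, default: Any = None) -> Any:
    """Run func(*args) in a worker thread; return default if it overruns.

    The worker thread is not killed on timeout. It keeps running in the
    background until func returns, so this guards the caller's latency,
    not the CPU time spent.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return default
    finally:
        # Do not block waiting for a possibly runaway call.
        pool.shutdown(wait=False)
```

A per-call thread pool is wasteful at billions of records; a real wrapper would reuse workers and increment a counter on each timeout.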
  • 24. Use Counters
      • Use counters. Count everything.
      • UDF invocations, parsed records, unparsable records, timed-out UDFs...
      • Hook into cleanup phases and store counters to disk, next to data, for future analysis
      • Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible.
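Storing counters next to the data might look like the sketch below, which writes them as JSON into the job's output directory. The file name `_counters.json` and both function names are hypothetical, and a local directory stands in for HDFS; the leading underscore is chosen because Hadoop's file input formats skip files whose names start with an underscore, so later jobs reading the directory would ignore it.

```python
import json
from pathlib import Path
from typing import Dict, Mapping


def store_counters(counters: Mapping[str, int], output_dir: str,
                   filename: str = "_counters.json") -> Path:
    """Write the counters as JSON inside the job's output directory."""
    path = Path(output_dir) / filename
    path.write_text(json.dumps(dict(counters), sort_keys=True, indent=2))
    return path


def load_counters(output_dir: str,
                  filename: str = "_counters.json") -> Dict[str, int]:
    """Read the counters back for later analysis."""
    return json.loads((Path(output_dir) / filename).read_text())
```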
  • 25. Lazy deserialization FTW
      • At first: converted Protocol Buffers into Pig tuples at read time.
      • Moved to a Tuple wrapper that deserializes fields upon request.
      • Huge performance boost for wide tables with only a few used columns
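The wrapper idea can be sketched as a tuple-like object that decodes a field only on first access and caches the result. This is an illustration, not the Elephant-Bird code: per-field byte strings stand in for a serialized protocol buffer, and `LazyTuple` and `decode_calls` are hypothetical names.

```python
from typing import Any, Callable, Dict, Sequence


class LazyTuple:
    """Tuple-like wrapper that decodes a field only when it is first read."""

    def __init__(self, raw_fields: Sequence[bytes],
                 decoders: Sequence[Callable[[bytes], Any]]) -> None:
        if len(raw_fields) != len(decoders):
            raise ValueError("one decoder per field is required")
        self._raw = raw_fields
        self._decoders = decoders
        self._cache: Dict[int, Any] = {}
        self.decode_calls = 0  # instrumentation for the example only

    def __len__(self) -> int:
        return len(self._raw)

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self._raw):
            raise IndexError(index)
        if index not in self._cache:
            self.decode_calls += 1
            self._cache[index] = self._decoders[index](self._raw[index])
        return self._cache[index]
```

A script that projects 2 columns out of 50 pays for 2 decodes per record instead of 50, which is where the gain on wide tables comes from.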
  • 26. Also see
      • http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
      • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
      • http://www.slideshare.net/al3x/building-distributed-systems-in-scala
      • http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
      • http://www.slideshare.net/nkallen/q-con-3770885
  • 27. Questions? Follow me at twitter.com/squarecog
  • 28. Photo Credits
      • Bingo: http://www.flickr.com/photos/hownowdesign/2393662713/
      • Sandhill Crane: http://www.flickr.com/photos/dianeham/123491289/
      • Oakland Cranes: http://www.flickr.com/photos/clankennedy/2654213672/
