Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter

  1. Hadoop, Pig, HBase at Twitter <ul><ul><li>Dmitriy Ryaboy </li></ul></ul><ul><ul><li>Twitter Analytics </li></ul></ul><ul><ul><li>@squarecog </li></ul></ul>
  2. Who is this guy, anyway <ul><li>LBNL : Genome alignment & analysis </li></ul><ul><li>Ask.com : Click log data warehousing </li></ul><ul><li>CMU : MS in “Very Large Information Systems” </li></ul><ul><li>Cloudera : graduate student intern </li></ul><ul><li>Twitter : Hadoop, Pig, Big Data, ... </li></ul><ul><li>Pig committer. </li></ul>
  3. In This Talk <ul><li>Focus on Hadoop parts of data pipeline </li></ul><ul><li>Data movement </li></ul><ul><li>HBase </li></ul><ul><li>Pig </li></ul><ul><li>A few tips </li></ul>
  4. Not In This Talk <ul><li>Cassandra </li></ul><ul><li>FlockDB </li></ul><ul><li>Gizzard </li></ul><ul><li>Memcached </li></ul><ul><li>Rest of Twitter’s NoSQL Bingo card </li></ul>
  5. Daily workload <ul><li>1000s of Front End machines </li></ul><ul><li>3 Billion API requests </li></ul><ul><li>7 TB of ingested data </li></ul><ul><li>20,000 Hadoop jobs </li></ul><ul><li>55 Million tweets </li></ul><ul><li>Tweets only 0.5% of data </li></ul>
  6. Twitter data pipeline (simplified) <ul><li>Front Ends update DB cluster. Scheduled DB exports to HDFS </li></ul><ul><li>Front Ends, Middleware, Backend services write logs </li></ul><ul><li>Scribe pipes logs straight into HDFS </li></ul><ul><li>Various other data source exports into HDFS </li></ul><ul><li>Daemons populate work queues as new data shows up </li></ul><ul><li>Daemons (and cron) pull work off queues, schedule MR and Pig jobs </li></ul><ul><li>Pig wrapper pushes results into MySQL for reports and dashboards </li></ul>
  7. Logs <ul><li>Apache HTTP, W3C, JSON and Protocol Buffers </li></ul><ul><li>Each category goes into its own directory on HDFS </li></ul><ul><li>Everything is LZO compressed. </li></ul><ul><li>You need to index LZO files to make them splittable. </li></ul><ul><li>We use a patched version of Hadoop LZO libraries </li></ul><ul><li>See http://github.com/kevinweil/hadoop-lzo </li></ul>
  8. Tables <ul><li>Users, tweets, geotags, trends, registered devices, etc. </li></ul><ul><li>Automatic generation of protocol buffer definitions from SQL tables </li></ul><ul><li>Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers </li></ul><ul><li>See Elephant-Bird: http://github.com/kevinweil/elephant-bird </li></ul>
  9. ETL <ul><li>&quot;Crane&quot;, config driven, protocol buffer powered. </li></ul><ul><li>Sources/Sinks: HDFS, HBase, MySQL tables, web services </li></ul><ul><li>Protobuf-based transformations: chain sets of <input proto, output proto, transformation class> </li></ul>
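Crane's code isn't shown in the deck; a minimal Python sketch of the chained &lt;input proto, output proto, transformation class&gt; idea, using plain Python types and hypothetical stages in place of generated protobuf classes:

```python
def run_chain(stages, record):
    """Apply a Crane-style chain of (input_type, output_type, transform)
    stages, checking each stage receives and emits its declared type."""
    for input_type, output_type, transform in stages:
        if not isinstance(record, input_type):
            raise TypeError("stage expected %s, got %s"
                            % (input_type.__name__, type(record).__name__))
        record = transform(record)
        if not isinstance(record, output_type):
            raise TypeError("stage promised %s, emitted %s"
                            % (output_type.__name__, type(record).__name__))
    return record

# Hypothetical stages standing in for protobuf-to-protobuf transforms:
stages = [
    (dict, dict, lambda r: {**r, "user_id": int(r["user_id"])}),
    (dict, str,  lambda r: "user:%d" % r["user_id"]),
]
```

Declaring the input/output types per stage lets a config-driven runner validate a whole chain before any data moves through it.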
  10. HBase
  11. Mutability <ul><li>Logs are immutable; HDFS is great. </li></ul><ul><li>Tables have mutable data. </li></ul><ul><li>Ignore updates? bad data </li></ul><ul><li>Pull updates, resolve at read time? Pain, time. </li></ul><ul><li>Pull updates, resolve in batches? Pain, time. </li></ul><ul><li>Let someone else do the resolving? Helloooo, HBase! </li></ul><ul><li>Bonus: various NoSQL bonuses, &quot;not just scans&quot;. Lookups, indexes. </li></ul><ul><li>Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet. </li></ul><ul><li>That being said, several services rely on HBase already. </li></ul>
  12. Aren't you guys Cassandra poster boys? <ul><li>YES, but </li></ul><ul><li>Rough analogy: Cassandra is OLTP and HBase is OLAP </li></ul><ul><li>Cassandra used when we need low-latency, single-key reads and writes </li></ul><ul><li>HBase scans much more powerful </li></ul><ul><li>HBase co-locates data on the Hadoop cluster. </li></ul>
  13. HBase schema for MySQL exports, v1. <ul><li>Want to query by created_at range, by updated_at range, and / or by user_id. </li></ul><ul><li>Key: [created_at, id] </li></ul><ul><li>CF: &quot;columns&quot; </li></ul><ul><li>Configs specify which columns to pull out and store explicitly. </li></ul><ul><li>Useful for indexing, cheap (HBase-side) filtering </li></ul><ul><li>CF: &quot;protobuf&quot; </li></ul><ul><li>A single column, contains serialized protocol buffer. </li></ul>
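The deck doesn't show how the [created_at, id] composite key is encoded; one standard way, sketched here, is fixed-width big-endian packing, so that HBase's byte-lexicographic row ordering matches numeric order and created_at range scans work directly:

```python
import struct

def row_key_v1(created_at, record_id):
    # Fixed-width big-endian packing makes byte-lexicographic order
    # equal numeric order: all rows for a created_at value sort
    # together, and ranges of created_at are contiguous scans.
    return struct.pack(">QQ", created_at, record_id)
```

This is exactly the layout that makes the last region a write hot spot: new records always have the largest created_at, so they all land at the end of the table.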
  14. HBase schema v1, cont. <ul><li>Pro: easy to query by created_at range </li></ul><ul><li>Con: hard to pull out specific users (requires a full scan) </li></ul><ul><li>Con: hot spot at the last region for writes </li></ul><ul><li>Idea: put created_at into 'columns' CF, make user_id key </li></ul><ul><li>BUT ids mostly sequential; still a hot spot at the end of the table </li></ul><ul><li>Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem. </li></ul>
  15. HBase schema, v2. <ul><li>Key: inverted Id. Bottom bits are random. Ahh, finally, distribution. </li></ul><ul><li>Date range queries: new CF, 'time' </li></ul><ul><li>keep all versions of this CF </li></ul><ul><li>When specific time range needed, use index on the time column </li></ul><ul><li>Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record </li></ul>
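The deck doesn't specify the inversion; a sketch assuming "inverted" means bit-reversed, so the id's random bottom bits become the key's top bits and consecutive ids scatter across regions:

```python
def invert_id(record_id, bits=64):
    # Reverse the bit order of the id: the sequential/timestamp high
    # bits end up low, and the low (effectively random) bits end up
    # high, spreading writes for consecutive ids across HBase regions.
    out = 0
    for _ in range(bits):
        out = (out << 1) | (record_id & 1)
        record_id >>= 1
    return out
```

Bit reversal is its own inverse at a fixed width, so the original id is recoverable from the row key without storing it twice.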
  16. Pig
  17. Why Pig? <ul><li>Much faster to write than vanilla MR </li></ul><ul><li>Step-by-step iterative expression of data flows intuitive to programmers </li></ul><ul><li>SQL support coming for those who prefer SQL (PIG-824) </li></ul><ul><li>Trivial to write UDFs </li></ul><ul><li>Easy to write Loaders (Even better with 0.7!) </li></ul><ul><li>For example, we can write Protobuf and HBase loaders... </li></ul><ul><li>Both in Elephant-Bird </li></ul>
  18. HBase Loader enhancements <ul><li>Data expected to be binary, not String representations </li></ul><ul><li>Push down key range filters </li></ul><ul><li>Specify row caching (memory / speed tradeoff) </li></ul><ul><li>Optionally load the key </li></ul><ul><li>Optionally limit rows per region </li></ul><ul><li>Report progress </li></ul><ul><li>Haven't observed significant overhead vs. HBase scanning </li></ul>
  19. HBase Loader TODOs <ul><li>Expose better control of filters </li></ul><ul><li>Expose timestamp controls </li></ul><ul><li>Expose Index hints (IHBase) </li></ul><ul><li>Automated filter and projection push-down (once on 0.7) </li></ul><ul><li>HBase Storage </li></ul>
  20. Elephant Bird <ul><li>Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers </li></ul><ul><li>Starting to work on same for Thrift </li></ul><ul><li>HBase Loader </li></ul><ul><li>assorted UDFs </li></ul><ul><li>http://www.github.com/kevinweil/elephant-bird </li></ul>
  21. Assorted Tips
  22. Bad records kill jobs <ul><li>Big data is messy. </li></ul><ul><li>Catch exceptions, increment counter, return null </li></ul><ul><li>Deal with potential nulls </li></ul><ul><li>Far preferable to a single bad record bringing down the whole job </li></ul>
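The catch-count-return-null pattern above, sketched in Python (in the real pipeline this lives in Java/Pig UDFs; the counter names are illustrative):

```python
import json
from collections import Counter

counters = Counter()

def safe_parse(line):
    # Catch anything a malformed record throws, bump a counter,
    # and return None instead of letting one bad row kill the job.
    try:
        record = json.loads(line)
        counters["records.parsed"] += 1
        return record
    except Exception:
        counters["records.unparsable"] += 1
        return None

lines = ['{"tweet_id": 1}', "%%% corrupt row %%%", '{"tweet_id": 2}']
# "Deal with potential nulls": downstream code filters them out.
records = [r for r in map(safe_parse, lines) if r is not None]
```

The counters are what make this safe rather than silent: a sudden spike in `records.unparsable` tells you the upstream format changed, without failing the job.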
  23. Runaway UDFs kill jobs <ul><li>Regex over a few billion tweets, most return in milliseconds. </li></ul><ul><li>8 cause the regex to take more than 5 minutes, task gets reaped. </li></ul><ul><li>You clever twitterers, you. </li></ul><ul><li>MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time. </li></ul><ul><li>Plan to contribute to Pig, add to ElephantBird. May build into Pig internals. </li></ul>
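A sketch of the MonitoredUDF idea in Python (the real wrapper is Java; note that Python can only abandon, not kill, the runaway thread):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool for monitored calls; in a real task you would size this
# to tolerate a few abandoned (stuck) workers.
_pool = ThreadPoolExecutor(max_workers=4)

def monitored_udf(func, timeout_sec, default):
    """Run func under a watchdog: if it doesn't finish in timeout_sec,
    return `default` instead of stalling until the task gets reaped.
    (Python cannot kill the stuck thread, so it is merely abandoned.)"""
    def wrapper(*args):
        future = _pool.submit(func, *args)
        try:
            return future.result(timeout=timeout_sec)
        except FutureTimeout:
            return default
    return wrapper
```

For a regex UDF over billions of tweets, this trades 8 pathological inputs returning a default value for the whole task surviving.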
  24. Use Counters <ul><li>Use counters. Count everything. </li></ul><ul><li>UDF invocations, parsed records, unparsable records, timed-out UDFs... </li></ul><ul><li>Hook into cleanup phases and store counters to disk, next to data, for future analysis </li></ul><ul><li>Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible. </li></ul>
  25. Lazy deserialization FTW <ul><li>At first: converted Protocol Buffers into Pig tuples at read time. </li></ul><ul><li>Moved to a Tuple wrapper that deserializes fields upon request. </li></ul><ul><li>Huge performance boost for wide tables with only a few used columns </li></ul>
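A sketch of the lazy-wrapper idea (the real code wraps protocol buffers in a Pig Tuple; `decode_field` here is a stand-in for pulling one field off the serialized record):

```python
class LazyTuple:
    """Wrap a still-serialized record; decode a field only on first
    access. `decode_field(raw, name)` stands in for extracting a
    single field from a protocol buffer."""
    def __init__(self, raw, decode_field):
        self._raw = raw
        self._decode_field = decode_field
        self._cache = {}
        self.decodes = 0  # instrumentation: fields actually decoded

    def get(self, field):
        if field not in self._cache:
            self._cache[field] = self._decode_field(self._raw, field)
            self.decodes += 1
        return self._cache[field]
```

A script that touches 2 of 50 columns pays for 2 decodes per record instead of 50, which is where the win on wide tables comes from.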
  26. Also see <ul><li>http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter </li></ul><ul><li>http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010 </li></ul><ul><li>http://www.slideshare.net/al3x/building-distributed-systems-in-scala </li></ul><ul><li>http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra </li></ul><ul><li>http://www.slideshare.net/nkallen/q-con-3770885 </li></ul>
  27. Questions? Follow me at twitter.com/squarecog
  28. Photo Credits <ul><li>Bingo: http://www.flickr.com/photos/hownowdesign/2393662713/ </li></ul><ul><li>Sandhill Crane: http://www.flickr.com/photos/dianeham/123491289/ </li></ul><ul><li>Oakland Cranes: http://www.flickr.com/photos/clankennedy/2654213672/ </li></ul>