
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter


1. Hadoop, Pig, HBase at Twitter
   - Dmitriy Ryaboy
   - Twitter Analytics
   - @squarecog
2. Who is this guy, anyway?
   - LBNL: genome alignment & analysis
   - : click log data warehousing
   - CMU: MS in “Very Large Information Systems”
   - Cloudera: graduate student intern
   - Twitter: Hadoop, Pig, Big Data, ...
   - Pig committer.
3. In This Talk
   - Focus on the Hadoop parts of the data pipeline
   - Data movement
   - HBase
   - Pig
   - A few tips
4. Not In This Talk
   - Cassandra
   - FlockDB
   - Gizzard
   - Memcached
   - Rest of Twitter’s NoSQL Bingo card
5. Daily workload
   - 1000s of front-end machines
   - 3 billion API requests
   - 7 TB of ingested data
   - 20,000 Hadoop jobs
   - 55 million tweets
   - Tweets are only 0.5% of the data
6. Twitter data pipeline (simplified)
   - Front ends update the DB cluster; scheduled DB exports go to HDFS
   - Front ends, middleware, and backend services write logs
   - Scribe pipes logs straight into HDFS
   - Various other data sources export into HDFS
   - Daemons populate work queues as new data shows up
   - Daemons (and cron) pull work off the queues and schedule MapReduce and Pig jobs
   - A Pig wrapper pushes results into MySQL for reports and dashboards
7. Logs
   - Apache HTTP, W3C, JSON, and Protocol Buffers
   - Each category goes into its own directory on HDFS
   - Everything is LZO-compressed
   - You need to index LZO files to make them splittable
   - We use a patched version of the Hadoop LZO libraries
   - See
8. Tables
   - Users, tweets, geotags, trends, registered devices, etc.
   - Automatic generation of protocol buffer definitions from SQL tables
   - Automatic generation of Hadoop Writables, Input/Output formats, and Pig loaders from protocol buffers
   - See Elephant-Bird:
9. ETL
   - "Crane": config-driven, protocol buffer powered
   - Sources/sinks: HDFS, HBase, MySQL tables, web services
   - Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
10. HBase
11. Mutability
   - Logs are immutable; HDFS is great.
   - Tables have mutable data.
   - Ignore updates? Bad data.
   - Pull updates, resolve at read time? Pain, time.
   - Pull updates, resolve in batches? Pain, time.
   - Let someone else do the resolving? Helloooo, HBase!
   - Bonus: various NoSQL bonuses, "not just scans": lookups, indexes.
   - Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
   - That said, several services rely on HBase already.
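The "let someone else do the resolving" point can be illustrated with a toy sketch (plain Python, not the HBase API): in a versioned, last-write-wins store, each put is kept with a timestamp and a read simply returns the newest version, so updates resolve themselves with no batch merge job.

```python
class VersionedStore:
    """Minimal last-write-wins key/value store, HBase-style."""
    def __init__(self):
        self.cells = {}  # key -> list of (timestamp, value)

    def put(self, key, value, ts):
        self.cells.setdefault(key, []).append((ts, value))

    def get(self, key):
        # Newest timestamp wins; the reader never merges updates by hand.
        return max(self.cells[key])[1]

store = VersionedStore()
store.put("user:1", {"name": "ad"}, ts=100)    # nightly table export
store.put("user:1", {"name": "adam"}, ts=250)  # later update
latest = store.get("user:1")                   # the update wins automatically
```

The resolve-in-batches alternative would re-implement that `max` over timestamps as a periodic MapReduce job, which is exactly the pain the slide describes.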
12. Aren't you guys Cassandra poster boys?
   - YES, but...
   - Rough analogy: Cassandra is OLTP and HBase is OLAP
   - Cassandra is used when we need low-latency, single-key reads and writes
   - HBase scans are much more powerful
   - HBase co-locates data on the Hadoop cluster
13. HBase schema for MySQL exports, v1
   - Want to query by created_at range, by updated_at range, and/or by user_id.
   - Key: [created_at, id]
   - CF "columns": configs specify which columns to pull out and store explicitly. Useful for indexing and cheap (HBase-side) filtering.
   - CF "protobuf": a single column containing the serialized protocol buffer.
14. HBase schema v1, cont.
   - Pro: easy to query by created_at range
   - Con: hard to pull out specific users (requires a full scan)
   - Con: hot spot at the last region for writes
   - Idea: put created_at into the 'columns' CF, make user_id the key
   - BUT ids are mostly sequential; still a hot spot at the end of the table
   - Transitioning to non-sequential ids, but their high bits are the creation timestamp! Same problem.
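The write hot spot above follows directly from byte-ordered row keys. A minimal Python sketch (struct packing stands in for HBase's lexicographic key order; the values are made up):

```python
import struct

def row_key(created_at, tweet_id):
    # Big-endian packing makes byte-wise (HBase) sort order match
    # numeric order on (created_at, id).
    return struct.pack(">qq", created_at, tweet_id)

# New tweets always carry the largest created_at seen so far...
keys = [row_key(ts, tweet_id) for tweet_id, ts in enumerate(range(1000, 1010))]

# ...so byte-sorted order equals arrival order: every new row sorts past
# the end of the table, and all writes land on the last region.
assert sorted(keys) == keys
```

Any key whose leading bytes grow monotonically (created_at, a sequential id, or a timestamp-prefixed id) has this property, which is why all three variants on the slide hit the same wall.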
15. HBase schema, v2
   - Key: inverted id. Bottom bits are random. Ahh, finally, distribution.
   - Date range queries: new CF, 'time'; keep all versions of this CF
   - When a specific time range is needed, use the index on the time column
   - Keeping time in a separate CF lets us track every time the record got updated, without storing all versions of the record
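One way to sketch the "inverted id" trick (illustrative Python, not the production key scheme): reversing the bit order moves the fast-changing low bits of an id to the front of the key, so near-sequential ids scatter across the keyspace instead of piling up at the end.

```python
import struct

def inverted_key(tweet_id):
    # Reverse the 64 bits of the id: the fast-changing (random-ish) low
    # bits become the leading bits of the row key.
    rev = int(format(tweet_id, "064b")[::-1], 2)
    return struct.pack(">Q", rev)

sequential_ids = range(100, 110)
first_bytes = {inverted_key(i)[0] for i in sequential_ids}

# Ten consecutive ids now start with ten different key bytes, so writes
# spread across regions instead of hammering the last one.
assert len(first_bytes) == 10
```

The trade-off, as the slide notes, is that range scans by time are gone, which is what the separate versioned 'time' CF and its index recover.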
16. Pig
17. Why Pig?
   - Much faster to write than vanilla MapReduce
   - Step-by-step, iterative expression of data flows is intuitive to programmers
   - SQL support coming for those who prefer SQL (PIG-824)
   - Trivial to write UDFs
   - Easy to write loaders (even better with 0.7!)
   - For example, we can write Protobuf and HBase loaders... both in Elephant-Bird
18. HBase loader enhancements
   - Data expected to be binary, not String representations
   - Push down key range filters
   - Specify row caching (memory / speed tradeoff)
   - Optionally load the key
   - Optionally limit rows per region
   - Report progress
   - Haven't observed significant overhead vs. raw HBase scanning
19. HBase loader TODOs
   - Expose better control of filters
   - Expose timestamp controls
   - Expose index hints (IHBase)
   - Automated filter and projection push-down (once on 0.7)
   - HBase storage
20. Elephant-Bird
   - Auto-generates Hadoop Input/Output formats, Writables, and Pig loaders for Protocol Buffers
   - Starting to work on the same for Thrift
   - HBase loader
   - Assorted UDFs
21. Assorted Tips
22. Bad records kill jobs
   - Big data is messy.
   - Catch exceptions, increment a counter, return null.
   - Deal with the potential nulls downstream.
   - Far preferable to a single bad record bringing down the whole job.
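The catch/count/return-null pattern, as a minimal Python sketch (the parse step is a stand-in; in this stack it would live in a Java UDF using Hadoop counters):

```python
from collections import Counter

counters = Counter()

def safe_parse(record):
    """Never let one bad record kill the job: catch, count, return None."""
    counters["records.seen"] += 1
    try:
        return int(record)  # stand-in for the real parsing logic
    except Exception:
        counters["records.bad"] += 1
        return None

rows = ["1", "2", "oops", "4"]
# Downstream code must deal with the potential Nones.
parsed = [v for v in (safe_parse(r) for r in rows) if v is not None]
```

The counter is the important part: the job finishes, and the bad-record count tells you whether 0.001% or 40% of the input was garbage.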
23. Runaway UDFs kill jobs
   - Regex over a few billion tweets: most return in milliseconds.
   - 8 cause the regex to take more than 5 minutes, and the task gets reaped.
   - You clever twitterers, you.
   - A MonitoredUDF wrapper kicks off a monitoring thread, kills the UDF, and returns a default value if it doesn't return in time.
   - Plan to contribute to Pig, add to Elephant-Bird. May build into Pig internals.
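A rough sketch of the MonitoredUDF idea in Python (the real wrapper is Java; a Python thread can only be abandoned, not truly killed, so this is illustrative only):

```python
import concurrent.futures
import time

def monitored(udf, timeout_s, default):
    """Wrap a UDF: if it doesn't return in time, give up and return a
    default value instead of letting the task tracker reap the task."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    def wrapped(*args):
        future = pool.submit(udf, *args)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return default  # runaway call is abandoned; the job keeps going
    return wrapped

fast = monitored(str.upper, timeout_s=1.0, default=None)
slow = monitored(lambda s: time.sleep(0.5) or s, timeout_s=0.05, default=None)
```

The pathological 8-in-a-few-billion inputs become default values (which the counters from the previous tip can record) rather than 5-minute stalls that get the task killed.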
24. Use counters
   - Use counters. Count everything.
   - UDF invocations, parsed records, unparsable records, timed-out UDFs...
   - Hook into cleanup phases and store counters to disk, next to the data, for future analysis.
   - Don't have this for Pig yet, but 0.8 adds metadata to job confs that makes it possible.
25. Lazy deserialization FTW
   - At first: converted Protocol Buffers into Pig tuples at read time.
   - Moved to a Tuple wrapper that deserializes fields upon request.
   - Huge performance boost for wide tables where only a few columns are used.
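The wrapper idea in miniature (Python sketch; UTF-8 decoding stands in for protobuf field deserialization, and the class name is made up):

```python
class LazyTuple:
    """Decode a field only the first time it is asked for."""
    def __init__(self, raw_fields):
        self._raw = raw_fields   # still-serialized field bytes
        self._decoded = {}
        self.decode_count = 0    # instrumentation for the demo

    def get(self, i):
        if i not in self._decoded:
            self.decode_count += 1
            self._decoded[i] = self._raw[i].decode("utf-8")  # the "expensive" step
        return self._decoded[i]

row = LazyTuple([b"42", b"hello", b"<hundreds of unused columns>"])
row.get(1)
row.get(1)
# A wide row, but only the one requested field was decoded, exactly once.
assert row.decode_count == 1
```

For a script that touches 3 columns out of 100, roughly 97% of the deserialization work simply never happens, which is where the performance boost comes from.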
26. Also see
27. Questions? Follow me at
28. Photo Credits
   - Bingo:
   - Sandhill Crane:
   - Oakland Cranes: