Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
  • 1. Hadoop, Pig, HBase at Twitter
      • Dmitriy Ryaboy
      • Twitter Analytics
      • @squarecog
  • 2. Who is this guy, anyway?
    • LBNL : Genome alignment & analysis
    • : Click log data warehousing
    • CMU : MS in “Very Large Information Systems”
    • Cloudera : graduate student intern
    • Twitter : Hadoop, Pig, Big Data, ...
    • Pig committer.
  • 3. In This Talk
    • Focus on Hadoop parts of data pipeline
    • Data movement
    • HBase
    • Pig
    • A few tips
  • 4. Not In This Talk
    • Cassandra
    • FlockDB
    • Gizzard
    • Memcached
    • Rest of Twitter’s NoSQL Bingo card
  • 5. Daily workload
    • 1000s of Front End machines
    • 3 Billion API requests
    • 7 TB of ingested data
    • 20,000 Hadoop jobs
    • 55 Million tweets
    • Tweets only 0.5% of data
  • 6. Twitter data pipeline (simplified)
    • Front Ends update DB cluster. Scheduled DB exports to HDFS
    • Front Ends, Middleware, Backend services write logs
    • Scribe pipes logs straight into HDFS
    • Various other data source exports into HDFS
    • Daemons populate work queues as new data shows up
    • Daemons (and cron) pull work off queues, schedule MR and Pig jobs
    • Pig wrapper pushes results into MySQL for reports and dashboards
  • 7. Logs
    • Apache HTTP, W3C, JSON and Protocol Buffers
    • Each category goes into its own directory on HDFS
    • Everything is LZO compressed.
    • You need to index LZO files to make them splittable.
    • We use a patched version of Hadoop LZO libraries
    • See
  • 8. Tables
    • Users, tweets, geotags, trends, registered devices, etc
    • Automatic generation of protocol buffer definitions from SQL tables
    • Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers
    • See Elephant-Bird:
  • 9. ETL
    • "Crane", config driven, protocol buffer powered.
    • Sources/Sinks: HDFS, HBase, MySQL tables, web services
    • Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
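The chained-transformation idea above can be sketched roughly as follows (a hypothetical Python stand-in; Crane itself is config-driven and operates on protocol buffers, and every name in this sketch is invented):

```python
# Hypothetical sketch of Crane-style chained transformations:
# each stage declares <input type, output type, transformation>.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    input_type: type            # stands in for the input proto type
    output_type: type           # stands in for the output proto type
    transform: Callable[[Any], Any]

def run_chain(stages, record):
    """Apply each stage in order, checking that the types line up."""
    for stage in stages:
        assert isinstance(record, stage.input_type), "stage input type mismatch"
        record = stage.transform(record)
        assert isinstance(record, stage.output_type), "stage output type mismatch"
    return record

# Toy example: parse a raw line into a dict, then project one field.
stages = [
    Stage(str, dict, lambda line: dict(zip(("user", "msg"), line.split("\t")))),
    Stage(dict, str, lambda rec: rec["user"].lower()),
]
print(run_chain(stages, "SquareCog\thello"))  # squarecog
```

Declaring input and output types per stage lets a config-driven driver verify that a chain is well-typed before any data flows through it.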
  • 10. HBase
  • 11. Mutability
    • Logs are immutable; HDFS is great.
    • Tables have mutable data.
    • Ignore updates? Bad data.
    • Pull updates, resolve at read time? Pain, time.
    • Pull updates, resolve in batches? Pain, time.
    • Let someone else do the resolving? Helloooo, HBase!
    • Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.
    • Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
    • That being said, several services rely on HBase already.
  • 12. Aren't you guys Cassandra poster boys?
    • YES but
    • Rough analogy: Cassandra is OLTP and HBase is OLAP
    • Cassandra used when we need low-latency, single-key reads and writes
    • HBase scans much more powerful
    • HBase co-locates data on the Hadoop cluster.
  • 13. HBase schema for MySQL exports, v1.
    • Want to query by created_at range, by updated_at range, and / or by user_id.
    • Key: [created_at, id]
    • CF: "columns"
    • Configs specify which columns to pull out and store explicitly.
    • Useful for indexing, cheap (HBase-side) filtering
    • CF: "protobuf"
    • A single column, contains serialized protocol buffer.
  • 14. HBase schema v1, cont.
    • Pro: easy to query by created_at range
    • Con: hard to pull out specific users (requires a full scan)
    • Con: hot spot at the last region for writes
    • Idea: put created_at into 'columns' CF, make user_id key
    • BUT ids mostly sequential; still a hot spot at the end of the table
    • Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.
  • 15. HBase schema, v2.
    • Key: inverted Id. Bottom bits are random. Ahh, finally, distribution.
    • Date range queries: new CF, 'time'
    • keep all versions of this CF
    • When specific time range needed, use index on the time column
    • Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
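The v1-vs-v2 key trade-off can be illustrated with a small sketch (a Python stand-in; the exact byte layout of the real keys is an assumption here, and "inverted id" is modeled as a bit-reversal so that the id's rapidly-changing low bits lead the key):

```python
import struct

def v1_key(created_at: int, record_id: int) -> bytes:
    # v1: [created_at, id] -- keys sort by time, so all new writes
    # land in the last region (a write hot spot).
    return struct.pack(">QQ", created_at, record_id)

def reverse_bits(n: int, width: int = 64) -> int:
    """Reverse the bits of an unsigned integer of the given width."""
    return int(format(n, "0{}b".format(width))[::-1], 2)

def v2_key(record_id: int) -> bytes:
    # v2: bit-reversed id -- the id's low (random-ish) bits become the
    # key's high bits, spreading writes across regions.
    return struct.pack(">Q", reverse_bits(record_id))

# Sequential ids give adjacent v1 keys but well-spread v2 keys.
ids = [1000, 1001, 1002]
print([v1_key(1273000000, i).hex() for i in ids])
print([v2_key(i).hex() for i in ids])
```

With bit-reversed keys, time-range queries can no longer use the row key, which is why slide 15 moves the timestamps into a separate 'time' column family instead.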
  • 16. Pig
  • 17. Why Pig?
    • Much faster to write than vanilla MR
    • Step-by-step iterative expression of data flows intuitive to programmers
    • SQL support coming for those who prefer SQL (PIG-824)
    • Trivial to write UDFs
    • Easy to write Loaders (Even better with 0.7!)
    • For example, we can write Protobuf and HBase loaders...
    • Both in Elephant-Bird
  • 18. HBase Loader enhancements
    • Data expected to be binary, not String representations
    • Push down key range filters
    • Specify row caching (memory / speed tradeoff)
    • Optionally load the key
    • Optionally limit rows per region
    • Report progress
    • Haven't observed significant overhead vs. HBase scanning
  • 19. HBase Loader TODOs
    • Expose better control of filters
    • Expose timestamp controls
    • Expose Index hints (IHBase)
    • Automated filter and projection push-down (once on 0.7)
    • HBase Storage
  • 20. Elephant Bird
    • Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers
    • Starting to work on same for Thrift
    • HBase Loader
    • assorted UDFs
  • 21. Assorted Tips
  • 22. Bad records kill jobs
    • Big data is messy.
    • Catch exceptions, increment counter, return null
    • Deal with potential nulls
    • Far preferable to a single bad record bringing down the whole job
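A minimal sketch of the catch/count/return-null pattern (a Python stand-in for a Pig/Java UDF; `counters` is a plain dict standing in for Hadoop's job counters):

```python
# Pattern: never let one bad record kill the job.
# Catch the exception, bump a counter, return null (None).
counters = {"parsed": 0, "bad_records": 0}

def safe_parse(raw: str):
    """Parse a record; on any failure, count it and return None."""
    try:
        user, count = raw.split("\t")
        result = (user, int(count))
        counters["parsed"] += 1
        return result
    except Exception:
        counters["bad_records"] += 1
        return None            # downstream code must tolerate nulls

records = ["alice\t3", "garbage line", "bob\t7"]
parsed = [r for r in (safe_parse(x) for x in records) if r is not None]
print(parsed, counters)  # [('alice', 3), ('bob', 7)] {'parsed': 2, 'bad_records': 1}
```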
  • 23. Runaway UDFs kill jobs
    • Regex over a few billion tweets, most return in milliseconds.
    • 8 cause the regex to take more than 5 minutes; the task gets reaped.
    • You clever twitterers, you.
    • MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time.
    • Plan to contribute to Pig, add to ElephantBird. May build into Pig internals.
  • 24. Use Counters
    • Use counters. Count everything.
    • UDF invocations, parsed records, unparsable records, timed-out UDFs...
    • Hook into cleanup phases and store counters to disk, next to data, for future analysis
    • Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible.
  • 25. Lazy deserialization FTW
    • At first: converted Protocol Buffers into Pig tuples at read time.
    • Moved to a Tuple wrapper that deserializes fields upon request.
    • Huge performance boost for wide tables with only a few used columns
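A rough sketch of the lazy-deserialization wrapper (a Python stand-in; the real wrapper is a Pig Tuple over a serialized protocol buffer, and the `|`-delimited record here is a toy format):

```python
# Keep the serialized record as-is; decode a field only when asked for it.
class LazyTuple:
    def __init__(self, raw: bytes):
        self._raw = raw        # serialized record, untouched
        self._cache = {}       # fields decoded so far
        self.decodes = 0       # how many field decodes actually happened

    def get(self, index: int):
        """Decode field `index` the first time it is requested, then cache it."""
        if index not in self._cache:
            self._cache[index] = self._raw.split(b"|")[index].decode()
            self.decodes += 1
        return self._cache[index]

# A "wide" record with many columns; the query only touches one.
row = LazyTuple(b"id42|alice|SF|2010-05-01|...many more columns...")
print(row.get(1), row.decodes)  # alice 1
row.get(1)
print(row.decodes)              # still 1: cached, not re-decoded
```

For a wide table where a script projects out two columns, this skips deserializing everything else, which is where the performance win on slide 25 comes from.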
  • 26. Also see
  • 27. Questions? Follow me: @squarecog
  • 28. Photo Credits
    • Bingo:
    • Sandhill Crane:
    • Oakland Cranes: