Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter Presentation Transcript

  • 1. Hadoop, Pig, HBase at Twitter
      • Dmitriy Ryaboy
      • Twitter Analytics
      • @squarecog
  • 2. Who is this guy, anyway?
    • LBNL: Genome alignment & analysis
    • Ask.com: Click log data warehousing
    • CMU: MS in “Very Large Information Systems”
    • Cloudera: graduate student intern
    • Twitter: Hadoop, Pig, Big Data, ...
    • Pig committer.
  • 3. In This Talk
    • Focus on Hadoop parts of data pipeline
    • Data movement
    • HBase
    • Pig
    • A few tips
  • 4. Not In This Talk
    • Cassandra
    • FlockDB
    • Gizzard
    • Memcached
    • Rest of Twitter’s NoSQL Bingo card
  • 5. Daily workload
    • 1000s of Front End machines
    • 3 Billion API requests
    • 7 TB of ingested data
    • 20,000 Hadoop jobs
    • 55 Million tweets
    • Tweets only 0.5% of data
  • 6. Twitter data pipeline (simplified)
    • Front Ends update DB cluster. Scheduled DB exports to HDFS
    • Front Ends, Middleware, Backend services write logs
    • Scribe pipes logs straight into HDFS
    • Various other data source exports into HDFS
    • Daemons populate work queues as new data shows up
    • Daemons (and cron) pull work off queues, schedule MR and Pig jobs
    • Pig wrapper pushes results into MySQL for reports and dashboards
  • 7. Logs
    • Apache HTTP, W3C, JSON and Protocol Buffers
    • Each category goes into its own directory on HDFS
    • Everything is LZO compressed.
    • You need to index LZO files to make them splittable (see the sketch after this slide).
    • We use a patched version of the Hadoop LZO libraries.
    • See http://github.com/kevinweil/hadoop-lzo
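A minimal sketch of the indexing step, assuming the LzoIndexer class from the hadoop-lzo repository linked above; the log path is hypothetical:

```java
// Build a .index file next to each .lzo file so MapReduce can split it.
// Assumes com.hadoop.compression.lzo.LzoIndexer from kevinweil/hadoop-lzo;
// the path below is a hypothetical log directory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import com.hadoop.compression.lzo.LzoIndexer;

public class IndexLzoLogs {
  public static void main(String[] args) throws Exception {
    LzoIndexer indexer = new LzoIndexer(new Configuration());
    // Recursively indexes every .lzo file under the given directory.
    indexer.index(new Path("/logs/api/2010-05-01"));
  }
}
```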
  • 8. Tables
    • Users, tweets, geotags, trends, registered devices, etc.
    • Automatic generation of protocol buffer definitions from SQL tables
    • Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers
    • See Elephant-Bird: http://github.com/kevinweil/elephant-bird
  • 9. ETL
    • "Crane", config driven, protocol buffer powered.
    • Sources/Sinks: HDFS, HBase, MySQL tables, web services
    • Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
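Crane itself is internal to Twitter, so the following is purely an illustrative sketch of what one link in such a chain could look like; every name here is hypothetical:

```java
// Illustrative only: one stage in a config-driven, protobuf-to-protobuf
// ETL chain. Crane is not public; this interface is a guess at the shape.
import com.google.protobuf.Message;

public interface ProtoTransformation<IN extends Message, OUT extends Message> {
  // Turn one input record into one output record; return null to drop it.
  OUT transform(IN input);
}
```

A config file would then name the input proto, the output proto, and the class implementing the transformation for each stage of the chain.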
  • 10. HBase
  • 11. Mutability
    • Logs are immutable; HDFS is great.
    • Tables have mutable data.
    • Ignore updates? Bad data.
    • Pull updates, resolve at read time? Pain, time.
    • Pull updates, resolve in batches? Pain, time.
    • Let someone else do the resolving? Helloooo, HBase!
    • Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.
    • Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
    • That being said, several services rely on HBase already.
  • 12. Aren't you guys Cassandra poster boys?
    • YES but
    • Rough analogy: Cassandra is OLTP and HBase is OLAP
    • Cassandra used when we need low-latency, single-key reads and writes
    • HBase scans much more powerful
    • HBase co-locates data on the Hadoop cluster.
  • 13. HBase schema for MySQL exports, v1.
    • Want to query by created_at range, by updated_at range, and / or by user_id.
    • Key: [created_at, id]
    • CF: "columns"
    • Configs specify which columns to pull out and store explicitly.
    • Useful for indexing, cheap (HBase-side) filtering
    • CF: "protobuf"
    • A single column containing the serialized protocol buffer (see the sketch after this slide).
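A sketch of writing one row under this layout, using the standard HBase client API of the era; the qualifier names ("user_id", "pb") and the helper shape are assumptions beyond what the slide states:

```java
// Sketch: build a v1-style row. Key is [created_at, id]; CF "columns" holds
// explicitly pulled-out fields, CF "protobuf" holds the whole serialized record.
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class V1Row {
  public static Put makePut(long createdAt, long id, long userId, byte[] proto) {
    byte[] rowKey = Bytes.add(Bytes.toBytes(createdAt), Bytes.toBytes(id));
    Put put = new Put(rowKey);
    // Selected columns, cheap to filter/index on the HBase side.
    put.add(Bytes.toBytes("columns"), Bytes.toBytes("user_id"), Bytes.toBytes(userId));
    // The full record as one serialized protocol buffer.
    put.add(Bytes.toBytes("protobuf"), Bytes.toBytes("pb"), proto);
    return put;
  }
}
```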
  • 14. HBase schema v1, cont.
    • Pro: easy to query by created_at range
    • Con: hard to pull out specific users (requires a full scan)
    • Con: hot spot at the last region for writes
    • Idea: put created_at into 'columns' CF, make user_id key
    • BUT ids mostly sequential; still a hot spot at the end of the table
    • Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.
  • 15. HBase schema, v2.
    • Key: inverted id. Bottom bits are random. Ahh, finally, distribution (inversion sketched after this slide).
    • Date range queries: new CF, 'time'
    • keep all versions of this CF
    • When specific time range needed, use index on the time column
    • Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
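The slide doesn't spell out how the id is inverted; bit reversal is one plausible reading, sketched here:

```java
// Sketch: reverse the id's bits so the low (effectively random) bits lead.
// New ids carry a timestamp in their high bits, so unreversed keys would
// still pile up at the end of the table.
import org.apache.hadoop.hbase.util.Bytes;

public class V2Key {
  public static byte[] rowKey(long id) {
    return Bytes.toBytes(Long.reverse(id));  // random-ish prefix spreads writes
  }
}
```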
  • 16. Pig
  • 17. Why Pig?
    • Much faster to write than vanilla MR
    • Step-by-step, iterative expression of data flows is intuitive to programmers
    • SQL support coming for those who prefer SQL (PIG-824)
    • Trivial to write UDFs (see the sketch after this slide)
    • Easy to write Loaders (Even better with 0.7!)
    • For example, we can write Protobuf and HBase loaders...
    • Both in Elephant-Bird
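To back up the "trivial" claim, here is a complete UDF against Pig's EvalFunc extension point; the function itself is a hypothetical example:

```java
// A complete Pig UDF: lowercase a chararray. Hypothetical example, but
// EvalFunc is the real Pig UDF API.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToLower extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) return null;
    String s = (String) input.get(0);
    return s == null ? null : s.toLowerCase();
  }
}
```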
  • 18. HBase Loader enhancements
    • Data expected to be binary, not String representations
    • Push down key range filters
    • Specify row caching (memory / speed tradeoff)
    • Optionally load the key
    • Optionally limit rows per region
    • Report progress
    • Haven't observed significant overhead vs. raw HBase scanning (scan setup sketched after this slide)
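These options correspond to knobs on a plain HBase Scan; a sketch, with the key range, caching value, and projected family as hypothetical placeholders:

```java
// Sketch of the scan-side knobs: key-range pushdown, row caching, projection.
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LoaderScan {
  public static Scan makeScan(byte[] startKey, byte[] stopKey) {
    Scan scan = new Scan(startKey, stopKey);     // push the key range down
    scan.setCaching(1000);                       // rows per RPC: memory vs. speed
    scan.addFamily(Bytes.toBytes("protobuf"));   // fetch only the needed CF
    return scan;
  }
}
```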
  • 19. HBase Loader TODOs
    • Expose better control of filters
    • Expose timestamp controls
    • Expose Index hints (IHBase)
    • Automated filter and projection push-down (once on 0.7)
    • HBase storage (writing, not just loading)
  • 20. Elephant Bird
    • Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers
    • Starting to work on the same for Thrift
    • HBase Loader
    • Assorted UDFs
    • http://www.github.com/kevinweil/elephant-bird
  • 21. Assorted Tips
  • 22. Bad records kill jobs
    • Big data is messy.
    • Catch exceptions, increment a counter, return null (see the sketch after this slide)
    • Deal with the resulting nulls downstream
    • Far preferable to a single bad record bringing down the whole job
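A sketch of the pattern in a Pig UDF; the warning enum and the parse logic are hypothetical, but EvalFunc.warn is Pig's real aggregated-warning hook:

```java
// Catch, count, return null: one bad record becomes a counter tick, not a
// dead job. The enum and the parsing done here are hypothetical.
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ParseUserId extends EvalFunc<Long> {
  public enum Warnings { BAD_RECORD }

  @Override
  public Long exec(Tuple input) throws IOException {
    try {
      return Long.parseLong((String) input.get(0));
    } catch (Exception e) {
      warn("bad record: " + e, Warnings.BAD_RECORD);  // increments a counter
      return null;  // downstream operators must expect nulls
    }
  }
}
```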
  • 23. Runaway UDFs kill jobs
    • Regex over a few billion tweets, most return in milliseconds.
    • 8 cause the regex to take more than 5 minutes; the task gets reaped.
    • You clever twitterers, you.
    • A MonitoredUDF wrapper kicks off a monitoring thread, kills the UDF and returns a default value if it doesn't return something in time (see the sketch after this slide).
    • Plan to contribute to Pig and add to Elephant-Bird. May build into Pig internals.
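MonitoredUDF hadn't been released at the time of this talk, so this is only an illustrative sketch of the mechanism with hypothetical names: run the work on another thread and fall back to a default on timeout.

```java
// Illustrative sketch of the MonitoredUDF idea: bound a computation's runtime.
// All names are hypothetical; note that cancel(true) can only stop work that
// checks its interrupt flag, which a plain regex match does not.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {
  private static final ExecutorService POOL = Executors.newCachedThreadPool();

  public static <T> T callWithTimeout(Callable<T> work, long secs, T fallback) {
    Future<T> future = POOL.submit(work);
    try {
      return future.get(secs, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      future.cancel(true);  // give up on the runaway call
      return fallback;      // keep the task alive instead of getting reaped
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```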
  • 24. Use Counters
    • Use counters. Count everything.
    • UDF invocations, parsed records, unparsable records, timed-out UDFs...
    • Hook into cleanup phases and store counters to disk, next to the data, for future analysis (sketched after this slide)
    • Don't have this for Pig yet, but 0.8 adds metadata to job confs to make it possible.
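A sketch of the store-counters-next-to-the-data idea for a plain MapReduce job; the "_counters" file name is a made-up placeholder:

```java
// After a job completes, dump all its counters into a file alongside the
// output so later analysis can pick them up.
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;

public class SaveCounters {
  public static void save(Job job, Path outputDir) throws Exception {
    FileSystem fs = FileSystem.get(job.getConfiguration());
    FSDataOutputStream out = fs.create(new Path(outputDir, "_counters"));
    for (CounterGroup group : job.getCounters()) {
      for (Counter c : group) {
        out.writeBytes(group.getName() + "\t" + c.getName() + "\t" + c.getValue() + "\n");
      }
    }
    out.close();
  }
}
```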
  • 25. Lazy deserialization FTW
    • At first: converted Protocol Buffers into Pig tuples at read time.
    • Moved to a Tuple wrapper that deserializes fields upon request (idea sketched after this slide)
    • Huge performance boost for wide tables with only a few used columns
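A sketch of the idea only; the real wrapper implements Pig's Tuple interface, whereas this hypothetical class just shows the deferred, memoized parse:

```java
// Hold the raw bytes; pay the protobuf parse cost only if a field is
// actually requested, and only once. Names here are hypothetical.
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Message;
import com.google.protobuf.Parser;

public class LazyProtoRecord<M extends Message> {
  private final byte[] raw;
  private final Parser<M> parser;
  private M parsed;  // null until some field is asked for

  public LazyProtoRecord(byte[] raw, Parser<M> parser) {
    this.raw = raw;
    this.parser = parser;
  }

  public Object getField(FieldDescriptor field) throws Exception {
    if (parsed == null) {
      parsed = parser.parseFrom(raw);  // deferred until first access
    }
    return parsed.getField(field);
  }
}
```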
  • 26. Also see
    • http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
    • http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
    • http://www.slideshare.net/al3x/building-distributed-systems-in-scala
    • http://www.slideshare.net/ryansking/scaling-twitter-with-cassandra
    • http://www.slideshare.net/nkallen/q-con-3770885
  • 27. Questions? Follow me at twitter.com/squarecog
  • 28. Photo Credits
    • Bingo: http://www.flickr.com/photos/hownowdesign/2393662713/
    • Sandhill Crane: http://www.flickr.com/photos/dianeham/123491289/
    • Oakland Cranes: http://www.flickr.com/photos/clankennedy/2654213672/