Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem at Twitter, Dmitriy Ryaboy, Twitter
    Presentation Transcript

    • Hadoop, Pig, HBase at Twitter
        • Dmitriy Ryaboy
        • Twitter Analytics
        • @squarecog
    • Who is this guy, anyway?
      • LBNL : Genome alignment & analysis
      • : Click log data warehousing
      • CMU : MS in “Very Large Information Systems”
      • Cloudera : graduate student intern
      • Twitter : Hadoop, Pig, Big Data, ...
      • Pig committer.
    • In This Talk
      • Focus on Hadoop parts of data pipeline
      • Data movement
      • HBase
      • Pig
      • A few tips
    • Not In This Talk
      • Cassandra
      • FlockDB
      • Gizzard
      • Memcached
      • Rest of Twitter’s NoSQL Bingo card
    • Daily workload
      • 1000s of Front End machines
      • 3 Billion API requests
      • 7 TB of ingested data
      • 20,000 Hadoop jobs
      • 55 Million tweets
      • Tweets only 0.5% of data
    • Twitter data pipeline (simplified)
      • Front Ends update DB cluster. Scheduled DB exports to HDFS
      • Front Ends, Middleware, Backend services write logs
      • Scribe pipes logs straight into HDFS
      • Various other data source exports into HDFS
      • Daemons populate work queues as new data shows up
      • Daemons (and cron) pull work off queues, schedule MR and Pig jobs
      • Pig wrapper pushes results into MySQL for reports and dashboards
    • Logs
      • Apache HTTP, W3C, JSON and Protocol Buffers
      • Each category goes into its own directory on HDFS
      • Everything is LZO compressed.
      • You need to index LZO files to make them splittable.
      • We use a patched version of Hadoop LZO libraries
      • See
    • Tables
      • Users, tweets, geotags, trends, registered devices, etc
      • Automatic generation of protocol buffer definitions from SQL tables
      • Automatic generation of Hadoop Writables, Input / Output formats, Pig loaders from protocol buffers
      • See Elephant-Bird:
    • ETL
      • "Crane", config driven, protocol buffer powered.
      • Sources/Sinks: HDFS, HBase, MySQL tables, web services
      • Protobuf-based transformations: chain sets of <input proto, output proto, transformation class>
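Crane itself isn't shown in the slides, so as a minimal sketch of the chain-of-transformations idea — each config entry maps one record type to the next, and the stages run in the order the config lists them — here is an illustrative plain-Java version (all names hypothetical, with `String` standing in for protocol buffer types):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch, not Crane's actual code: each stage corresponds to one
// <input proto, output proto, transformation class> entry from the config,
// and records flow through the stages in config order.
public class TransformChain {
    private final List<Function<Object, Object>> stages = new ArrayList<>();

    public TransformChain add(Function<Object, Object> stage) {
        stages.add(stage);
        return this;
    }

    // Run one record through every configured stage in order.
    public Object apply(Object record) {
        Object current = record;
        for (Function<Object, Object> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        TransformChain chain = new TransformChain()
                .add(r -> ((String) r).trim())         // e.g. RawLogProto -> CleanLogProto
                .add(r -> ((String) r).toUpperCase()); // e.g. CleanLogProto -> ReportProto
        System.out.println(chain.apply("  hello  ")); // prints HELLO
    }
}
```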
    • HBase
    • Mutability
      • Logs are immutable; HDFS is great.
      • Tables have mutable data.
      • Ignore updates? bad data
      • Pull updates, resolve at read time? Pain, time.
      • Pull updates, resolve in batches? Pain, time.
      • Let someone else do the resolving? Helloooo, HBase!
      • Bonus: various NoSQL bonuses, "not just scans". Lookups, indexes.
      • Warning: we just started with HBase. This is all preliminary. Haven't tried indexes yet.
      • That being said, several services rely on HBase already.
    • Aren't you guys Cassandra poster boys?
      • YES but
      • Rough analogy: Cassandra is OLTP and HBase is OLAP
      • Cassandra used when we need low-latency, single-key reads and writes
      • HBase scans much more powerful
      • HBase co-locates data on the Hadoop cluster.
    • HBase schema for MySQL exports, v1.
      • Want to query by created_at range, by updated_at range, and / or by user_id.
      • Key: [created_at, id]
      • CF: "columns"
      • Configs specify which columns to pull out and store explicitly.
      • Useful for indexing, cheap (HBase-side) filtering
      • CF: "protobuf"
      • A single column, contains serialized protocol buffer.
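The v1 key can be sketched as a fixed-width big-endian concatenation of `[created_at, id]`, so HBase's lexicographic row-key order matches chronological order and `created_at` range scans stay cheap. This is an assumed encoding for illustration, not Twitter's actual code:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of the v1 row key: big-endian [created_at, id]. Because HBase sorts
// row keys as unsigned bytes, big-endian fixed-width longs make lexicographic
// order equal chronological order.
public class V1Key {
    public static byte[] rowKey(long createdAt, long id) {
        return ByteBuffer.allocate(16).putLong(createdAt).putLong(id).array();
    }

    public static void main(String[] args) {
        byte[] earlier = rowKey(1000L, 42L);
        byte[] later   = rowKey(2000L, 7L);
        // Unsigned lexicographic comparison, as HBase compares row keys.
        System.out.println(Arrays.compareUnsigned(earlier, later) < 0); // true
    }
}
```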
    • HBase schema v1, cont.
      • Pro: easy to query by created_at range
      • Con: hard to pull out specific users (requires a full scan)
      • Con: hot spot at the last region for writes
      • Idea: put created_at into 'columns' CF, make user_id key
      • BUT ids mostly sequential; still a hot spot at the end of the table
      • Transitioning to non-sequential ids; but their high bits are creation timestamp! Same problem.
    • HBase schema, v2.
      • Key: inverted Id. Bottom bits are random. Ahh, finally, distribution.
      • Date range queries: new CF, 'time'
      • keep all versions of this CF
      • When specific time range needed, use index on the time column
      • Keeping time in separate CF allows us to keep track of every time the record got updated, without storing all versions of the record
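One simple way to picture the inverted-id trick (a sketch, not necessarily the exact transform Twitter uses): bit-reversing the id moves its fast-changing low bits to the top of the key, so consecutive ids land far apart in the key space instead of hammering the last region:

```java
// Illustrative only: sequential ids (or snowflake-style ids whose high bits
// are a creation timestamp) all sort to the end of the table. Reversing the
// bit order puts the rapidly varying low bits first, spreading writes.
public class V2Key {
    public static long invertedKey(long id) {
        return Long.reverse(id);
    }

    public static void main(String[] args) {
        // Three consecutive ids scatter across the key space after inversion.
        for (long id = 100; id < 103; id++) {
            System.out.printf("%d -> %016x%n", id, invertedKey(id));
        }
    }
}
```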
    • Pig
    • Why Pig?
      • Much faster to write than vanilla MR
      • Step-by-step iterative expression of data flows intuitive to programmers
      • SQL support coming for those who prefer SQL (PIG-824)
      • Trivial to write UDFs
      • Easy to write Loaders (Even better with 0.7!)
      • For example, we can write Protobuf and HBase loaders...
      • Both in Elephant-Bird
    • HBase Loader enhancements
      • Data expected to be binary, not String representations
      • Push down key range filters
      • Specify row caching (memory / speed tradeoff)
      • Optionally load the key
      • Optionally limit rows per region
      • Report progress
      • Haven't observed significant overhead vs. HBase scanning
    • HBase Loader TODOs
      • Expose better control of filters
      • Expose timestamp controls
      • Expose Index hints (IHBase)
      • Automated filter and projection push-down (once on 0.7)
      • HBase Storage
    • Elephant Bird
      • Auto-generate Hadoop Input/Output formats, Writables, Pig loaders for Protocol Buffers
      • Starting to work on same for Thrift
      • HBase Loader
      • assorted UDFs
    • Assorted Tips
    • Bad records kill jobs
      • Big data is messy.
      • Catch exceptions, increment counter, return null
      • Deal with potential nulls
      • Far preferable to a single bad record bringing down the whole job
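The catch-count-return-null pattern can be sketched like this; in a real Pig UDF or MapReduce job the counter would be a Hadoop counter, and this standalone version just uses an `AtomicLong` (class and field names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of "catch exceptions, increment counter, return null": a malformed
// record increments a counter and yields null instead of killing the job.
public class SafeParser {
    static final AtomicLong badRecords = new AtomicLong();

    public static Long parse(String record) {
        try {
            return Long.parseLong(record.trim());
        } catch (Exception e) {
            badRecords.incrementAndGet(); // count it so bad data stays visible
            return null;                  // downstream code must handle nulls
        }
    }

    public static void main(String[] args) {
        for (String r : new String[] {"1", "oops", "3"}) {
            System.out.println(parse(r));
        }
        System.out.println("bad records: " + badRecords.get());
    }
}
```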
    • Runaway UDFs kill jobs
      • Regex over a few billion tweets, most return in milliseconds.
      • 8 cause the regex to take more than 5 minutes, task gets reaped.
      • You clever twitterers, you.
      • MonitoredUDF wrapper kicks off a monitoring thread, kills a UDF and returns a default value if it doesn't return something in time.
      • Plan to contribute to Pig, add to ElephantBird. May build into Pig internals.
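The MonitoredUDF idea — evaluate under a time budget, substitute a default on overrun — can be sketched with a plain `Future` timeout. This is not Pig's or Elephant-Bird's API, just the underlying pattern:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a monitored call: run the UDF body on a worker thread and fall
// back to a default value if it doesn't return within the budget.
public class MonitoredCall {
    private static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // don't keep the JVM alive for monitor threads
        return t;
    });

    public static <T> T callWithTimeout(Callable<T> udf, long timeoutMs, T defaultValue) {
        Future<T> future = pool.submit(udf);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the runaway evaluation
            return defaultValue;
        } catch (Exception e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "ok", 1000, "default");
        String slow = callWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 50, "default");
        System.out.println(fast + " " + slow); // ok default
    }
}
```

Note that `cancel(true)` only interrupts; a regex match doesn't check the interrupt flag, which is part of why runaway UDFs are hard to stop cleanly.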
    • Use Counters
      • Use counters. Count everything.
      • UDF invocations, parsed records, unparsable records, timed-out UDFs...
      • Hook into cleanup phases and store counters to disk, next to data, for future analysis
      • Don't have it for Pig yet, but 0.8 adds metadata to job confs to make this possible.
      • At first: converted Protocol Buffers into Pig tuples at read time.
      • Moved to a Tuple wrapper that deserializes fields upon request.
      • Huge performance boost for wide tables with only a few used columns
      • Lazy deserialization FTW
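The lazy-tuple wrapper can be sketched as follows: keep the serialized bytes and decode a field only the first time it is requested, so a query touching a few columns of a wide record never pays for the rest. The field layout here is a stand-in for a real protocol buffer, and all names are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy deserialization: fields stay as raw bytes until asked for,
// and each field is decoded at most once.
public class LazyRecord {
    private final Map<String, byte[]> raw;                  // serialized fields
    private final Map<String, String> decoded = new HashMap<>();
    int decodeCount = 0;                                    // exposed for illustration

    public LazyRecord(Map<String, byte[]> raw) { this.raw = raw; }

    public String get(String field) {
        if (!decoded.containsKey(field)) {
            decodeCount++;                                  // decode on first access only
            decoded.put(field, new String(raw.get(field), StandardCharsets.UTF_8));
        }
        return decoded.get(field);
    }

    public static void main(String[] args) {
        Map<String, byte[]> fields = new HashMap<>();
        fields.put("user", "squarecog".getBytes(StandardCharsets.UTF_8));
        fields.put("text", "hello".getBytes(StandardCharsets.UTF_8));
        LazyRecord r = new LazyRecord(fields);
        r.get("user");
        r.get("user");                        // second read hits the cache
        System.out.println(r.decodeCount);    // 1: "text" was never decoded
    }
}
```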
    • Also see
    • Questions? Follow me at @squarecog
    • Photo Credits
      • Bingo:
      • Sandhill Crane:
      • Oakland Cranes: