Data Evolution in HBase

Speakers: Eric Czech and Alec Zopf (Next Big Sound)

Managing the evolution of data within HBase over time is not easy: Data resulting from Hadoop processing pipelines or otherwise placed in HBase is subject to the same kinds of oversights, bugs, and faulty assumptions inherent to the software that creates it. While the development of this software is often effectively managed through revision control systems, data itself is rarely modeled in a way that affords the same flexibility. In this session, we'll talk about how to build a versioned, time-series data store using HBase that can provide significantly greater adaptability and performance than similar systems.

Published in: Software, Technology

  1. Data Evolution In HBase: Building a Data “Development” Platform
     Eric Czech & Alec Zopf, Next Big Sound (eric@nextbigsound.com)
     HBaseCon - Case Studies Track, May 5, 2014
     © 2009 - 2014 Next Big Sound, Inc.
  2. Intro
     • Eric Czech - Chief Architect. Previously worked for the infrastructure team at a quantitative hedge fund
     • Alec Zopf - Senior Data Engineer. Previously worked on an algorithmic futures and options trading platform
  3. Agenda
     • Data & Architecture
     • Data Aggregation - Why no tools help us
     • Data Development (HBlocks) - Our platform for making it happen
     • A Practical Example
  4. Data Sources
     Next Big Sound marries billions of public social data points with customers’ internal transactional data. Public sources include 3+ years of historical and competitive data for hundreds of thousands of artists and millions of songs.
     Sources (grouped on the slide under Sales, Streaming & Social, and Misc): iTunes, Physical Sales, Amazon, Sitecatalyst, Facebook, Facebook Insights, Last.fm, Pandora, Rdio, ReverbNation, SoundCloud, Tumblr, Google Analytics, Wikipedia, Tunesat, Mediabase, Spotify, Twitter, Vevo, Vimeo, YouTube, YouTube Analytics, Deezer, Instagram
  7. Charts Licensed to Billboard
     In Billboard’s 118-year history, they’ve licensed data from two providers: Nielsen in 1991 and Next Big Sound in 2010.
  8. Architecture & Stats
     • Data collected from 60+ sources
     • 1M artists, 10M tracks
     • 10s of billions of records
     • CDH 4.3.0
     • 48-node Hadoop cluster for a 35TB dataset
     • No licensing costs
     • Giant counting machine!
  9. Data Aggregation
     • HDFS: Stores raw fact tables and copies of dimension tables from MySQL
     • Oozie/Pig: Runs incremental joins of fact and dimension tables
     • HBase: Stores timeseries aggregations for random access (NOT using counters)
  10. Raw Fact Data (HDFS) → Cube/Rollup Operations (Pig) → Aggregate Tables (HBase) (and many more...)
  11. Other Solutions
      Are there better ways to just count things? Yes! Lots:
      • OpenTSDB
      • Summingbird (Twitter)
      • DataFu Hourglass (LinkedIn)
      • Blueflood (Rackspace)
      • Oozie Coordinators
      • Apache Accumulo
      • Hadoop + Voldemort
      • MongoDB Incremental MapReduce
      • TempoDB & InfluxDB (hosted services)
      • KairosDB (originally built on Cassandra)
      • Amazon EMR/Redshift
      • Cassandra/Redis/Riak/HBase Counters
  12. Considerations
      • Scalability
      • Cost
      • Performance
      • Client Libraries
      • I/O Characteristics
      • Optimal Hardware
      • Config Overhead
      • Language
      • Community
      • Data Model
      • Monitoring/Alerting
      • Documentation
      • Support
      • Learning Curve
  13. One More Thing...
      What about mistakes?! Data “bugs” are nearly impossible to predict and can screw you in unimaginable ways.
  14. Data Bugs
      • Why are fan counts in Schenectady, NY 1000% higher than everywhere else? The data source uses ZIP code 12345 as the default for new users’ locations.
      • Why are radio station play numbers recently all multiples of 2 or 3? Data was delivered several times and we had no idea.
      • Why is the number of songs sold 3% too high? We didn't account for returns.
      • Why are all the page view spikes 8 hours after they should be? We assumed UTC timestamps instead of PST.
      Hundreds of these! ... that we know of
  15. Minor Data Bugs
      Georgia != Georgia
  16. Or maybe not...
      • Can we just fix the code and re-aggregate? NO, there’s no guarantee that the bad data is overwritten.
      • Can we do the aggregations “on-the-fly”? NO, we’re not using a relational model for good reason.
      • Can we rebuild everything in new tables? NO, we’d need 2x storage to fix < .0001% of the data.
  17. Learning the Hard Way
      Fixing data bugs online is terrifying. “Ad-hoc” updates to production datasets are:
      • Dangerous and complicated
      • Difficult to generalize
      • Time-consuming to test
      • A huge database I/O burden
  18. Back To Solutions
      What if each dataset had multiple versions?
      ... and we could focus on small pieces
      ... with alpha/beta/stable tags
      ... where users only see what they should
      Feels familiar
  19. HBlocks
      Our solution for large-scale revision control
      • Spans HDFS, Hive, Pig, and HBase
      • Arbitrary versioning of data subsets
      • Incremental processing, full-scale re-processing, and everything in between
      • Append-only model (deletes in background)
  20. The Basics
      • Each raw file has an ID (e.g. “block_1”)
      • Each ID has versions (ID & version are stored in HBase)
      • Version state is used to filter results
  21. Data Development
      Version “States” control the data lifecycle, from birth to death:
      • PENDING: New data for ETL pipeline
      • PROCESSING: Data currently being processed
      • ALPHA: Developers only
      • BETA: Privileged users
      • STABLE: Everybody
      • HIDDEN: Ignored (but still in HBase)
      • DELETED: Removed permanently
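One way to picture how version states could filter results is sketched below. It is only an illustration under stated assumptions: the state names come from the slide, but the `Audience` enum, the `VersionVisibility` class, and the exact mapping of audiences to readable states are hypothetical and are not taken from HBlocks itself.

```java
import java.util.EnumSet;
import java.util.Set;

// State names as listed on the slide.
enum VersionState {
    PENDING, PROCESSING, ALPHA, BETA, STABLE, HIDDEN, DELETED
}

// Hypothetical audiences, from most to least privileged.
enum Audience {
    DEVELOPER, PRIVILEGED_USER, EVERYBODY
}

final class VersionVisibility {
    private VersionVisibility() {}

    // Assumed mapping: each audience also sees everything a less privileged one sees.
    static Set<VersionState> readableStates(Audience audience) {
        switch (audience) {
            case DEVELOPER:
                return EnumSet.of(VersionState.ALPHA, VersionState.BETA, VersionState.STABLE);
            case PRIVILEGED_USER:
                return EnumSet.of(VersionState.BETA, VersionState.STABLE);
            default:
                return EnumSet.of(VersionState.STABLE);
        }
    }

    // True when a record in the given state should be returned to the audience.
    static boolean isVisible(VersionState state, Audience audience) {
        return readableStates(audience).contains(state);
    }
}
```

Under this assumed mapping, PENDING, PROCESSING, HIDDEN, and DELETED versions are never returned to any reader, which matches the slide's description of HIDDEN data as ignored but still stored.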
  22. A Practical Example
      Tracking the number of English-language Wikipedia page views for Hadoop
      So we’ll track this site: http://en.wikipedia.org/wiki/Apache_Hadoop
      Using this data: http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/
  23. The Dataset
      http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/
      Contains ~100MB compressed files for each hour
      All pageviews for Jan 1, 2014: pagecounts-20140101-*.gz
  24. File Uploads
      Files downloaded anywhere ...
      user@host001> ls wikipedia
      pagecounts-20140101.gz
      pagecounts-20140102.gz
      ...
      pagecounts-20140131.gz
      ... and uploaded to HDFS
      user@host001> for file in `ls wikipedia`
      do
        hblocks upload -file $file -source wikipedia
      done
  25. File Metadata
      HDFS files registered in HBlocks metadata:
      user@host001> hblocks list -source wikipedia
      +---------------------------------------------------------+
      | hblock_id | hblock_name         | source    | version:1 |
      +---------------------------------------------------------+
      | 2935      | pagecounts-20140101 | wikipedia | PENDING   |
      | 2936      | pagecounts-20140102 | wikipedia | PENDING   |
      ...
      | 3678      | pagecounts-20140131 | wikipedia | PENDING   |
      +---------------------------------------------------------+
      Table contains 31 row(s)
      “PENDING” state indicates availability for Pig scripts
  26. Run It!
      Now, let's do some aggregating:
      user@host001> hblocks aggregate -source wikipedia
      Pig script writes results to HBase:
      user@host001> hblocks query -table page_views
      +-------------------------------------------------------------------+
      | hblock_id | version | language | page          | date     | value |
      +-------------------------------------------------------------------+
      | 2935      | 1       | en       | Apache_Hadoop | 20140101 | 283   |
      ...
      | 2935      | 1       | En       | Apache_Hadoop | 20140131 | 2     |
      | 2935      | 1       | en.mw    | Apache_Hadoop | 20140131 | 3     |
      Wtf is this!? (the “En” and “en.mw” rows)
  27. What Happened?
      On January 20th:
      • “Sub” languages (e.g. ‘en.mw’) introduced
      • Capitalized languages (e.g. ‘En’) also added
      • Aggregation script starts ignoring small % of records
      * fictitious problems - these language values are real but were not introduced in January
  28. Effects Over Time
      Aggregation process misses new languages, causing a slight drop in values
  29. Fix It!
      Create new versions for each file affected:
      user@host001> hblocks rebuild -source wikipedia -regex '.*201401(2|3).*'
      Old versions “STABLE”, new versions “PENDING”:
      user@host001> hblocks list -source wikipedia
      +---------------------------------------------------------------------+
      | hblock_id | hblock_name         | source    | version:1 | version:2 |
      +---------------------------------------------------------------------+
      | 2935      | pagecounts-20140101 | wikipedia | STABLE    |           |
      | 2935      | pagecounts-20140102 | wikipedia | STABLE    |           |
      ...
      | 2936      | pagecounts-20140120 | wikipedia | STABLE    | PENDING   |
      | 2936      | pagecounts-20140121 | wikipedia | STABLE    | PENDING   |
      ...
      | 3678      | pagecounts-20140131 | wikipedia | STABLE    | PENDING   |
      +---------------------------------------------------------------------+
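To see which files the `-regex` argument would pick up, the pattern from the slide can be tried against the file names directly. This is a minimal sketch that assumes the pattern must match the whole block name; how `hblocks` actually applies the regex is not stated in the slides, and `RebuildSelection` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RebuildSelection {
    private RebuildSelection() {}

    // Pattern copied from the slide: names containing "2014012" or "2014013".
    static final Pattern AFFECTED = Pattern.compile(".*201401(2|3).*");

    // Returns the names that the pattern matches in full (assumed semantics).
    static List<String> select(List<String> blockNames) {
        List<String> selected = new ArrayList<>();
        for (String name : blockNames) {
            if (AFFECTED.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

With the January file names, this selects pagecounts-20140120 through pagecounts-20140131 and leaves the files for January 1 to 19 untouched, which lines up with the blocks that receive a second version in the listing.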
  30. Fix It!
      Change the current aggregation code:
      String language = line.get("language");
      To handle case-sensitivity and use the first part before a “.”:
      String language = line.get("language")
          .split("\\.")[1]
          .toLowerCase();
  31. Run It Again
      Run the same aggregation for new versions:
      user@host001> hblocks aggregate -source wikipedia
      New results: We made it even worse!
  32. Revert
      Hurry, hide the bad data:
      user@host001> hblocks update_versions -source wikipedia -regex '.*201401(2|3).*' -state 'HIDDEN'
      Phew, back to where we started ... but what happened?
      Wrong: .split("\\.")[1]
      Should have been: .split("\\.")[0]
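The corrected normalization can be sketched as a small standalone helper. This assumes the aggregation code is Java, as the snippet on the slides suggests; `LanguageNormalizer` and `normalize` are hypothetical names used only for illustration. One detail worth noting is that Java's `String.split` takes a regular expression, so a literal dot has to be escaped.

```java
import java.util.Locale;

final class LanguageNormalizer {
    private LanguageNormalizer() {}

    // Keeps the part before the first '.', lower-cased: "en.mw" and "En" both become "en".
    static String normalize(String rawLanguage) {
        // The dot is escaped because split() treats its argument as a regex.
        // The limit of 2 keeps trailing empty strings, so index 0 always exists.
        String[] parts = rawLanguage.split("\\.", 2);
        // Index 0 is the base language; index 1 would be the sub-language, if any.
        return parts[0].toLowerCase(Locale.ROOT);
    }
}
```

Using index 1 instead of index 0 fails for plain values such as "en", which have no second part, and returns the sub-language rather than the language for values such as "en.mw".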
  33. Fix It Again (carefully)
      Rebuild aggregations in ‘beta’ state this time:
      user@host001> hblocks rebuild -source wikipedia -regex '.*201401(2|3).*' -state 'beta'
      After another hblocks aggregate, only developers see the new results: Looks good!
  34. Finishing Up
      Make the new data available for ALL users:
      user@host001> hblocks update_versions -source wikipedia -regex '.*201401(2|3).*' -state 'ACTIVE'
      Final state:
      user@host001> hblocks list -source wikipedia
      +---------------------------------------------------------------------------------+
      | hblock_id | hblock_name         | source    | version:1 | version:2 | version:3 |
      +---------------------------------------------------------------------------------+
      | 2935      | pagecounts-20140101 | wikipedia | STABLE    |           |           |
      | 2935      | pagecounts-20140102 | wikipedia | STABLE    |           |           |
      ...
      | 2936      | pagecounts-20140120 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
      | 2936      | pagecounts-20140121 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
      ...
      | 3678      | pagecounts-20140131 | wikipedia | HIDDEN    | HIDDEN    | STABLE    |
      +---------------------------------------------------------------------------------+
  35. HBase Schema
      • Keys: Primary Dimensions, HBlock Id, Time 0
      • Columns: Secondary Dimensions, Time 1, HBlockVersion Id
      • Values: Time 2.0, Value 0, ..., Time 2.N, Value N
      • Timestamps: Schema #, Insertion Time (secs), Value Data Type
  36. HBase Keys/Columns
      • Keys: Primary Dimensions, HBlock Id, Time 0
      • Columns: Secondary Dimensions, Time 1, HBlockVersion Id
      • Concatenated string ids (artists, tracks & metrics)
      • Times split into offsets (limits row width)
      • Queried in bulk (demographics & zip codes)
      • HBlocks metadata determines record “state”
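The idea of splitting a timestamp across the key, the column, and the value can be sketched as follows. Everything concrete here is an assumption made for illustration: the slides do not give the bucket sizes, the key separator, or any class names, so `TimeSplit`, the 30-day and 1-day buckets, the hourly unit, and the `:` separator are all hypothetical.

```java
final class TimeSplit {
    // Assumed bucket sizes; the slides do not state the real ones.
    static final long KEY_BUCKET_SECS = 30L * 24 * 3600; // "Time 0": coarse bucket in the row key
    static final long COLUMN_BUCKET_SECS = 24L * 3600;   // "Time 1": offset stored in the column
    static final long VALUE_UNIT_SECS = 3600L;           // "Time 2": offset stored next to each value

    final long keyBucket;
    final int columnOffset;
    final int valueOffset;

    // Splits a non-negative epoch timestamp (seconds) into the three parts.
    TimeSplit(long epochSecs) {
        long withinKey = epochSecs % KEY_BUCKET_SECS;
        this.keyBucket = epochSecs / KEY_BUCKET_SECS;
        this.columnOffset = (int) (withinKey / COLUMN_BUCKET_SECS);
        this.valueOffset = (int) ((withinKey % COLUMN_BUCKET_SECS) / VALUE_UNIT_SECS);
    }

    // Rebuilds the timestamp, truncated to VALUE_UNIT_SECS.
    long toEpochSecs() {
        return keyBucket * KEY_BUCKET_SECS
                + columnOffset * COLUMN_BUCKET_SECS
                + valueOffset * VALUE_UNIT_SECS;
    }

    // Hypothetical row key layout: concatenated dimension ids, block id, coarse time bucket.
    static String rowKey(String artistId, String metricId, long hblockId, TimeSplit t) {
        return artistId + ":" + metricId + ":" + hblockId + ":" + t.keyBucket;
    }
}
```

With these assumed sizes, one row covers 30 days, so the number of columns per row stays bounded, and the innermost offset (0 to 23) fits in a single byte.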
  37. HBase Values
      Values: Time 2.0, Value 0, ..., Time 2.N, Value N
      • Time offsets in values too: fixed width (single byte)
      • Values stored as VarInts: can be any width
      • Many values per cell: keeps key count lower, reducing MemStore size *
      * difficult without an append-only model like ours
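Packing many (time offset, value) pairs into one cell could look roughly like the sketch below. It assumes a one-byte offset followed by a base-128 variable-length integer with the low-order group first; the slides only say "VarInts", so the exact wire format, the `CellCodec` name, and the rule that a later pair for the same offset replaces an earlier one are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

final class CellCodec {
    private CellCodec() {}

    // Appends one pair to the cell: a single offset byte, then the value as a varint.
    static void append(ByteArrayOutputStream cell, int offset, long value) {
        if (offset < 0 || offset > 255) {
            throw new IllegalArgumentException("offset must fit in one byte");
        }
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        cell.write(offset);
        // Emit 7 bits at a time; the high bit marks "more bytes follow".
        while ((value & ~0x7FL) != 0) {
            cell.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        cell.write((int) value);
    }

    // Reads all pairs back; a later pair for the same offset replaces an earlier one.
    static Map<Integer, Long> decode(byte[] cell) {
        Map<Integer, Long> pairs = new LinkedHashMap<>();
        int i = 0;
        while (i < cell.length) {
            int offset = cell[i++] & 0xFF;
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = cell[i++];
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            pairs.put(offset, value);
        }
        return pairs;
    }
}
```

Because new pairs are only ever appended, a writer never has to read and rewrite the existing cell contents, which is one way to read the slide's remark that this layout is difficult without an append-only model.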
  38. Links
      • Architecture @ NBS - highscalability.com
      • HBlocks White Paper
      • Jobs @ NBS
      Alec Zopf: alec@nextbigsound.com
      Eric Czech: eric@nextbigsound.com
