Hfile

My slides for Huguk #7

Published in: Technology

  1. 1. Using HFile outside of HBase. Marc de Palol (marc@last.fm), Huguk #7, 19th November
  2. 2. Context: What’s Last.fm?
      Last.fm is:
      – A music discovery website,
      – powered by scrobbling,
      – that provides personalized radio,
      – and lots and lots of stats and information about artists.
  3. 3. You can find numbers all around!
  7. 7. How does this work? [Architecture diagram; component labels: Users, Web / API, Web Nodes, Memcache, Chartserver (x2), Scrobble Server (fed by "scrobble"), Hadoop.]
  9. 9. Closer look at this guy (chartserver)
      • Java (used to be PHP).
      • Thrift API.
      • Text file format + index.
      • Disk I/O is the problem.
      • Not only a Key-Value store.
      • It serves nearly all the data we generate with Hadoop.
  13. 13. File Format
      • Easy to grep / read from the command line.
      • Server is easy to implement & maintain.
      • Very fast thanks to the index. Very sparse though.
      • Disk space is not really an issue here. We can always get rid of old indexes.
      • Problem?
      • It takes more time to generate the index than to create the Data File in Hadoop.
      • Like... 6 times more.
      [Diagram: an Index File with entries "Key1 x Size", "Key2 x Size", ..., "KeyN x Size" separated by runs of zeros, next to a Data File listing each key's values one per line, from "Key1 Value 1" to "KeyN Value 7".]
  14. 14. Solution?
      • Move to HBase (or another data storage system)
        – Chartserver is not simply a key/value store.
        – Lots of people in Last.fm want to use different things, for different reasons.
          • Our ops team do not want (!!) to maintain several different NoSQL systems around.
        – This will take some time, some experimentation, benchmarks and diplomacy.
  15. 15. Our last meeting to decide which NoSQL database we should use. Sysadmins dressed in funny yellow outfit.
  17. 17. Requirements for the new file format:
      • Binary:
        – So it is smaller.
        – Store Thrift serialized data.
      • Compression friendly.
      • Self indexed:
        – We do not want an index file anymore.
      • Hadoop friendly:
        – Generated in Hadoop, we don’t want to preprocess it before serving.
      • Java/C++/Python friendly:
        – These are the languages used in the Data and M.I.R. teams.
        – Yeah, we still use C++.
  18. 18. HFile:
      [Diagram by Schubert Zhang, http://cloudepr.blogspot.com. File layout, top to bottom: Data Block 0..2; Meta Block 0..1 (optional, user-defined metadata starting with METABLOCKMAGIC); File Info (Size or ItemsNum, LASTKEY, AVG_KEY_LEN, AVG_VALUE_LEN, COMPARATOR); Data Index; Meta Index (optional); Trailer (fixed file trailer). Each data block is DATA BLOCK MAGIC (8B) followed by key-value records of the form KeyLen (int), ValLen (int), Key (byte[]), Value (byte[]). Each index starts with INDEX BLOCK MAGIC (8B); a data index entry is Offset (long), DataSize (int), KeyLen (vint), Key (byte[]); a meta index entry is Offset (long), MetaSize (int), MetaNameLen (vint), MetaName (byte[]).]
      • Based on Google’s SSTable (from Bigtable).
      • Keys and values are byte strings.
      • Keys are ordered.
      • Sequence of blocks.
      • Block index loaded into memory.
      • Can be queried with: hbase org.apache.hadoop.hbase.io.hfile.HFile
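  19. 19. HFile:
          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.io.hfile.HFile;
          import org.apache.hadoop.hbase.io.hfile.HFileScanner;
          import org.apache.hadoop.hbase.io.hfile.SimpleBlockCache;
          import org.apache.hadoop.hbase.util.Bytes;

          // Create an HFile reader from a file.
          HFile.Reader reader = new HFile.Reader(fs, filePath, new SimpleBlockCache(), true);
          // Load its info into memory.
          reader.loadFileInfo();
          // Get a scanner.
          HFileScanner scan = reader.getScanner(true, true);
          // Create the key we are interested in.
          KeyValue kvKey = new KeyValue(Bytes.toBytes(key), Bytes.toBytes("f"),...);
          // Check if the key is in the file (seekTo returns 0 only on an exact match).
          if (0 != scan.seekTo(kvKey.getKey())) {
              log.error("Couldn't find the key");
          } else {
              // getValue() returns a byte[], so convert it before logging.
              log.info("Value:" + Bytes.toStringBinary(scan.getKeyValue().getValue()));
          }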
  20. 20. Before coding... some tests.
  21. 21. Some tests (generating the datasets).
      Exactly the same contents in both: 16,395,747 keys (16 million) and 121,930,516 values (121 million).

                         Plain text format                     HFile
      Size               5.9 GB (2.4 GB data, 3.5 GB index)    2.8 GB
      Generation time    369 minutes (6 hours)                 25 minutes
      Encoding           Plain text                            Thrift serialized
  22. 22. Some tests (querying randomly).
      Querying with 1 million random keys.

                     Plain text format    HFile
      Total time     54 seconds           6.11 seconds
      mean           54 us                6.11 us
      stdev          403 us               108 us
      max            72700 us             95300 us
      min            40.2 us              3.35 us
  23. 23. Some tests (querying in order).
      Querying with all the keys (16 million).

                     Plain text format           HFile
      Total time     10,695 seconds (3 hours)    3,287 seconds (< 1 hour)
      mean           652 us                      201 us
      stdev          3120 us                     2320 us
      max            464000 us                   468000 us
      min            37.6 us                     5.29 us
  24. 24. Some tests (querying randomly).
  25. 25. Some tests (querying all the keys).
  26. 26. Merging HFile with Chartserver.
      • Changes in the Hadoop programs:
        – We just created a new program that translates a Sequence File to an HFile.
        – Shamelessly copied & pasted Todd Lipcon’s bulk load tool. [have a look at ‘Bibliography’]
      • Changes in Chartserver:
        – Know how to load the HFiles.
        – Know how to access them.
      • Status:
        – Not in production yet.
        – Finishing some JUnit tests.
  27. 27. That’s it. Any doubts? Oh... wait.
  29. 29. We are hiring! (http://www.last.fm/about/jobs)
      Data Scientist (x 2)
      Purpose & Background of Role: We're seeking two top notch data scientists with strong programming skills to join the small and very enthusiastic data and recommendations team at Last.fm. These two positions are full-time and based in London.
      Are you a superb data analyst as well as a hands-on implementer that understands the trade-offs of the memory hierarchy and is able to work around constraints in disk speed, memory size and CPU cycles? Are you familiar with all common data structures and their complexity? Do you take pride in being clever and solving difficult problems creatively? Are you full of ideas and always looking for new ways of making use out of data? Are you an advocate for data-driven development and fully capable of conducting a proper A/B test? Do you love music?
      Requirements:
      • Solid background in statistics and computer science
      • Highly fluent in Python and either C++ or Java (or both)
      • Comfortable with the Unix CLI and shell scripting
      • Passion for machine learning and data visualisation
      • Proficient with databases, both relational and non-relational
      • Experience with Hadoop and analysing terabyte-scale datasets
      • Familiar with data-driven development and split testing
      • Basic understanding of common web technologies
      • Track record in music information retrieval research is a plus
  30. 30. That’s it. Any doubts? marc@last.fm @lant
  31. 31. Bibliography.
      • HFile:
        – http://issues.apache.org/jira/browse/HBASE-1818
        – http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html
        – http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
      • Todd Lipcon’s Bulk load tool:
        – http://hbase.apache.org/docs/r0.89.20100726/bulk-loads.html
        – TRUNK/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
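
The HFile slide (18) describes ordered keys, a sequence of blocks, and a block index held in memory. A minimal sketch of that lookup path, assuming nothing beyond the Java standard library: it is not HBase's reader, and the class and method names (`HFileLayoutSketch`, `encodeBlock`, `findBlock`, `scanBlock`) are hypothetical. It keeps only the KeyLen (int), ValLen (int), Key, Value record layout from the diagram, and replaces the real index entries (offset, size, key) with an array of each block's first key. It also assumes ASCII keys, so that Java string order matches byte order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics the record layout on the slide, not HBase's real reader. */
class HFileLayoutSketch {

    /** Encodes sorted key/value pairs as [keyLen:int][valLen:int][key][value] records. */
    static byte[] encodeBlock(List<String[]> sortedPairs) {
        int size = 0;
        for (String[] kv : sortedPairs) {
            size += 8 + utf8(kv[0]).length + utf8(kv[1]).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (String[] kv : sortedPairs) {
            byte[] k = utf8(kv[0]);
            byte[] v = utf8(kv[1]);
            buf.putInt(k.length).putInt(v.length).put(k).put(v);
        }
        return buf.array();
    }

    /** Binary-searches the in-memory index (first key of each block) for the candidate block. */
    static int findBlock(String[] firstKeys, String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        // Exact hit: that block. Otherwise: the block just before the insertion point.
        // A result of -1 means the key sorts before every key in the file.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Scans one block record by record; returns the value, or null if the key is absent. */
    static String scanBlock(byte[] block, String key) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        byte[] wanted = utf8(key);
        while (buf.remaining() > 0) {
            byte[] k = new byte[buf.getInt()];
            byte[] v = new byte[buf.getInt()];
            buf.get(k);
            buf.get(v);
            if (Arrays.equals(k, wanted)) {
                return new String(v, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] blocks = {
            encodeBlock(Arrays.<String[]>asList(
                new String[] {"artist:a", "1"}, new String[] {"artist:c", "3"})),
            encodeBlock(Arrays.<String[]>asList(
                new String[] {"artist:m", "13"}, new String[] {"artist:z", "26"}))
        };
        String[] firstKeys = {"artist:a", "artist:m"};
        int block = findBlock(firstKeys, "artist:c");
        System.out.println(block < 0 ? null : scanBlock(blocks[block], "artist:c"));
    }
}
```

Only the small first-key array has to stay in memory; a lookup touches one block, which is the property that lets the format drop the separate index file from the old text format.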
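
The timing slides (21 to 23) report both total run times and per-query means, so the figures can be cross-checked with simple arithmetic: mean latency is total time divided by the number of queries. The sketch below does that check using only numbers from the slides; the class and method names (`QueryTimingCheck`, `meanMicros`, `speedup`) are hypothetical.

```java
/** Cross-checks the totals and means reported on the test slides. */
class QueryTimingCheck {

    /** Mean latency in microseconds for a run of `queries` lookups taking `totalSeconds`. */
    static double meanMicros(double totalSeconds, long queries) {
        return totalSeconds * 1_000_000.0 / queries;
    }

    /** How many times faster the second measurement is than the first. */
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        long randomQueries = 1_000_000L;   // slide 22
        long allKeys = 16_395_747L;        // slides 21 and 23

        System.out.printf("random, plain text: %.2f us%n", meanMicros(54.0, randomQueries));
        System.out.printf("random, HFile:      %.2f us%n", meanMicros(6.11, randomQueries));
        System.out.printf("ordered, plain text: %.2f us%n", meanMicros(10_695.0, allKeys));
        System.out.printf("ordered, HFile:      %.2f us%n", meanMicros(3_287.0, allKeys));

        System.out.printf("generation speedup: %.1fx%n", speedup(369.0, 25.0));
        System.out.printf("random query speedup: %.1fx%n", speedup(54.0, 6.11));
        System.out.printf("ordered query speedup: %.1fx%n", speedup(10_695.0, 3_287.0));
    }
}
```

The computed means land within about a microsecond of the reported ones (652 us and 201 us for the ordered run), which suggests the slide totals are rounded rather than inconsistent.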
