March 2012 HUG: JuteRC compiler

Yahoo’s data ETL pipeline processes tens of terabytes of data every day. Finding a storage format that can store and fetch this data efficiently has always been a challenge for the pipeline. A recent study inside Yahoo showed a dramatic reduction in data size when switching from the Sequence File format to the RC File format, so we decided to convert our data to RC File. The most challenging task is serializing the data objects by hand. We rely on Jute, the Hadoop Record Compiler, to provide serialization code, but Jute does not support the RC File format. In addition, the RC File format does not support native Hadoop Writable objects, so writing the serialization code becomes complicated and repetitive. We therefore built the JuteRC compiler, an extension to the Hadoop Record Compiler (Jute). It generates serialization/deserialization code for any user-defined primitive or composite data type. MapReduce programmers can plug the generated code directly into their jobs to produce output files in the RC File storage format. In our experiment against Yahoo audience data, using JuteRC gave a 26-28% file size reduction and a 40% read/write performance improvement compared to Sequence File. We are currently in the process of open-sourcing JuteRC.


Transcript

  • 1. Hadoop Jute RC Compiler
      Tanping Wang, Yahoo! User Data Analytics
  • 2. Agenda
      • How we use the Jute compiler today
      • The JuteRC compiler
      • What we have achieved by using JuteRC
      Yahoo! Presentation Template, Confidential 2 03/30/12
  • 3. Hadoop Record Compiler (Jute)
      • Generates serialization code to store data in the Sequence File format
      • Package: org.apache.hadoop.record
  • 4. How Do We Use Jute Today
      • Use the Data Definition Language to define a data type:

        class MyDataType {
          buffer myBuffer;
          long myLong;
        }

      • Use the Jute compiler to generate serialization code:

        $ rcc --language java mydatatype.jr
  • 5. Jute Generates Serialization Code for Me

        public void serialize(final org.apache.hadoop.record.RecordOutput _rio_a, final String _rio_tag)
            throws java.io.IOException {
          _rio_a.startRecord(this,_rio_tag);
          _rio_a.writeBuffer(myBuffer,"myBuffer");
          _rio_a.writeLong(myLong,"myLong");
          _rio_a.endRecord(this,_rio_tag);
        }

        private void deserializeWithoutFilter(final org.apache.hadoop.record.RecordInput _rio_a, final String _rio_tag)
            throws java.io.IOException {
          _rio_a.startRecord(_rio_tag);
          myBuffer=_rio_a.readBuffer("myBuffer");
          myLong=_rio_a.readLong("myLong");
          _rio_a.endRecord(_rio_tag);
        }
  • 6. Today
      • The Yahoo audience ETL pipeline processes tens of terabytes of data per day.
      • We rely on Jute.
      • We use Sequence File to store our data.
  • 7. However, We Want To Use RC Format.
  • 8. RC File Format
      • RCFile has much in common with Sequence File, but it splits a file into row groups. Inside each row group, the data is stored column by column, so the values of one column sit next to each other.
      • Values of similar data types are therefore grouped together, which potentially gives a better compression rate.
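The row-group layout described on slide 8 can be pictured with a small sketch. This is a toy model only, not the RCFile on-disk format: the class `RowGroupSketch`, its methods, and the sample values are hypothetical names made up for illustration, and rows are assumed to have equal width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy model of the row-group idea, not the real RCFile format.
final class RowGroupSketch {

    // Splits rows into groups of at most rowsPerGroup rows, then transposes each
    // group so that all values of one column are stored next to each other.
    static List<List<List<String>>> toRowGroups(List<List<String>> rows, int rowsPerGroup) {
        List<List<List<String>>> groups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += rowsPerGroup) {
            int end = Math.min(start + rowsPerGroup, rows.size());
            groups.add(transpose(rows.subList(start, end)));
        }
        return groups;
    }

    // Turns a list of rows into a list of columns (assumes every row has the same width).
    static List<List<String>> transpose(List<List<String>> rows) {
        List<List<String>> columns = new ArrayList<>();
        if (rows.isEmpty()) {
            return columns;
        }
        int width = rows.get(0).size();
        for (int c = 0; c < width; c++) {
            List<String> column = new ArrayList<>();
            for (List<String> row : rows) {
                column.add(row.get(c));
            }
            columns.add(column);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Three rows of (user id, counter); made-up sample values.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("u1", "100"),
            Arrays.asList("u2", "200"),
            Arrays.asList("u3", "300"));
        // Prints [[[u1, u2], [100, 200]], [[u3], [300]]]
        System.out.println(toRowGroups(rows, 2));
    }
}
```

With two rows per group, the ids `u1, u2` end up adjacent and so do the counters `100, 200`, which is the grouping of similar values that the slide credits for the better compression rate.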
  • 9. Jute only supports the Sequence File format, so we built the JuteRC compiler.
  • 10. Data Type
  • 11. Also…
      • For each JType, override genReadMethod and genWriteMethod.
      • Changed the CodeGenerator in Jute.
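To make the idea on slide 11 concrete, here is a minimal sketch of how overriding the two generator methods could change the emitted statements. The names `genReadMethod` and `genWriteMethod` come from the slide; everything else (`CodeGenSketch`, `FieldType`, `JuteStyle`, `JuteRcStyle`, and the method signatures) is a hypothetical stand-in and not the real Jute compiler classes. The emitted strings mirror the generated code shown on slides 12 and 13.

```java
// Illustrative stand-ins only: these are NOT the real Jute compiler classes.
final class CodeGenSketch {

    // One field type of the data definition, e.g. "Long" or "Buffer".
    abstract static class FieldType {
        final String methodSuffix;

        FieldType(String methodSuffix) {
            this.methodSuffix = methodSuffix;
        }

        abstract String genWriteMethod(String fieldName);

        abstract String genReadMethod(String fieldName);
    }

    // Emits Jute-style calls, which tag each field with its name.
    static final class JuteStyle extends FieldType {
        JuteStyle(String methodSuffix) {
            super(methodSuffix);
        }

        @Override
        String genWriteMethod(String f) {
            return "_rio_a.write" + methodSuffix + "(" + f + ",\"" + f + "\");";
        }

        @Override
        String genReadMethod(String f) {
            return f + "=_rio_a.read" + methodSuffix + "(\"" + f + "\");";
        }
    }

    // Emits JuteRC-style calls, which address each field by its column index.
    static final class JuteRcStyle extends FieldType {
        JuteRcStyle(String methodSuffix) {
            super(methodSuffix);
        }

        @Override
        String genWriteMethod(String f) {
            return "com.yahoo.ccdi.fetl.RcUtil.write" + methodSuffix + "(this, " + f + ", writeIndx++);";
        }

        @Override
        String genReadMethod(String f) {
            return f + "=com.yahoo.ccdi.fetl.RcUtil.read" + methodSuffix + "(bra, readIndx++);";
        }
    }

    public static void main(String[] args) {
        System.out.println(new JuteStyle("Long").genWriteMethod("myLong"));
        System.out.println(new JuteRcStyle("Long").genWriteMethod("myLong"));
        System.out.println(new JuteRcStyle("Buffer").genReadMethod("myBuffer"));
    }
}
```

The point of the sketch is only that the per-type override is where a field-name tag gets replaced by a running column index.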
  • 12. Serialization Code Generated by Jute vs. JuteRC
      Jute:

        public void serialize(final org.apache.hadoop.record.RecordOutput _rio_a, final String _rio_tag)
            throws java.io.IOException {
          _rio_a.startRecord(this,_rio_tag);
          _rio_a.writeBuffer(myBuffer,"myBuffer");
          _rio_a.writeLong(myLong,"myLong");
          _rio_a.endRecord(this,_rio_tag);
        }

      JuteRC:

        public class MyDataType extends org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable {
          public void serialize() {
            int writeIndx = 0;
            try {
              com.yahoo.ccdi.fetl.RcUtil.writeBuffer(this, myBuffer, writeIndx++);
              com.yahoo.ccdi.fetl.RcUtil.writeLong(this, myLong, writeIndx++);
            } catch(java.io.IOException e) { }
          }
  • 13. Deserialization Code
      Jute:

        private void deserializeWithoutFilter(final org.apache.hadoop.record.RecordInput _rio_a, final String _rio_tag)
            throws java.io.IOException {
          _rio_a.startRecord(_rio_tag);
          myBuffer=_rio_a.readBuffer("myBuffer");
          myLong=_rio_a.readLong("myLong");
          _rio_a.endRecord(_rio_tag);
        }

      JuteRC:

          public void deserialize(org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable bra) {
            int readIndx = 0;
            try {
              myBuffer=com.yahoo.ccdi.fetl.RcUtil.readBuffer(bra, readIndx++);
              myLong=com.yahoo.ccdi.fetl.RcUtil.readLong(bra, readIndx++);
            } catch(java.io.IOException e) { }
          }
        }
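The slides do not show the source of `com.yahoo.ccdi.fetl.RcUtil`, so the following is only a guess at the general shape of such helpers. To stay self-contained it uses a plain `List<byte[]>` as a stand-in for `BytesRefArrayWritable` (one byte array per column); the class `RcUtilSketch`, the 8-byte big-endian encoding of `long`, and the sample values are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for com.yahoo.ccdi.fetl.RcUtil; the real class is not shown in the slides.
// A List<byte[]> plays the role of BytesRefArrayWritable: one byte array per column.
final class RcUtilSketch {

    // Stores a long in column `index` as 8 big-endian bytes (assumed encoding).
    static void writeLong(List<byte[]> columns, long value, int index) {
        set(columns, index, ByteBuffer.allocate(Long.BYTES).putLong(value).array());
    }

    static long readLong(List<byte[]> columns, int index) {
        return ByteBuffer.wrap(columns.get(index)).getLong();
    }

    // Stores a copy of the raw bytes in column `index`.
    static void writeBuffer(List<byte[]> columns, byte[] value, int index) {
        set(columns, index, value.clone());
    }

    static byte[] readBuffer(List<byte[]> columns, int index) {
        return columns.get(index).clone();
    }

    // Grows the row with empty columns if needed, then sets the requested column.
    private static void set(List<byte[]> columns, int index, byte[] bytes) {
        while (columns.size() <= index) {
            columns.add(new byte[0]);
        }
        columns.set(index, bytes);
    }

    public static void main(String[] args) {
        // Write the two fields of MyDataType in declaration order, as on slide 12.
        List<byte[]> row = new ArrayList<>();
        int writeIndx = 0;
        writeBuffer(row, new byte[] {1, 2, 3}, writeIndx++);
        writeLong(row, 42L, writeIndx++);

        // Read them back in the same order, as on slide 13.
        int readIndx = 0;
        byte[] myBuffer = readBuffer(row, readIndx++);
        long myLong = readLong(row, readIndx++);
        System.out.println(Arrays.toString(myBuffer) + " " + myLong); // prints [1, 2, 3] 42
    }
}
```

Because fields are identified only by position, reads must use the same field order as writes, which is why the generated code increments `writeIndx` and `readIndx` in declaration order.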
  • 14. Using RC
      • Converting Sequence File data to RC format achieved a 26~28% file size reduction.
      • Faster I/O performance: reading/writing at 0.6X, i.e. the roughly 40% read/write improvement over Sequence File.
      • We process our data using both Hive and Pig on top of HCatalog.
  • 15. Open Source
      • We are in the process of open-sourcing JuteRC. It is under review by the Yahoo! Open Source Working Group.
      • MapReduce programmers can directly plug in the code generated by JuteRC and store their data in RC format.
  • 16. References
      • RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
        http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-
      • Hive RCFile:
        http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/ql/io/RCFile.ht