
Hadoop Jute Record Python


My talk for the Hadoop User Group, Nov 18 2009, about parsing Hadoop records using Python.

Published in: Technology, Business

Transcript of "Hadoop Jute Record Python"

  1. Hadoop Record Reader in Python
     HUG: Nov 18 2009
     Paul Tarjan
     http://paulisageek.com
     @ptarjan
     http://github.com/ptarjan/hadoop_record
  2. Hey Jute…
     Tabs and newlines are good and all
     For lots of data, don’t do that
  3. don’t make it bad...
     Hadoop has a native data storage format called Hadoop Record or “Jute”
     org.apache.hadoop.record
     http://en.wikipedia.org/wiki/Jute
  4. take a data structure…
     There is a Data Definition Language!
         module links {
             class Link {
                 ustring URL;
                 boolean isRelative;
                 ustring anchorText;
             };
         }
  5. and make it better…
     And a compiler
         $ rcc -lc++ inclrec.jr testrec.jr
         namespace inclrec {
             class RI :
                 public hadoop::Record {
             private:
                 int32_t I32;
                 double D;
                 std::string S;
  6. remember, to only use C++/Java
         $ rcc --help
         Usage: rcc --language [java|c++] ddl-files
  7. then you can start to make it better…
     I wanted it in Python
     Need 2 parts: a parsing library and a DDL translator
     I only did the first part
     If you need the second part, let me know
  8. Hey Jute don't be afraid…
  9. you were made to go out and get her…
     http://github.com/ptarjan/hadoop_record
  10. the minute you let her under your skin…
      I bet you thought I was done with “Hey Jude” references, eh?
      How I built it
      Ply == lex and yacc
      Parser == 234 lines including tests!
      Outputs generic data types
      You have to do the class transform yourself
      You can use my lex and yacc stuff in your language of choice
  11. and any time you feel the pain…
      Parsing the binary format is hard
      Vector vs struct???
          struct = "s{" record *("," record) "}"
          vector = "v{" [record *("," record)] "}"
      LazyString – don’t decode if not needed
      99% of my Hadoop time was decoding strings I didn’t need
      Binary on disk -> CSV -> Python == wasteful
      Hadoop unpacks zip files – name it .mod
  12. nanananana
      Future work
      DDL Converter
      Integrate it officially
      Record writer (should be easy)
      SequenceFileAsOutputFormat
      Integrate your feedback
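
The struct and vector rules on slide 11, together with the LazyString idea, can be sketched in a few lines of Python. This is not the Ply-based parser from the hadoop_record repository: it is a hand-written recursive-descent sketch, and the names `parse_record` and `_parse_items`, as well as this `LazyString` implementation, are hypothetical. It also assumes that every leaf field is simply the raw bytes up to the next "," or "}", and it ignores escaping and type prefixes.

```python
class LazyString:
    """Keeps the raw bytes and decodes them only when str() is first called.

    Hypothetical stand-in for the LazyString idea mentioned on the slide.
    """

    def __init__(self, raw: bytes):
        self.raw = raw
        self._text = None

    @property
    def decoded(self) -> bool:
        return self._text is not None

    def __str__(self) -> str:
        if self._text is None:
            self._text = self.raw.decode("utf-8")
        return self._text


def _parse_items(data: bytes, pos: int, allow_empty: bool):
    """Parse record *("," record) followed by "}"; return (items, next_pos)."""
    items = []
    if allow_empty and data.startswith(b"}", pos):
        return items, pos + 1
    while True:
        value, pos = parse_record(data, pos)
        items.append(value)
        if data.startswith(b",", pos):
            pos += 1
        elif data.startswith(b"}", pos):
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or '}}' at offset {pos}")


def parse_record(data: bytes, pos: int = 0):
    """Parse one record starting at pos; return (value, next_pos).

    Structs become tuples, vectors become lists, leaves become LazyString.
    """
    if data.startswith(b"s{", pos):
        items, pos = _parse_items(data, pos + 2, allow_empty=False)
        return tuple(items), pos
    if data.startswith(b"v{", pos):
        # Only the vector rule makes its contents optional.
        return _parse_items(data, pos + 2, allow_empty=True)
    # Assumed leaf encoding: raw bytes up to the next delimiter.
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


link, _ = parse_record(b"s{http://example.com,T,v{a,b},v{}}")
print(str(link[0]))  # only this field is ever decoded
```

Because leaves stay as undecoded bytes until `str()` is called, a job that reads one field out of a wide record skips the decoding cost for all the others, which is the saving the slide describes.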