Hadoop Record Reader In Python


Published in: Technology
  1. Hadoop Record Reader in Python<br />HUG: Nov 18 2009<br />Paul Tarjan<br />@ptarjan<br />
  2. Hey Jute…<br />Tabs and newlines are good and all<br />For lots of data, don’t do that<br />
  3. don’t make it bad...<br />Hadoop has a native data storage format called Hadoop Record or “Jute”<br />org.apache.hadoop.record<br />
  4. take a data structure…<br />There is a Data Definition Language!<br />module links {<br />  class Link {<br />    ustring URL;<br />    boolean isRelative;<br />    ustring anchorText;<br />  };<br />}<br />
  5. and make it better…<br />And a compiler<br />$ rcc -l c++ inclrec.jr testrec.jr<br />namespace inclrec {<br />  class RI :<br />    public hadoop::Record {<br />  private:<br />    int32_t I32;<br />    double D;<br />    std::string S;<br />
  6. remember, to only use C++/Java<br />$ rcc --help<br />Usage: rcc --language [java|c++] ddl-files<br />
  7. then you can start to make it better…<br />I wanted it in Python<br />Need 2 parts:<br />Parsing library and<br />DDL translator<br />I only did the first part<br />If you need the second part, let me know<br />
  8. Hey Jute don’t be afraid…<br />
  9. you were made to go out and get her…<br />
  10. the minute you let her under your skin…<br />I bet you thought I was done with “Hey Jude” references, eh?<br />How I built it<br />Ply == lex and yacc<br />Parser == 234 lines including tests!<br />Outputs generic data types<br />You have to do the class transform yourself<br />You can use my lex and yacc stuff in your language of choice<br />
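The "class transform" step above can be sketched in a few lines. This is only an illustration, not the talk's code: it assumes the parser hands back a struct as a plain sequence of values in DDL field order, and the `Link` namedtuple and the `to_link` helper are hypothetical names modeled on the `Link` class from the DDL slide.

```python
from collections import namedtuple

# Hypothetical Python counterpart of the `Link` class from the DDL slide.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Turn a generic parsed struct (three values, in DDL order) into a Link."""
    url, is_relative, anchor_text = fields
    return Link(URL=url, isRelative=bool(is_relative), anchorText=anchor_text)


# Assumed shape of the parser's generic output for one record.
link = to_link(("http://example.com/", False, "Example"))
print(link.anchorText)
```

A namedtuple keeps the record immutable and cheap; a full DDL translator would presumably generate such classes instead of having them written by hand.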
  11. and any time you feel the pain…<br />Parsing the binary format is hard<br />Vector vs struct???<br />struct = "s{" record *("," record) "}"<br />vector = "v{" [record *("," record)] "}"<br />LazyString – don’t decode if not needed<br />99% of my Hadoop time was decoding strings I didn’t need<br />Binary on disk -> CSV -> python == wasteful<br />Hadoop unpacks zip files – name it .mod<br />
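The two grammar rules and the LazyString idea above can be combined in a small recursive-descent sketch. It is not the Ply-based parser from the talk. It assumes the input is a bytes line in the text form those rules describe, and that a primitive never contains a raw `,` or `}`. `LazyString` here is a hypothetical stand-in for the class the slide names, and `parse_record` is an illustrative helper.

```python
class LazyString:
    """Keep the raw bytes and decode to str only when someone asks."""

    def __init__(self, raw):
        self._raw = raw
        self._text = None

    @property
    def decoded(self):
        return self._text is not None

    def __str__(self):
        if self._text is None:
            self._text = self._raw.decode("utf-8")
        return self._text


def parse_record(data, pos=0):
    """Parse one record at data[pos]; return (value, next_pos).

    Structs become tuples, vectors become lists, and every primitive
    is left undecoded as a LazyString.
    """
    if data.startswith((b"s{", b"v{"), pos):
        is_struct = data.startswith(b"s{", pos)
        pos += 2
        if data.startswith(b"}", pos):
            # Per the grammar, only a vector may be empty.
            if is_struct:
                raise ValueError("struct needs at least one record")
            return [], pos + 1
        items = []
        while True:
            value, pos = parse_record(data, pos)
            items.append(value)
            if data.startswith(b",", pos):
                pos += 1
            elif data.startswith(b"}", pos):
                pos += 1
                break
            else:
                raise ValueError("expected ',' or '}' at offset %d" % pos)
        return (tuple(items) if is_struct else items), pos
    # Primitive: everything up to the next separator, kept as raw bytes.
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


record, _ = parse_record(b"s{'http://a,T,v{1,2},v{}}")
print(str(record[0]))  # only this field is ever decoded
```

Fields that are never passed to `str()` are never decoded, which is the saving the slide describes.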
  12. nanananana<br />Future work<br />DDL Converter<br />Integrate it officially<br />Record writer (should be easy)<br />SequenceFileAsOutputFormat<br />Integrate your feedback<br />