
Hadoop Record Reader In Python



Published in: Technology

  1. Hadoop Record Reader in Python
     HUG: Nov 18 2009
     Paul Tarjan
     @ptarjan
  2. Hey Jute…
     Tabs and newlines are good and all
     For lots of data, don’t do that
  3. don’t make it bad...
     Hadoop has a native data storage format called Hadoop Record or “Jute”
     org.apache.hadoop.record
  4. take a data structure…
     There is a Data Definition Language!
     module links {
       class Link {
         ustring URL;
         boolean isRelative;
         ustring anchorText;
       };
     }
  5. and make it better…
     And a compiler
     $ rcc -lc++ inclrec.jr testrec.jr
     namespace inclrec {
       class RI :
         public hadoop::Record {
       private:
         int32_t I32;
         double D;
         std::string S;
  6. remember, to only use C++/Java
     $ rcc --help
     Usage: rcc --language [java|c++] ddl-files
  7. then you can start to make it better…
     I wanted it in Python
     Need 2 parts:
       Parsing library and
       DDL translator
     I only did the first part
     If you need the second part, let me know
  8. Hey Jute don't be afraid…
  9. you were made to go out and get her…
  10. the minute you let her under your skin…
      I bet you thought I was done with “Hey Jude” references, eh?
      How I built it
        Ply == lex and yacc
        Parser == 234 lines including tests!
        Outputs generic data types
        You have to do the class transform yourself
        You can use my lex and yacc stuff in your language of choice
  11. and any time you feel the pain…
      Parsing the binary format is hard
      Vector vs struct???
        struct = "s{" record *("," record) "}"
        vector = "v{" [record *("," record)] "}"
      LazyString – don’t decode if not needed
        99% of my Hadoop time was decoding strings I didn’t need
        Binary on disk -> CSV -> Python == wasteful
      Hadoop unpacks zip files – name it .mod
  12. nanananana
      Future work
        DDL Converter
        Integrate it officially
        Record writer (should be easy)
        SequenceFileAsOutputFormat
        Integrate your feedback
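
The struct and vector rules on slide 11 can be sketched as a small parser. The talk's parser was built with Ply; the sketch below is a hand-written recursive-descent stand-in that uses only the standard library, so it is not the author's code. The name `parse_record` is hypothetical. It assumes that "," and "}" never appear unescaped inside a primitive, leaves primitives as raw tokens ("generic data types"), and does not handle any record kinds beyond the two rules shown on the slide.

```python
def parse_record(text, pos=0):
    """Parse one record starting at text[pos]; return (value, next_pos).

    struct = "s{" record *("," record) "}"    -> tuple
    vector = "v{" [record *("," record)] "}"  -> list
    Anything else is treated as a primitive and returned as a raw token.
    """
    if text.startswith("s{", pos) or text.startswith("v{", pos):
        kind = text[pos]
        pos += 2
        items = []
        # Only a vector may be empty; the struct rule needs one record.
        if text[pos:pos + 1] == "}":
            if kind == "s":
                raise ValueError("struct must contain at least one record")
            return items, pos + 1
        while True:
            value, pos = parse_record(text, pos)
            items.append(value)
            delimiter = text[pos:pos + 1]
            if delimiter == "}":
                break
            if delimiter != ",":
                raise ValueError(f"expected ',' or '}}' at offset {pos}")
            pos += 1  # skip the comma
        pos += 1  # skip the closing brace
        return (tuple(items) if kind == "s" else items), pos

    # Primitive: assumed to run up to the next unescaped "," or "}".
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    return text[pos:end], end
```

A call such as `parse_record("s{T,v{1,2},v{}}")` returns the parsed value together with the offset just past the record, which lets a caller check that the whole line was consumed. Turning the generic tuples and lists into classes is left to the caller, as slide 10 says.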
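
The LazyString idea on slide 11 (don't decode if not needed) can be illustrated with a minimal sketch. This is not the talk's implementation: the class layout and the `decoded` property are hypothetical, and it assumes string fields arrive as percent-escaped UTF-8 bytes, which is how the CSV serialization is understood to escape delimiters.

```python
from urllib.parse import unquote_to_bytes


class LazyString:
    """Keep the raw serialized bytes and decode them only on first use."""

    __slots__ = ("_raw", "_text")

    def __init__(self, raw):
        self._raw = raw    # percent-escaped UTF-8 bytes, exactly as read
        self._text = None  # decoded value, filled in on demand

    @property
    def decoded(self):
        """True once the string has actually been decoded."""
        return self._text is not None

    def __str__(self):
        if self._text is None:
            # Unescape %XX sequences, then decode UTF-8, exactly once.
            self._text = unquote_to_bytes(self._raw).decode("utf-8")
        return self._text

    def __repr__(self):
        return f"LazyString({self._raw!r})"
```

A mapper that only looks at one field of each record never calls `str()` on the others, so their bytes are never unescaped or decoded.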