Hadoop Record Reader in Python
HUG: Nov 18, 2009
Paul Tarjan
http://paulisageek.com
@ptarjan
http://github.com/ptarjan/hadoop_record

Hey Jute…
Tabs and newlines are good and all, but for lots of data, don’t do that.

don’t make it bad...
Hadoop has a native data storage format called Hadoop Record, or “Jute”:
org.apache.hadoop.record
http://en.wikipedia.org/wiki/Jute

take a data structure…
There is a Data Definition Language!

    module links {
        class Link {
            ustring URL;
            boolean isRelative;
            ustring anchorText;
        };
    }

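For illustration, one Link record conforming to this DDL could serialize in the textual record form (the s{…} struct syntax shown later in the talk) roughly like this; the field values are made up and escaping rules are omitted:

    s{http://example.com/,false,Example}
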
and make it better…
And a compiler:

    $ rcc -l c++ inclrec.jr testrec.jr

    namespace inclrec {
        class RI : public hadoop::Record {
        private:
            int32_t I32;
            double D;
            std::string S;
            // ... generated serialize/deserialize methods elided ...
        };
    }

remember, to only use C++/Java

    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files

then you can start to make it better…
I wanted it in Python. That takes two parts: a parsing library and a DDL translator. I only did the first part; if you need the second part, let me know.

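In rough terms, using the parsing library might look like the sketch below. The module and function names here are hypothetical, not the documented API (see http://github.com/ptarjan/hadoop_record for the real entry points); the point is that you get generic Python types back, not Link objects:

    # Hypothetical usage sketch -- `hadoop_record.parse` is an assumed
    # name, not necessarily the repo's real API.
    import hadoop_record

    record = hadoop_record.parse("s{http://example.com/,false,Example}")
    url, is_relative, anchor_text = record  # a plain tuple, no Link class
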
Hey Jute, don't be afraid…
you were made to go out and get her…
http://github.com/ptarjan/hadoop_record

the minute you let her under your skin…
I bet you thought I was done with “Hey Jude” references, eh?

How I built it:
- PLY == lex and yacc for Python (see the sketch below)
- Parser == 234 lines, including tests!
- Outputs generic data types; you have to do the class transform yourself
- You can use my lex and yacc grammar in your language of choice

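To show the flavor of the PLY approach, here is a heavily simplified sketch of a lexer and parser for the textual record grammar on the next slide. This is not the real 234-line parser: the scalar handling, escaping, and whitespace rules are naive assumptions of mine.

    # Simplified PLY sketch for the textual record grammar; structs
    # become tuples and vectors become lists (generic data types).
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("STRUCT_OPEN", "VECTOR_OPEN", "CLOSE", "COMMA", "SCALAR")

    # Function rules match before string rules in PLY, so the "s{" and
    # "v{" openers win over the catch-all SCALAR rule.
    def t_STRUCT_OPEN(t):
        r"s\{"
        return t

    def t_VECTOR_OPEN(t):
        r"v\{"
        return t

    t_CLOSE = r"\}"
    t_COMMA = r","
    t_SCALAR = r"[^,{}]+"  # naive: any run of non-delimiter characters
    t_ignore = " \t\n"

    def t_error(t):
        raise SyntaxError("illegal character %r" % t.value[0])

    def p_record(p):
        """record : struct
                  | vector
                  | SCALAR"""
        p[0] = p[1]

    def p_struct(p):
        "struct : STRUCT_OPEN records CLOSE"
        p[0] = tuple(p[2])                    # structs -> tuples

    def p_vector(p):
        """vector : VECTOR_OPEN records CLOSE
                  | VECTOR_OPEN CLOSE"""
        p[0] = p[2] if len(p) == 4 else []    # vectors -> lists

    def p_records(p):
        """records : record
                   | records COMMA record"""
        p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

    def p_error(p):
        raise SyntaxError("parse error at %r" % (p,))

    lexer = lex.lex()
    parser = yacc.yacc()

    print(parser.parse("s{http://example.com/,false,v{a,b}}"))
    # -> ('http://example.com/', 'false', ['a', 'b'])
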
and any time you feel the pain…
Parsing the binary format is hard. Vector vs. struct???

    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"

LazyString – don’t decode a string if it isn’t needed. 99% of my Hadoop time was decoding strings I didn’t need; binary on disk -> CSV -> Python == wasteful.
Hadoop unpacks zip files – name it .mod instead.

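The LazyString idea fits in a few lines. This is an illustrative guess at the technique, not the actual class from hadoop_record:

    # LazyString sketch: keep the raw bytes from the record and only pay
    # the decoding cost when the string is actually used.
    class LazyString:
        __slots__ = ("_raw", "_decoded")

        def __init__(self, raw_bytes):
            self._raw = raw_bytes        # undecoded bytes, straight off disk
            self._decoded = None

        def __str__(self):
            if self._decoded is None:    # decode once, on first use
                self._decoded = self._raw.decode("utf-8")
            return self._decoded

    s = LazyString(b"anchor text")  # no decoding happens here...
    print(str(s))                   # ...only here
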
nanananana
Future work:
- DDL converter
- Integrate it officially
- Record writer (should be easy – see the sketch below)
- SequenceFileAsOutputFormat
- Integrate your feedback

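As for “should be easy”: a writer for the textual form is roughly the inverse of the parser. A sketch under the same generic-type conventions as above (tuples for structs, lists for vectors), with escaping omitted:

    # Record-writer sketch: emit the textual form from generic Python
    # types. Escaping of commas/braces inside scalars is omitted.
    def write_record(value):
        if isinstance(value, tuple):  # struct
            return "s{" + ",".join(write_record(v) for v in value) + "}"
        if isinstance(value, list):   # vector
            return "v{" + ",".join(write_record(v) for v in value) + "}"
        return str(value)             # scalar

    print(write_record(("http://example.com/", "false", ["a", "b"])))
    # -> s{http://example.com/,false,v{a,b}}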
