Hadoop Record Reader in Python
HUG: Nov 18, 2009
Paul Tarjan
http://paulisageek.com
@ptarjan
http://github.com/ptarjan/hadoop_record

Hey Jute…
Tabs and newlines are good and all, but for lots of data, don’t do that.

don’t make it bad...
Hadoop has a native data storage format called Hadoop Record, or “Jute”:
org.apache.hadoop.record
http://en.wikipedia.org/wiki/Jute

take a data structure…
There is a Data Definition Language!

    module links {
        class Link {
            ustring URL;
            boolean isRelative;
            ustring anchorText;
        };
    }

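For illustration, one Link record conforming to this DDL could serialize in the textual record form (the s{…} struct syntax shown later in the talk) roughly like this; the field values are made up and escaping rules are omitted:

    s{http://example.com/,false,Example}
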
and make it better…
And a compiler:

    $ rcc -l c++ inclrec.jr testrec.jr

    namespace inclrec {
        class RI : public hadoop::Record {
        private:
            int32_t I32;
            double D;
            std::string S;
            // ... generated serialize/deserialize methods elided ...
        };
    }

remember, to only use C++/Java

    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files

then you can start to make it better…
I wanted it in Python. That takes two parts: a parsing library and a DDL translator. I only did the first part; if you need the second part, let me know.

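In rough terms, using the parsing library might look like the sketch below. The module and function names here are hypothetical, not the documented API (see http://github.com/ptarjan/hadoop_record for the real entry points); the point is that you get generic Python types back, not Link objects:

    # Hypothetical usage sketch -- `hadoop_record.parse` is an assumed
    # name, not necessarily the repo's real API.
    import hadoop_record

    record = hadoop_record.parse("s{http://example.com/,false,Example}")
    url, is_relative, anchor_text = record  # a plain tuple, no Link class
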
Hey Jute, don't be afraid…
you were made to go out and get her…
http://github.com/ptarjan/hadoop_record

the minute you let her under your skin…
I bet you thought I was done with “Hey Jude” references, eh?

How I built it:
- PLY == lex and yacc for Python (see the sketch below)
- Parser == 234 lines, including tests!
- Outputs generic data types; you have to do the class transform yourself
- You can use my lex and yacc grammar in your language of choice

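To show the flavor of the PLY approach, here is a heavily simplified sketch of a lexer and parser for the textual record grammar on the next slide. This is not the real 234-line parser: the scalar handling, escaping, and whitespace rules are naive assumptions of mine.

    # Simplified PLY sketch for the textual record grammar; structs
    # become tuples and vectors become lists (generic data types).
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("STRUCT_OPEN", "VECTOR_OPEN", "CLOSE", "COMMA", "SCALAR")

    # Function rules match before string rules in PLY, so the "s{" and
    # "v{" openers win over the catch-all SCALAR rule.
    def t_STRUCT_OPEN(t):
        r"s\{"
        return t

    def t_VECTOR_OPEN(t):
        r"v\{"
        return t

    t_CLOSE = r"\}"
    t_COMMA = r","
    t_SCALAR = r"[^,{}]+"  # naive: any run of non-delimiter characters
    t_ignore = " \t\n"

    def t_error(t):
        raise SyntaxError("illegal character %r" % t.value[0])

    def p_record(p):
        """record : struct
                  | vector
                  | SCALAR"""
        p[0] = p[1]

    def p_struct(p):
        "struct : STRUCT_OPEN records CLOSE"
        p[0] = tuple(p[2])                    # structs -> tuples

    def p_vector(p):
        """vector : VECTOR_OPEN records CLOSE
                  | VECTOR_OPEN CLOSE"""
        p[0] = p[2] if len(p) == 4 else []    # vectors -> lists

    def p_records(p):
        """records : record
                   | records COMMA record"""
        p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

    def p_error(p):
        raise SyntaxError("parse error at %r" % (p,))

    lexer = lex.lex()
    parser = yacc.yacc()

    print(parser.parse("s{http://example.com/,false,v{a,b}}"))
    # -> ('http://example.com/', 'false', ['a', 'b'])
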
and any time you feel the pain…
Parsing the binary format is hard. Vector vs. struct???

    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"

LazyString – don’t decode a string if it isn’t needed. 99% of my Hadoop time was decoding strings I didn’t need; binary on disk -> CSV -> Python == wasteful.
Hadoop unpacks zip files – name it .mod instead.

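The LazyString idea fits in a few lines. This is an illustrative guess at the technique, not the actual class from hadoop_record:

    # LazyString sketch: keep the raw bytes from the record and only pay
    # the decoding cost when the string is actually used.
    class LazyString:
        __slots__ = ("_raw", "_decoded")

        def __init__(self, raw_bytes):
            self._raw = raw_bytes        # undecoded bytes, straight off disk
            self._decoded = None

        def __str__(self):
            if self._decoded is None:    # decode once, on first use
                self._decoded = self._raw.decode("utf-8")
            return self._decoded

    s = LazyString(b"anchor text")  # no decoding happens here...
    print(str(s))                   # ...only here
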
nanananana
Future work:
- DDL converter
- Integrate it officially
- Record writer (should be easy – see the sketch below)
- SequenceFileAsOutputFormat
- Integrate your feedback

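As for “should be easy”: a writer for the textual form is roughly the inverse of the parser. A sketch under the same generic-type conventions as above (tuples for structs, lists for vectors), with escaping omitted:

    # Record-writer sketch: emit the textual form from generic Python
    # types. Escaping of commas/braces inside scalars is omitted.
    def write_record(value):
        if isinstance(value, tuple):  # struct
            return "s{" + ",".join(write_record(v) for v in value) + "}"
        if isinstance(value, list):   # vector
            return "v{" + ",".join(write_record(v) for v in value) + "}"
        return str(value)             # scalar

    print(write_record(("http://example.com/", "false", ["a", "b"])))
    # -> s{http://example.com/,false,v{a,b}}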
