Paul Tarjan (http://github.com/ptarjan) presented this to the Hadoop User Group at the Yahoo! Sunnyvale campus on 11/18/09. Paul describes his solution for building a Hadoop Record Reader in Python.
Nov HUG 2009: Hadoop Record Reader In Python
1. Hadoop Record Reader in Python HUG: Nov 18 2009 Paul Tarjan http://paulisageek.com @ptarjan http://github.com/ptarjan/hadoop_record
2. Hey Jute… Tabs and newlines are good and all For lots of data, don’t do that
3. don’t make it bad... Hadoop has a native data storage format called Hadoop Record or “Jute” org.apache.hadoop.record http://en.wikipedia.org/wiki/Jute
4. take a data structure… There is a Data Definition Language! module links { class Link { ustring URL; boolean isRelative; ustring anchorText; }; }
5. and make it better… And a compiler $ rcc -lc++ inclrec.jr testrec.jr namespace inclrec { class RI : public hadoop::Record { private: int32_t I32; double D; std::string S;
6. remember, to only use C++/Java $ rcc --help Usage: rcc --language [java|c++] ddl-files
7. then you can start to make it better… I wanted it in Python Need 2 parts: Parsing library and DDL translator I only did the first part If you need the second part, let me know
9. you were made to go out and get her… http://github.com/ptarjan/hadoop_record
10. the minute you let her under your skin… I bet you thought I was done with “Hey Jude” references, eh? How I built it Ply == lex and yacc Parser == 234 lines including tests! Outputs generic data types You have to do the class transform yourself You can use my lex and yacc stuff in your language of choice
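The "class transform" mentioned on this slide could look roughly like the sketch below. It is not taken from the hadoop_record repository: the `Link` dataclass is a hand-written counterpart of the DDL class from slide 4, `to_link` is a hypothetical helper name, and it assumes the parser hands back a struct as a plain sequence of three values in field order.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hand-written Python counterpart of the DDL class links.Link."""
    URL: str
    isRelative: bool
    anchorText: str


def to_link(fields):
    """Turn a generic parsed struct into a Link.

    Assumption: `fields` is a sequence of three values in DDL field order.
    """
    if len(fields) != 3:
        raise ValueError(f"Link expects 3 fields, got {len(fields)}")
    url, is_relative, anchor_text = fields
    return Link(URL=str(url), isRelative=bool(is_relative),
                anchorText=str(anchor_text))


# Usage: a generic record as a parser might return it.
link = to_link(["http://example.com/", False, "home"])
print(link.anchorText)
```

Doing this by hand stands in for the missing DDL translator: each DDL class needs one such dataclass and one such conversion.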
11. and any time you feel the pain… Parsing the binary format is hard Vector vs struct??? struct = "s{" record *("," record) "}" vector = "v{" [record *("," record)] "}" LazyString – don’t decode if not needed 99% of my hadoop time was decoding strings I didn’t need Binary on disk -> CSV -> Python == wasteful Hadoop unpacks zip files – name it .mod
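The struct and vector rules on this slide, together with the LazyString idea, can be sketched as a small recursive-descent parser. This is an illustration, not the Ply-based parser from the talk. It makes three assumptions: a primitive is any run of bytes up to the next `,` or `}`, escape sequences are ignored, and the `LazyString` class and `parse_record` function names are made up for the example.

```python
class LazyString:
    """Keeps the raw bytes and decodes them only when the text is needed."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._text = None

    @property
    def decoded(self) -> bool:
        return self._text is not None

    def __str__(self) -> str:
        if self._text is None:          # decode once, on first use
            self._text = self._raw.decode("utf-8")
        return self._text


def _parse_items(data: bytes, pos: int):
    """Parse [record *("," record)] "}" and return (items, next_pos)."""
    items = []
    if data.startswith(b"}", pos):      # empty body, only legal for vectors
        return items, pos + 1
    while True:
        value, pos = parse_record(data, pos)
        items.append(value)
        if data.startswith(b",", pos):
            pos += 1
        elif data.startswith(b"}", pos):
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or '}}' at offset {pos}")


def parse_record(data: bytes, pos: int = 0):
    """Parse one record at pos and return (value, next_pos).

    Structs become tuples, vectors become lists, and anything else is
    kept undecoded as a LazyString (assumption: no escaping).
    """
    if data.startswith(b"s{", pos):
        items, pos = _parse_items(data, pos + 2)
        if not items:                   # the grammar requires one record
            raise ValueError("struct needs at least one record")
        return tuple(items), pos
    if data.startswith(b"v{", pos):
        return _parse_items(data, pos + 2)
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


record, _ = parse_record(b"s{abc,v{1,2},v{}}")
print(str(record[0]))  # only this field is ever decoded
```

The struct/vector difference shows up in the empty case: `v{}` is a valid empty vector, while the struct rule requires at least one record. Fields that are never passed to `str()` stay as raw bytes, which is the saving the slide describes.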
12. nanananana Future work DDL Converter Integrate it officially Record writer (should be easy) SequenceFileAsOutputFormat Integrate your feedback