Hadoop Record Reader In Python

3,166 views · Published in: Technology


Transcript

  • 1. Hadoop Record Reader in Python
    HUG: Nov 18 2009
    Paul Tarjan
    http://paulisageek.com
    @ptarjan
    http://github.com/ptarjan/hadoop_record
  • 2. Hey Jute…
    Tabs and newlines are good and all
    For lots of data, don’t do that
  • 3. don’t make it bad...
    Hadoop has a native data storage format called Hadoop Record or “Jute”
    org.apache.hadoop.record
    http://en.wikipedia.org/wiki/Jute
  • 4. take a data structure…
    There is a Data Definition Language!
    module links {
    class Link {
    ustring URL;
    boolean isRelative;
    ustring anchorText;
    };
    }
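
The DDL above maps cleanly onto plain Python types. As a hedged sketch (the field values are made up for illustration, and the parsing library itself emits generic data types rather than classes):

```python
# Hypothetical example values for a Link record as defined by the DDL
# above: ustring -> str, boolean -> bool. This illustrates the shape
# of the data, not actual output from the library.
link = {
    "URL": "http://example.com/page",
    "isRelative": False,
    "anchorText": "Example page",
}
```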
  • 5. and make it better…
    And a compiler
    $ rcc -lc++ inclrec.jr testrec.jr
    namespace inclrec {
    class RI :
    public hadoop::Record {
    private:
    int32_t I32;
    double D;
    std::string S;
  • 6. remember to only use C++/Java
    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files
  • 7. then you can start to make it better…
    I wanted it in python
    Need 2 parts:
    Parsing library and
    DDL translator
    I only did the first part
    If you need the second part, let me know
  • 8. Hey Jute don't be afraid…
  • 9. you were made to go out and get her…
    http://github.com/ptarjan/hadoop_record
  • 10. the minute you let her under your skin…
    I bet you thought I was done with “Hey Jude” references, eh?
    How I built it
    PLY == lex and yacc
    Parser == 234 lines including tests!
    Outputs generic data types
    You have to do the class transform yourself
    You can use my lex and yacc stuff in your language of choice
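
The "class transform" you do yourself can be sketched like this, assuming the parser hands back a generic list of field values in DDL order (the record shape and every name here are assumptions for illustration, not the library's real API):

```python
# Hedged sketch: turn a generic parsed record (assumed here to be a
# plain list of field values in DDL order) into a class. Mirrors the
# Link record from the DDL slide earlier in the deck.
class Link:
    def __init__(self, url, is_relative, anchor_text):
        self.url = url
        self.is_relative = is_relative
        self.anchor_text = anchor_text

    @classmethod
    def from_record(cls, record):
        # record is the generic [url, is_relative, anchor_text] list
        url, is_relative, anchor_text = record
        return cls(url, is_relative, anchor_text)

link = Link.from_record(["http://example.com", False, "Example"])
```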
  • 11. and any time you feel the pain…
    Parsing the binary format is hard
    Vector vs struct???
    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"
    LazyString – don’t decode if not needed
    99% of my hadoop time was decoding strings I didn’t need
    Binary on disk -> CSV -> python == wasteful
    Hadoop unpacks zip files – name it .mod
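
To make the struct-vs-vector distinction concrete, here is a toy recursive-descent parser over a textual rendering of the two grammar rules above. The real library parses Hadoop's binary encoding (via PLY); the text form, the atom handling, and the function names are all simplifications for illustration:

```python
# Toy parser for the textual grammar on the slide:
#   struct = "s{" record *("," record) "}"
#   vector = "v{" [record *("," record)] "}"
# Structs become tuples, vectors become lists, atoms stay strings.
def parse(text):
    value, _ = _record(text, 0)
    return value

def _record(text, i):
    if text.startswith("s{", i) or text.startswith("v{", i):
        kind, items, i = text[i], [], i + 2
        if text[i] != "}":               # non-empty body
            while True:
                item, i = _record(text, i)
                items.append(item)
                if text[i] == ",":
                    i += 1               # another record follows
                else:
                    break
        # skip the closing "}"
        return (tuple(items) if kind == "s" else items), i + 1
    # atom: read up to the next delimiter
    j = i
    while j < len(text) and text[j] not in ",}":
        j += 1
    return text[i:j], j
```

Returning tuples for structs and lists for vectors is one way to keep the two cases distinct after parsing.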
  • 12. nanananana
    Future work
    DDL Converter
    Integrate it officially
    Record writer (should be easy)
    SequenceFileAsOutputFormat
    Integrate your feedback