Hadoop Record Reader In Python

Transcript

  • 1. Hadoop Record Reader in Python
    HUG: Nov 18 2009
    Paul Tarjan
    http://paulisageek.com
    @ptarjan
    http://github.com/ptarjan/hadoop_record
  • 2. Hey Jute…
    Tabs and newlines are good and all
    For lots of data, don’t do that
  • 3. don’t make it bad...
    Hadoop has a native data storage format called Hadoop Record or “Jute”
    org.apache.hadoop.record
    http://en.wikipedia.org/wiki/Jute
  • 4. take a data structure…
    There is a Data Definition Language!
    module links {
      class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
      };
    }
  • 5. and make it better…
    And a compiler
    $ rcc -lc++ inclrec.jr testrec.jr
    namespace inclrec {
      class RI : public hadoop::Record {
      private:
        int32_t I32;
        double D;
        std::string S;
  • 6. remember, to only use C++/Java
    $ rcc --help
    Usage: rcc --language [java|c++] ddl-files
  • 7. then you can start to make it better…
    I wanted it in Python
    Need 2 parts:
    a parsing library and
    a DDL translator
    I only did the first part
    If you need the second part, let me know (a sketch of what the translator would generate is after the transcript)
  • 8. Hey Jute don't be afraid…
  • 9. you were made to go out and get her…
    http://github.com/ptarjan/hadoop_record
  • 10. the minute you let her under your skin…
    I bet you thought I was done with “Hey Jude” references, eh?
    How I built it
    PLY == lex and yacc for Python
    Parser == 234 lines, including tests!
    Outputs generic data types
    You have to do the class transform yourself
    You can use my lex and yacc grammar in your language of choice (see the PLY sketch after the transcript)
  • 11. and any time you feel the pain…
    Parsing the binary format is hard
    Vector vs. struct???
    struct = "s{" record *("," record) "}"
    vector = "v{" [record *("," record)] "}"
    LazyString – don’t decode if not needed (see the sketch after the transcript)
    99% of my Hadoop time was decoding strings I didn’t need
    Binary on disk -> CSV -> Python == wasteful
    Hadoop unpacks zip files – name it .mod
  • 12. nanananana
    Future work
    DDL Converter
    Integrate it officially
    Record writer (should be easy)
    SequenceFileAsOutputFormat
    Integrate your feedback
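
Slide 7's missing DDL translator would turn the slide-4 DDL into Python classes, and slide 10 notes the parser only outputs generic data types, so the class transform is on you. Below is a minimal hand-written sketch of that transform, assuming the parser returns a Link record as a plain tuple of field values in DDL order; the Link class and from_generic helper are hypothetical, not part of hadoop_record.

    # Hypothetical hand-written stand-in for what a DDL translator could emit
    # for the Link record on slide 4. Field names come from the DDL; the
    # tuple layout is an assumption about the parser's generic output.
    class Link(object):
        def __init__(self, URL, isRelative, anchorText):
            self.URL = URL
            self.isRelative = isRelative
            self.anchorText = anchorText

        @classmethod
        def from_generic(cls, fields):
            # The parser hands back generic data types (slide 10), so the
            # class transform is just unpacking fields in DDL order.
            URL, isRelative, anchorText = fields
            return cls(URL, bool(int(isRelative)), anchorText)

        def __repr__(self):
            return "Link(URL=%r, isRelative=%r, anchorText=%r)" % (
                self.URL, self.isRelative, self.anchorText)

    # Example with a generic record as the parser might hand it back.
    print(Link.from_generic(("http://example.com/", "1", "click here")))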
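
Slide 10 says the parser was built with PLY on the grammar shown on slide 11. The sketch below is not the hadoop_record code, just a minimal PLY lexer and grammar for the CSV-serialized form of a record; it ignores escaping of commas and braces inside scalars, and the token and rule names are my own.

    # Minimal PLY sketch of the slide-11 grammar for CSV-serialized records:
    #   struct = "s{" record *("," record) "}"
    #   vector = "v{" [record *("," record)] "}"
    # Escaping inside scalars is ignored; this is an illustration of the
    # approach, not the actual hadoop_record parser.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("STRUCT_OPEN", "VECTOR_OPEN", "CLOSE", "COMMA", "SCALAR")

    # Token rules are functions so PLY tries them in this order and the
    # "s{" / "v{" markers win over a plain scalar.
    def t_STRUCT_OPEN(t):
        r"s\{"
        return t

    def t_VECTOR_OPEN(t):
        r"v\{"
        return t

    def t_CLOSE(t):
        r"\}"
        return t

    def t_COMMA(t):
        r","
        return t

    def t_SCALAR(t):
        r"[^,{}]+"
        return t

    def t_error(t):
        raise SyntaxError("illegal character %r" % t.value[0])

    # Structs come back as tuples, vectors as lists -- generic data types,
    # as on slide 10; the class transform is left to the caller.
    def p_record_scalar(p):
        "record : SCALAR"
        p[0] = p[1]

    def p_record_struct(p):
        "record : STRUCT_OPEN fields CLOSE"
        p[0] = tuple(p[2])

    def p_record_vector(p):
        "record : VECTOR_OPEN fields CLOSE"
        p[0] = list(p[2])

    def p_record_empty_vector(p):
        "record : VECTOR_OPEN CLOSE"
        p[0] = []

    def p_fields_one(p):
        "fields : record"
        p[0] = [p[1]]

    def p_fields_many(p):
        "fields : fields COMMA record"
        p[0] = p[1] + [p[3]]

    def p_error(p):
        raise SyntaxError("parse error near %r" % (p,))

    lexer = lex.lex()
    parser = yacc.yacc(write_tables=False, debug=False)

    # A made-up struct containing a vector, just to exercise both rules.
    print(parser.parse("s{http://example.com/,1,v{click,here}}", lexer=lexer))
    # -> ('http://example.com/', '1', ['click', 'here'])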
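
Slide 11's LazyString point is about skipping decode work for string fields the job never touches. Here is a rough sketch of the idea, assuming string fields arrive as raw bytes; it illustrates the trick, not the actual class in hadoop_record.

    # Rough sketch of the LazyString idea from slide 11: keep the raw bytes
    # and only pay for UTF-8 decoding if the field is actually used.
    class LazyString(object):
        __slots__ = ("_raw", "_decoded")

        def __init__(self, raw_bytes):
            self._raw = raw_bytes
            self._decoded = None

        def get(self):
            # Decode on first use and cache; untouched fields cost nothing.
            if self._decoded is None:
                self._decoded = self._raw.decode("utf-8")
            return self._decoded

    # Fields the mapper never calls .get() on are never decoded.
    anchor = LazyString(b"click here")
    print(anchor.get())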