Hadoop Jute Record Python

My talk for the Hadoop User Group Nov 18 2009 about: Parsing hadoop records using python

    Presentation Transcript

    • Hadoop Record Reader in Python
      HUG: Nov 18 2009
      Paul Tarjan
      http://paulisageek.com
      @ptarjan
      http://github.com/ptarjan/hadoop_record
    • Hey Jute…
      Tabs and newlines are good and all
      For lots of data, don’t do that
    • don’t make it bad...
      Hadoop has a native data storage format called Hadoop Record or “Jute”
      org.apache.hadoop.record
      http://en.wikipedia.org/wiki/Jute
    • take a data structure…
      There is a Data Definition Language!
      module links {
        class Link {
          ustring URL;
          boolean isRelative;
          ustring anchorText;
        };
      }
    • and make it better…
      And a compiler
      $ rcc -l c++ inclrec.jr testrec.jr
      namespace inclrec {
      class RI :
      public hadoop::Record {
      private:
      int32_t I32;
      double D;
      std::string S;
    • remember, to only use C++/Java
      $ rcc --help
      Usage: rcc --language [java|c++] ddl-files
    • then you can start to make it better…
      I wanted it in python
      Need 2 parts.
      Parsing library and
      DDL translator
      I only did the first part
      If you need second part, let me know
    • Hey Jute don't be afraid…
    • you were made to go out and get her…
      http://github.com/ptarjan/hadoop_record
    • the minute you let her under your skin…
      I bet you thought I was done with “Hey Jude” references, eh?
      How I built it
      Ply == lex and yacc
      Parser == 234 lines including tests!
      Outputs generic data types
      You have to do the class transform yourself
      You can use my lex and yacc stuff in your language of choice
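The "class transform" step mentioned above might look roughly like this sketch. It assumes the parser hands back a struct as a plain sequence of field values in DDL order, which may not match the actual output shape of hadoop_record. The `Link` class and the `to_link` helper are hypothetical names, modelled on the DDL example from the earlier slide.

```python
from collections import namedtuple

# Hypothetical class mirroring the `Link` record from the DDL slide.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Map a generic parsed struct onto a Link.

    Assumption: `fields` is a sequence of values in DDL declaration order.
    """
    if len(fields) != len(Link._fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(Link._fields), len(fields))
        )
    return Link(*fields)


# Generic output as a parser might produce it (illustrative values only).
generic = ["http://paulisageek.com", False, "home page"]
link = to_link(generic)
print(link.URL, link.isRelative, link.anchorText)
```

A namedtuple keeps the transform to one line per record type; a DDL translator could in principle generate these definitions automatically.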
    • and any time you feel the pain…
      Parsing the binary format is hard
      Vector vs struct???
      struct = "s{" record *("," record) "}"
      vector = "v{" [record *("," record)] "}"
      LazyString – don’t decode if not needed
      99% of my hadoop time was decoding strings I didn’t need
      Binary on disk -> CSV -> python == wasteful
      Hadoop unpacks zip files – name it .mod
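The struct/vector grammar and the LazyString idea above can be sketched together in a small hand-written parser for the text (CSV) form. This is not the PLY-based parser from the talk. It rests on several assumptions: strings are prefixed with `'`, booleans are `T`/`F`, and delimiters inside strings are escaped as `%XX`. All names here (`LazyString`, `parse`, the `_parse_*` helpers) are illustrative only.

```python
from urllib.parse import unquote


class LazyString:
    """Keeps the raw escaped text and decodes it only when str() is called."""

    __slots__ = ("_raw", "_value")

    def __init__(self, raw):
        self._raw = raw
        self._value = None

    def __str__(self):
        if self._value is None:
            # Assumption: escapes are %XX sequences, e.g. %2C for a comma.
            self._value = unquote(self._raw)
        return self._value


def _parse_scalar(token):
    if token.startswith("'"):          # assumed string marker
        return LazyString(token[1:])
    if token in ("T", "F"):            # assumed boolean encoding
        return token == "T"
    try:
        return int(token)
    except ValueError:
        return float(token)


def _parse_sequence(text, pos, kind):
    """Parse the body after "s{" or "v{"; return (value, next position)."""
    items = []
    if pos < len(text) and text[pos] == "}":
        if kind == "s":
            raise ValueError("a struct needs at least one record")
        return items, pos + 1          # empty vector: "v{}"
    while True:
        item, pos = _parse_value(text, pos)
        items.append(item)
        if pos >= len(text):
            raise ValueError("unterminated sequence")
        if text[pos] == "}":
            # Structs become tuples, vectors become lists.
            return (tuple(items) if kind == "s" else items), pos + 1
        if text[pos] != ",":
            raise ValueError("expected ',' or '}' at offset %d" % pos)
        pos += 1


def _parse_value(text, pos):
    if text.startswith(("s{", "v{"), pos):
        return _parse_sequence(text, pos + 2, text[pos])
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    return _parse_scalar(text[pos:end]), end


def parse(text):
    value, pos = _parse_value(text, 0)
    if pos != len(text):
        raise ValueError("trailing data at offset %d" % pos)
    return value


record = parse("s{'http://example.com/a%2Cb,T,v{1,2,3},v{}}")
print(record[1], record[2], record[3])
print(str(record[0]))  # the string is only decoded here
```

Only the empty case separates the two productions: a vector may be `v{}` while a struct needs at least one record. Returning tuples for structs and lists for vectors is a design choice of this sketch, made so that callers can tell them apart.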
    • nanananana
      Future work
      DDL Converter
      Integrate it officially
      Record writer (should be easy)
      SequenceFileAsOutputFormat
      Integrate your feedback