Big Data with Pig and Python
Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013




Tuesday, April 9, 13
About Me

                       • Mathematician/Physicist turned Consultant
                       • Graduate Student in CS at UNO
                       • Current Software Engineer at Sojern


Working with Big Data



What is Big Data?

         Data Source               Size
         Wikipedia Database Dump   9GB
         Open Street Map           19GB
         Common Crawl              81TB
         1000 Genomes              200TB
         Large Hadron Collider     15PB annually

         Gigabytes - normal size for relational databases
         Terabytes - relational databases may start to experience scaling issues
         Petabytes - relational databases struggle to scale without a lot of fine tuning
Working With Data

      Expectation vs. Reality:

                            •   Different File Formats
                            •   Missing Values
                            •   Inconsistent Schema
                            •   Loosely Structured
                            •   Lots of it




MapReduce

                       • Map - Emit key/value pairs from data
                       • Reduce - Collect data with common keys
                       • Tries to minimize moving data between nodes

                       Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview




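The map and reduce steps above can be sketched in plain Python as an in-memory word count, with no Hadoop involved (the function names here are made up for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) key/value pairs from each input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: collect values sharing a key, then sum them
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data with pig", "pig and python"]))
print(counts)  # {'big': 1, 'data': 1, 'with': 1, 'pig': 2, 'and': 1, 'python': 1}
```

On a cluster, the shuffle step routes each key to one reducer; the grouping dict above stands in for that machinery.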
MapReduce Issues

                       • Very low-level abstraction
                       • Cumbersome Java API
                       • Unfamiliar to data analysts
                       • Rudimentary support for data pipelines

Pig
                       • Eats anything
                       • SQL-like, procedural data flow language
                       • Extensible with Java, Jython, Groovy, Ruby
                         or JavaScript
                       • Provides opportunities to optimize
                         workflows



Alternatives
                       • Java MapReduce API
                       • Hadoop Streaming
                       • Hive
                       • Spark
                       • Cascading
                       • Cascalog
Python

                       • Data analysis - pandas, numpy, networkx
                       • Machine learning - scikits.learn, milk
                       • Scientific - scipy, pyephem, astropysics
                       • Visualization - matplotlib, d3py, ggplot

Pig Features



Input/Output
                       • HBase           • Sequence File
                       • JDBC Database   • Hive Columnar
                       • JSON            • XML
                       • CSV/TSV         • Apache Log
                       • Avro            • Thrift
                       • Protobuf        • Regex
Relational Operators
                       LIMIT   GROUP   FILTER    CROSS


              COGROUP          JOIN    STORE    DISTINCT


               FOREACH         LOAD    ORDER    UNION




Built In Functions
                       COS       SIN      AVG      SUM


                  COUNT         RANDOM   LOWER    UPPER


                CONCAT           MAX      MIN    TOKENIZE




User Defined Functions
                       • Easy way to add arbitrary code to Pig
                        • Eval - Filter, aggregate, or evaluate
                        • Storage - Load/Store data
                       • Full support for Java and Jython
                       • Experimental support for Groovy, Ruby and
                         JavaScript


Census Example


Getting Data




Convert to TSV
          ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"




                       • Uses Geospatial Data Abstraction Library
                         (GDAL) to convert to TSV
                       • TSV > CSV - tab separators avoid clashing with the
                         commas inside field values


Inspect Headers
                       f = open('CSA_2010Census_DP1.tsv')
                       header = f.readline()
                       headers = header.strip('\n').split('\t')
                       list(enumerate(headers))

                       [(0,   'WKT'),
                        (1,   'GEOID10'),
                        (2,   'NAMELSAD10'),
                        (3,   'ALAND10'),
                        (4,   'AWATER10'),
                        (5,   'INTPTLAT10'),
                        (6,   'INTPTLON10'),
                        (7,   'DP0010001'),
                         .
                         .
                         .




Pig Quick Start
      •       Download Pig Distribution

      •       Untar package

      •       Start Pig in local mode
                       pig -x local
                       grunt> ls
                        file:/data/CSA_2010Census_DP1.dbf<r 1>                                  841818
                       file:/data/CSA_2010Census_DP1.prj<r 1>                                  167
                       file:/data/CSA_2010Census_DP1.shp<r 1>                                  76180308
                       file:/data/CSA_2010Census_DP1.shx<r 1>                                  3596
                       file:/data/CSA_2010Census_DP1.tsv<r 1>                                  111224058


                                                  http://pig.apache.org/releases.html

                                      https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

Loading Data

     grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();




Extracting Data
       grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
       grunt> extracted_no_types = FOREACH csas GENERATE $2
          AS name, $7 as population;
       grunt> describe extracted_no_types
       extracted_no_types: {name: bytearray,population: bytearray};




Adding Schema
       grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
       grunt> extracted = FOREACH csas GENERATE $2
          AS name:chararray, $7 as population:int;
       grunt> describe extracted;
       extracted: {name: chararray,population: int}




Ordering
        grunt> ordered = ORDER extracted by population DESC;
        grunt> dump ordered;

        ("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
        ("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
        ("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
        ("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
        ("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
        ("San Jose-San Francisco-Oakland, CA CSA",7468390)
        ("Dallas-Fort Worth, TX CSA",6731317)
        ("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)




Storing Data
  grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');




        ls -a
        .part-m-00035.crc   .part-m-00115.crc   .pig_header    part-m-00077   part-m-00157
        .part-m-00036.crc   .part-m-00116.crc   .pig_schema    part-m-00078   part-m-00158
        .part-m-00037.crc   .part-m-00117.crc   _SUCCESS       part-m-00079   part-m-00159
        .part-m-00038.crc   .part-m-00118.crc   part-m-00000   part-m-00080   part-m-00160
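
Pig splits its output across many part-m files, as the listing shows. A small helper can stitch them back together in Python; the directory layout matches the STORE above, but the function itself is a sketch:

```python
import glob
import os

def read_pig_output(directory):
    # Concatenate Pig's part files in order, skipping Hadoop's
    # bookkeeping entries (_SUCCESS, .crc checksums, .pig_header)
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, 'part-*'))):
        with open(path) as part:
            for line in part:
                rows.append(line.rstrip('\n').split('\t'))
    return rows
```

With the census data above, each row would come back as a [name, population] pair.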




Space Catalog Example



Space Catalog
                       • 14,000+ objects in public catalog
                       • Use Two Line Element sets to propagate
                         out positions and velocities
                       • Can generate over 100 million positions &
                         velocities per day




Two Line Elements
         ISS (ZARYA)
         1 25544U 98067A  08264.51782528 -.00002182 00000-0 -11606-4 0 2927
         2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537



        • Use Python script to convert to Pig friendly TSV
        • Create Python UDF to parse TLE into parameters
        • Use Python UDF with Java libraries to propagate out
                positions
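
The first step above - turning raw TLEs into a Pig-friendly TSV - can be sketched as a short Python script; the file names and the sample scratch file are made up for illustration, but the 3-line record layout is standard:

```python
# Write a tiny sample catalog (the ISS record from this slide) to a scratch file
sample = (
    'ISS (ZARYA)\n'
    '1 25544U 98067A  08264.51782528 -.00002182 00000-0 -11606-4 0 2927\n'
    '2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n'
)
with open('sample-tles.txt', 'w') as f:
    f.write(sample)

def tles_to_tsv(tle_path, tsv_path):
    # A TLE catalog repeats a 3-line record: name line, then line 1 and line 2
    with open(tle_path) as infile, open(tsv_path, 'w') as outfile:
        lines = [line.rstrip('\n') for line in infile if line.strip()]
        for i in range(0, len(lines) - 2, 3):
            # One record per tab-separated row so PigStorage() can split it
            outfile.write('%s\t%s\t%s\n' % (lines[i].strip(), lines[i + 1], lines[i + 2]))

tles_to_tsv('sample-tles.txt', 'sample-tles.tsv')
```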



Python UDFs
                       • Easy way to extend Pig with new functions
                       • Uses Jython which is at Python 2.5
                       • Cannot take advantage of libraries with C
                         dependencies (e.g. numpy, scikits, etc...)
                       • Can use Java classes

TLE parsing
         BSTAR drag term, columns 54-61: -11606-4 (decimal point assumed)

         def parse_tle_number(tle_number_string):
             split_string = tle_number_string.split('-')
             if len(split_string) == 3:
                 new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2])+1)
             elif len(split_string) == 2:
                 new_number = str(split_string[0]) + 'e-' + str(int(split_string[1])+1)
             elif len(split_string) == 1:
                 new_number = '0.' + str(split_string[0])
             else:
                 raise TypeError('Input is not in the TLE float format')
             return float(new_number)

                                            Full parser at https://gist.github.com/shawnhermans/4569360
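
The parser's three branches can be exercised on the field shapes from the slide; the function is repeated here (with comments) so the snippet runs on its own:

```python
def parse_tle_number(tle_number_string):
    # TLE floats pack a value with an assumed leading decimal point,
    # optionally followed by a base-10 exponent, e.g. '-11606-4'
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:    # negative mantissa and negative exponent
        new_number = '-' + split_string[1] + 'e-' + str(int(split_string[2]) + 1)
    elif len(split_string) == 2:  # positive mantissa, negative exponent
        new_number = split_string[0] + 'e-' + str(int(split_string[1]) + 1)
    elif len(split_string) == 1:  # plain field, decimal point assumed
        new_number = '0.' + split_string[0]
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

print(parse_tle_number('-11606-4'))  # -0.11606
print(parse_tle_number('0006703'))   # 0.0006703
```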

Simple UDF
        import tleparser

        @outputSchema("params:map[]")
        def parseTle(name, line1, line2):
            params = tleparser.parse_tle(name, line1, line2)
            return params




Extract Parameters
       grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
          AS (name:chararray, line1:chararray, line2:chararray);
       grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
       grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

       ([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,
       epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,
       mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,
       international_designator#12053A,classification#U,epoch_day#17.78040066,
       satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,
       ra_of_asc_node#344.5315])




Storing Results

       grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
       grunt> STORE parsed INTO 'propagated-csv' using PigStorage(',','-schema');




UDF with Java Import
  from jsattrak.objects import SatelliteTleSGP4

  @outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
  def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
      satellite = SatelliteTleSGP4(name, line1, line2)
      ecef_positions = []
      increment = (float(end_time) - float(start_time)) / float(number_of_points)
      current_time = start_time

      while current_time <= end_time:
          positions = [current_time]
          positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
          ecef_positions.append(tuple(positions))
          current_time += increment

      return ecef_positions




Propagate Positions
    grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
    grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
      AS (name:chararray, line1:chararray, line2:chararray);
    grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2),
      myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
    grunt> DESCRIBE propagated;
    propagated: {params: map[],propagated: {positions:
      (time: double,x: double,y: double,z: double)}}
    grunt> flattened = FOREACH propagated
      GENERATE params#'satellite_number', FLATTEN(propagated);
    grunt> DESCRIBE flattened;
    flattened: {bytearray,propagated::time: double,propagated::x: double,
      propagated::y: double,propagated::z: double}




Result

  (38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
  (38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
  (38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
  (38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
  (38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)




Pig on Amazon EMR



Pig with EMR

                       • SSH in to box to run interactive Pig session
                       • Load data to/from S3
                       • Run standalone Pig scripts on demand


Conclusion



Other Useful Tools
                       • Python-dateutil : Super-duper date parser
                       • Oozie : Hadoop workflow engine
                       • Piggybank and Elephant Bird : 3rd party Pig
                         libraries
                       • Chardet: Character detection library for
                         Python



Parting Thoughts
                       •   Great ETL tool/language

                       •   Flexible enough to write general purpose
                           MapReduce jobs

                       •   Limited, but emerging 3rd party libraries

                       •   Jython for UDFs is extremely limiting (Spark?)

       Twitter: @shawnhermans
       Email: shawnhermans@gmail.com


