Pig and Python to Process Big Data

Published on April 8th, 2013: a presentation to the Omaha Dynamic Languages User Group.
Comments
  • On slide 23: How can this example work? This syntax is not currently supported: https://issues.apache.org/jira/browse/PIG-2315. Was the DESCRIBE output real?

Transcript

  • 1. Big Data with Pig and Python
    Shawn Hermans, Omaha Dynamic Languages User Group, April 8th, 2013
  • 2. About Me
    • Mathematician/physicist turned consultant
    • Graduate student in CS at UNO
    • Currently a software engineer at Sojern
  • 3. Working with Big Data
  • 4. What is Big Data?
    Scale       Data Source               Size            Notes
    Gigabytes   Wikipedia Database Dump   9 GB            Normal size for relational databases
                Open Street Map           19 GB
    Terabytes   Common Crawl              81 TB           Relational databases may start to experience scaling issues
                1000 Genomes              200 TB
    Petabytes   Large Hadron Collider     15 PB annually  Relational databases struggle to scale without a lot of fine tuning
  • 5. Working With Data: Expectation vs. Reality
    • Different file formats
    • Missing values
    • Inconsistent schema
    • Loosely structured
    • Lots of it
  • 6. MapReduce
    • Map: emit key/value pairs from the data
    • Reduce: collect data with common keys (see the word-count sketch below)
    • Tries to minimize moving data between nodes
    Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview
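    To make the two phases concrete, here is a minimal word-count sketch in plain Python; no Hadoop is involved, and all of the names are illustrative:

        from collections import defaultdict

        # Map phase: emit a (key, value) pair for every word in a line.
        def map_words(line):
            for word in line.split():
                yield (word, 1)

        # Shuffle: group values by key, as the framework does between phases.
        def shuffle(pairs):
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return groups

        # Reduce phase: collapse each key's values into a single result.
        def reduce_counts(word, counts):
            return (word, sum(counts))

        lines = ['pig eats anything', 'a pig is a pig']
        pairs = (p for line in lines for p in map_words(line))
        counts = [reduce_counts(w, c) for w, c in shuffle(pairs).items()]
        # counts == [('pig', 3), ('eats', 1), ('anything', 1), ('a', 2), ('is', 1)]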
  • 7. MapReduce Issues
    • Very low-level abstraction
    • Cumbersome Java API
    • Unfamiliar to data analysts
    • Rudimentary support for data pipelines
  • 8. Pig
    • Eats anything
    • SQL-like, procedural data flow language
    • Extensible with Java, Jython, Groovy, Ruby, or JavaScript
    • Provides opportunities to optimize workflows
  • 9. Alternatives
    • Java MapReduce API
    • Hadoop Streaming
    • Hive
    • Spark
    • Cascading
    • Cascalog
  • 10. Python
    • Data analysis: pandas, numpy, networkx
    • Machine learning: scikit-learn, milk
    • Scientific: scipy, pyephem, astropysics
    • Visualization: matplotlib, d3py, ggplot
  • 11. Pig Features
  • 12. Input/Output
    • HBase
    • Sequence File
    • JDBC Database
    • Hive Columnar
    • JSON
    • XML
    • CSV/TSV
    • Apache Log
    • Avro
    • Thrift
    • Protobuf
    • Regex
  • 13. Relational Operators
    LIMIT, GROUP, FILTER, CROSS, COGROUP, JOIN, STORE, DISTINCT, FOREACH, LOAD, ORDER, UNION
  • 14. Built-In Functions
    COS, SIN, AVG, SUM, COUNT, RANDOM, LOWER, UPPER, CONCAT, MAX, MIN, TOKENIZE
  • 15. User Defined Functions
    • Easy way to add arbitrary code to Pig
    • Eval: filter, aggregate, or evaluate (a minimal sketch follows below)
    • Storage: load/store data
    • Full support for Java and Jython
    • Experimental support for Groovy, Ruby, and JavaScript
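    As a taste of the Jython route, a minimal eval UDF might look like this; the file and function names are invented for illustration, and the outputSchema decorator is supplied by Pig's Jython engine, as on slide 31:

        # udfs.py -- a trivial eval UDF; Pig calls it once per input tuple.
        @outputSchema('name_upper:chararray')
        def to_upper(name):
            if name is None:       # Pig passes null fields through as None
                return None
            return name.upper()

    It would be registered with REGISTER 'udfs.py' USING jython AS udfs; and invoked as udfs.to_upper(name) inside a FOREACH ... GENERATE.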
  • 16. Census Example
  • 17. Getting Data
  • 18. Convert to TSV

        ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"

    • Uses the Geospatial Data Abstraction Library (GDAL) to convert the shapefile
    • TSV > CSV here: the WKT geometry and the CSA names both contain embedded commas
  • 19. Inspect Headers

        f = open('CSA_2010Census_DP1.tsv')
        header = f.readline()
        headers = header.strip('\n').split('\t')
        list(enumerate(headers))
        [(0, 'WKT'), (1, 'GEOID10'), (2, 'NAMELSAD10'), (3, 'ALAND10'), (4, 'AWATER10'), (5, 'INTPTLAT10'), (6, 'INTPTLON10'), (7, 'DP0010001'), ...]
  • 20. Pig Quick Start
    • Download a Pig distribution
    • Untar the package
    • Start Pig in local mode:

        pig -x local
        grunt> ls
        file:/data/CSA_2010Census_DP1.dbf<r 1>  841818
        file:/data/CSA_2010Census_DP1.prj<r 1>  167
        file:/data/CSA_2010Census_DP1.shp<r 1>  76180308
        file:/data/CSA_2010Census_DP1.shx<r 1>  3596
        file:/data/CSA_2010Census_DP1.tsv<r 1>  111224058

    http://pig.apache.org/releases.html
    https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  • 21. Loading Data

        grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
  • 22. Extracting Data

        grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
        grunt> extracted_no_types = FOREACH csas GENERATE $2 AS name, $7 AS population;
        grunt> describe extracted_no_types;
        extracted_no_types: {name: bytearray,population: bytearray}
  • 23. Adding Schema

        grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
        grunt> extracted = FOREACH csas GENERATE $2 AS name:chararray, $7 AS population:int;
        grunt> describe extracted;
        extracted: {name: chararray,population: int}
  • 24. Ordering

        grunt> ordered = ORDER extracted BY population DESC;
        grunt> dump ordered;
        ("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
        ("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
        ("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
        ("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
        ("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
        ("San Jose-San Francisco-Oakland, CA CSA",7468390)
        ("Dallas-Fort Worth, TX CSA",6731317)
        ("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)
  • 25. Storing Data

        grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');
        ls -a
        .part-m-00035.crc  .part-m-00115.crc  .pig_header  part-m-00077  part-m-00157
        .part-m-00036.crc  .part-m-00116.crc  .pig_schema  part-m-00078  part-m-00158
        .part-m-00037.crc  .part-m-00117.crc  _SUCCESS     part-m-00079  part-m-00159
        .part-m-00038.crc  .part-m-00118.crc  part-m-00000 part-m-00080  part-m-00160
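    Reading the results back outside Pig is then straightforward; a minimal sketch, assuming the extracted_data directory written by the STORE above ('-schema' puts the tab-separated column names in .pig_header):

        import glob

        # Column names written by PigStorage's '-schema' option.
        with open('extracted_data/.pig_header') as f:
            columns = f.readline().rstrip('\n').split('\t')

        # Each map/reduce task writes its own part-* file; read them all.
        rows = []
        for path in sorted(glob.glob('extracted_data/part-*')):
            with open(path) as f:
                for line in f:
                    rows.append(dict(zip(columns, line.rstrip('\n').split('\t'))))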
  • 26. Space Catalog Example
  • 27. Space Catalog
    • 14,000+ objects in the public catalog
    • Use Two Line Element (TLE) sets to propagate out positions and velocities
    • Can generate over 100 million positions and velocities per day
  • 28. Two Line Elements

        ISS (ZARYA)
        1 25544U 98067A   08264.51782528 -.00002182  00000-0 -11606-4 0  2927
        2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537

    • Use a Python script to convert TLEs to Pig-friendly TSV (a sketch follows below)
    • Create a Python UDF to parse a TLE into its parameters
    • Use a Python UDF with Java libraries to propagate out positions
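    A minimal sketch of that first conversion step, assuming the standard three-line TLE layout (name line, then lines 1 and 2); the file names are illustrative:

        # tle_to_tsv.py -- flatten three-line TLE records into one TSV row each.
        with open('gps-ops.txt') as src, open('gps-ops.tsv', 'w') as dst:
            while True:
                name = src.readline()
                line1 = src.readline()
                line2 = src.readline()
                if not line2:          # no more complete records
                    break
                dst.write('\t'.join([name.strip(), line1.strip(), line2.strip()]) + '\n')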
  • 29. Python UDFs
    • Easy way to extend Pig with new functions
    • Uses Jython, which is at Python 2.5
    • Cannot take advantage of libraries with C dependencies (e.g. numpy, scikits, etc.)
    • Can use Java classes
  • 30. TLE Parsing
    BSTAR drag term, columns 54-61: -11606-4 (decimal point assumed)
    In the TLE float format the leading decimal point of the mantissa and the 'e' of the exponent are implied, so -11606-4 denotes -0.11606 x 10^-4; the parser below rebuilds a string that float() can digest.

        def parse_tle_number(tle_number_string):
            split_string = tle_number_string.split('-')
            if len(split_string) == 3:
                new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2]) + 1)
            elif len(split_string) == 2:
                new_number = str(split_string[0]) + 'e-' + str(int(split_string[1]) + 1)
            elif len(split_string) == 1:
                new_number = '0.' + str(split_string[0])
            else:
                raise TypeError('Input is not in the TLE float format')
            return float(new_number)

    Full parser at https://gist.github.com/shawnhermans/4569360
  • 31. Simple UDF

        import tleparser

        @outputSchema("params:map[]")
        def parseTle(name, line1, line2):
            params = tleparser.parse_tle(name, line1, line2)
            return params
  • 32. Extract Parameters

        grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
        grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
        grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
        ([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,international_designator#12053A,classification#U,epoch_day#17.78040066,satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,ra_of_asc_node#344.5315])
  • 33. Storing Results

        grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
        grunt> STORE parsed INTO 'propagated-csv' USING PigStorage(',', '-schema');
  • 34. UDF with Java Import

        from jsattrak.objects import SatelliteTleSGP4

        @outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
        def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
            satellite = SatelliteTleSGP4(name, line1, line2)
            ecef_positions = []
            increment = (float(end_time) - float(start_time)) / float(number_of_points)
            current_time = start_time
            while current_time <= end_time:
                positions = [current_time]
                positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
                ecef_positions.append(tuple(positions))
                current_time += increment
            return ecef_positions
  • 35. Propagate Positions

        grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
        grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage() AS (name:chararray, line1:chararray, line2:chararray);
        grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2), myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
        grunt> DESCRIBE propagated;
        propagated: {params: map[],propagated: {positions: (time: double,x: double,y: double,z: double)}}
        grunt> flattened = FOREACH propagated GENERATE params#'satellite_number', FLATTEN(propagated);
        grunt> DESCRIBE flattened;
        flattened: {bytearray,propagated::time: double,propagated::x: double,propagated::y: double,propagated::z: double}
  • 36. Result

        (38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
        (38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
        (38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
        (38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
        (38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)
  • 37. Pig on Amazon EMR
  • 38-42. [Image-only slides; no transcript text]
  • 43. Pig with EMR
  • 44. Pig with EMR
    • SSH into the box to run an interactive Pig session
    • Load data to/from S3
    • Run standalone Pig scripts on demand (see the sketch below)
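    For the on-demand case, a minimal sketch with the era-appropriate boto library (2.x); the region, bucket names, and script path are placeholders, not anything from the talk:

        # Launch an EMR job flow that installs Pig and runs a script from S3.
        import boto.emr
        from boto.emr.step import InstallPigStep, PigStep

        conn = boto.emr.connect_to_region('us-east-1')
        jobid = conn.run_jobflow(
            name='pig-on-demand',
            log_uri='s3://my-bucket/emr-logs',
            steps=[
                InstallPigStep(),  # bootstrap Pig onto the cluster
                PigStep('run-script', pig_file='s3://my-bucket/scripts/job.pig'),
            ],
            num_instances=3,
        )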
  • 45. Conclusion
  • 46. Other Useful Tools
    • python-dateutil: super-duper date parser (see the sketch below)
    • Oozie: Hadoop workflow engine
    • Piggybank and Elephant Bird: 3rd-party Pig libraries
    • chardet: character-encoding detection library for Python
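    For example, python-dateutil infers the format on its own; a quick illustration:

        from dateutil import parser

        # No explicit strptime pattern needed; dateutil guesses the format.
        parser.parse('April 8, 2013')          # datetime.datetime(2013, 4, 8, 0, 0)
        parser.parse('2013-04-08T19:30:00Z')   # ISO 8601 with a timezone also works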
  • 47. Parting Thoughts
    • Great ETL tool/language
    • Flexible enough to write general-purpose MapReduce jobs
    • Limited, but emerging, 3rd-party libraries
    • Jython for UDFs is extremely limiting (Spark?)
    Twitter: @shawnhermans
    Email: shawnhermans@gmail.com