Big Data with Pig and Python
Shawn Hermans
Omaha Dynamic Languages User Group
April 8th, 2013




Tuesday, April 9, 13
About Me

                       • Mathematician/Physicist turned Consultant
                       • Graduate Student in CS at UNO
                       • Current Software Engineer at Sojern


Working with Big Data



What is Big Data?

         Data Source               Size
         Wikipedia Database Dump   9GB
         Open Street Map           19GB
         Common Crawl              81TB
         1000 Genomes              200TB
         Large Hadron Collider     15PB annually

         Gigabytes - normal size for relational databases
         Terabytes - relational databases may start to experience scaling issues
         Petabytes - relational databases struggle to scale without a lot of fine tuning
Working With Data

      Expectation vs. Reality:

                            •   Different File Formats
                            •   Missing Values
                            •   Inconsistent Schema
                            •   Loosely Structured
                            •   Lots of it




MapReduce

                       • Map - Emit key/value pairs from data
                       • Reduce - Collect data with common keys
                       • Tries to minimize moving data between nodes

                       Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview




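The map and reduce steps above can be sketched in plain Python as an in-memory word count, with no Hadoop involved (the function names here are made up for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) key/value pairs from each input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: collect values sharing a key, then sum them
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["big data with pig", "pig and python"]))
print(counts)  # {'big': 1, 'data': 1, 'with': 1, 'pig': 2, 'and': 1, 'python': 1}
```

On a cluster, the shuffle step routes each key to one reducer; the grouping dict above stands in for that machinery.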
MapReduce Issues

                       • Very low-level abstraction
                       • Cumbersome Java API
                       • Unfamiliar to data analysts
                       • Rudimentary support for data pipelines

Pig
                       • Eats anything
                       • SQL-like, procedural data flow language
                       • Extensible with Java, Jython, Groovy, Ruby
                         or JavaScript
                       • Provides opportunities to optimize
                         workflows



Alternatives
                       • Java MapReduce API
                       • Hadoop Streaming
                       • Hive
                       • Spark
                       • Cascading
                       • Cascalog
Python

                       • Data analysis - pandas, numpy, networkx
                       • Machine learning - scikits.learn, milk
                       • Scientific - scipy, pyephem, astropysics
                       • Visualization - matplotlib, d3py, ggplot

Pig Features



Input/Output
                       • HBase           • Sequence File
                       • JDBC Database   • Hive Columnar
                       • JSON            • XML
                       • CSV/TSV         • Apache Log
                       • Avro            • Thrift
                       • Protobuf        • Regex
Relational Operators
                       LIMIT   GROUP   FILTER    CROSS


              COGROUP          JOIN    STORE    DISTINCT


               FOREACH         LOAD    ORDER    UNION




Built In Functions
                       COS       SIN      AVG      SUM


                  COUNT         RANDOM   LOWER    UPPER


                CONCAT           MAX      MIN    TOKENIZE




User Defined Functions
                       • Easy way to add arbitrary code to Pig
                        • Eval - Filter, aggregate, or evaluate
                        • Storage - Load/Store data
                       • Full support for Java and Jython
                       • Experimental support for Groovy, Ruby and
                         JavaScript


Census Example


Getting Data




Convert to TSV
          ogr2ogr -f "CSV" CSA_2010Census_DP1.csv CSA_2010Census_DP1.shp -lco "GEOMETRY=AS_WKT" -lco "SEPARATOR=TAB"




                       • Uses Geospatial Data Abstraction Library
                         (GDAL) to convert to TSV
                       • TSV > CSV - tab separators avoid clashing with the
                         commas inside field values


Inspect Headers
                       f = open('CSA_2010Census_DP1.tsv')
                       header = f.readline()
                       headers = header.strip('\n').split('\t')
                       list(enumerate(headers))

                       [(0,   'WKT'),
                        (1,   'GEOID10'),
                        (2,   'NAMELSAD10'),
                        (3,   'ALAND10'),
                        (4,   'AWATER10'),
                        (5,   'INTPTLAT10'),
                        (6,   'INTPTLON10'),
                        (7,   'DP0010001'),
                         .
                         .
                         .




Pig Quick Start
      •       Download Pig Distribution

      •       Untar package

      •       Start Pig in local mode
                       pig -x local
                       grunt> ls
                        file:/data/CSA_2010Census_DP1.dbf<r 1>                                  841818
                       file:/data/CSA_2010Census_DP1.prj<r 1>                                  167
                       file:/data/CSA_2010Census_DP1.shp<r 1>                                  76180308
                       file:/data/CSA_2010Census_DP1.shx<r 1>                                  3596
                       file:/data/CSA_2010Census_DP1.tsv<r 1>                                  111224058


                                                  http://pig.apache.org/releases.html

                                      https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads

Loading Data

     grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();




Extracting Data
       grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
       grunt> extracted_no_types = FOREACH csas GENERATE $2
          AS name, $7 as population;
       grunt> describe extracted_no_types
       extracted_no_types: {name: bytearray,population: bytearray};




Adding Schema
       grunt> csas = LOAD 'CSA_2010Census_DP1.tsv' USING PigStorage();
       grunt> extracted = FOREACH csas GENERATE $2
          AS name:chararray, $7 as population:int;
       grunt> describe extracted;
       extracted: {name: chararray,population: int}




Ordering
        grunt> ordered = ORDER extracted by population DESC;
        grunt> dump ordered;

        ("New York-Newark-Bridgeport, NY-NJ-CT-PA CSA",22085649)
        ("Los Angeles-Long Beach-Riverside, CA CSA",17877006)
        ("Chicago-Naperville-Michigan City, IL-IN-WI CSA",9686021)
        ("Washington-Baltimore-Northern Virginia, DC-MD-VA-WV CSA",8572971)
        ("Boston-Worcester-Manchester, MA-RI-NH CSA",7559060)
        ("San Jose-San Francisco-Oakland, CA CSA",7468390)
        ("Dallas-Fort Worth, TX CSA",6731317)
        ("Philadelphia-Camden-Vineland, PA-NJ-DE-MD CSA",6533683)




Storing Data
  grunt> STORE extracted INTO 'extracted_data' USING PigStorage('\t', '-schema');




        ls -a
        .part-m-00035.crc   .part-m-00115.crc   .pig_header    part-m-00077   part-m-00157
        .part-m-00036.crc   .part-m-00116.crc   .pig_schema    part-m-00078   part-m-00158
        .part-m-00037.crc   .part-m-00117.crc   _SUCCESS       part-m-00079   part-m-00159
        .part-m-00038.crc   .part-m-00118.crc   part-m-00000   part-m-00080   part-m-00160
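
Pig splits its output across many part-m files, as the listing shows. A small helper can stitch them back together in Python; the directory layout matches the STORE above, but the function itself is a sketch:

```python
import glob
import os

def read_pig_output(directory):
    # Concatenate Pig's part files in order, skipping Hadoop's
    # bookkeeping entries (_SUCCESS, .crc checksums, .pig_header)
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, 'part-*'))):
        with open(path) as part:
            for line in part:
                rows.append(line.rstrip('\n').split('\t'))
    return rows
```

With the census data above, each row would come back as a [name, population] pair.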




Space Catalog Example



Space Catalog
                       • 14,000+ objects in public catalog
                       • Use Two Line Element sets to propagate
                         out positions and velocities
                       • Can generate over 100 million positions &
                         velocities per day




Two Line Elements
         ISS (ZARYA)
         1 25544U 98067A  08264.51782528 -.00002182 00000-0 -11606-4 0 2927
         2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537



        • Use Python script to convert to Pig friendly TSV
        • Create Python UDF to parse TLE into parameters
        • Use Python UDF with Java libraries to propagate out
                positions
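
The first step above - turning raw TLEs into a Pig-friendly TSV - can be sketched as a short Python script; the file names and the sample scratch file are made up for illustration, but the 3-line record layout is standard:

```python
# Write a tiny sample catalog (the ISS record from this slide) to a scratch file
sample = (
    'ISS (ZARYA)\n'
    '1 25544U 98067A  08264.51782528 -.00002182 00000-0 -11606-4 0 2927\n'
    '2 25544 51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537\n'
)
with open('sample-tles.txt', 'w') as f:
    f.write(sample)

def tles_to_tsv(tle_path, tsv_path):
    # A TLE catalog repeats a 3-line record: name line, then line 1 and line 2
    with open(tle_path) as infile, open(tsv_path, 'w') as outfile:
        lines = [line.rstrip('\n') for line in infile if line.strip()]
        for i in range(0, len(lines) - 2, 3):
            # One record per tab-separated row so PigStorage() can split it
            outfile.write('%s\t%s\t%s\n' % (lines[i].strip(), lines[i + 1], lines[i + 2]))

tles_to_tsv('sample-tles.txt', 'sample-tles.tsv')
```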



Python UDFs
                       • Easy way to extend Pig with new functions
                       • Uses Jython which is at Python 2.5
                       • Cannot take advantage of libraries with C
                         dependencies (e.g. numpy, scikits, etc...)
                       • Can use Java classes

TLE parsing
         BSTAR drag term, columns 54-61: -11606-4 (decimal point assumed)

         def parse_tle_number(tle_number_string):
             split_string = tle_number_string.split('-')
             if len(split_string) == 3:
                 new_number = '-' + str(split_string[1]) + 'e-' + str(int(split_string[2])+1)
             elif len(split_string) == 2:
                 new_number = str(split_string[0]) + 'e-' + str(int(split_string[1])+1)
             elif len(split_string) == 1:
                 new_number = '0.' + str(split_string[0])
             else:
                 raise TypeError('Input is not in the TLE float format')
             return float(new_number)

                                            Full parser at https://gist.github.com/shawnhermans/4569360
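
The parser's three branches can be exercised on the field shapes from the slide; the function is repeated here (with comments) so the snippet runs on its own:

```python
def parse_tle_number(tle_number_string):
    # TLE floats pack a value with an assumed leading decimal point,
    # optionally followed by a base-10 exponent, e.g. '-11606-4'
    split_string = tle_number_string.split('-')
    if len(split_string) == 3:    # negative mantissa and negative exponent
        new_number = '-' + split_string[1] + 'e-' + str(int(split_string[2]) + 1)
    elif len(split_string) == 2:  # positive mantissa, negative exponent
        new_number = split_string[0] + 'e-' + str(int(split_string[1]) + 1)
    elif len(split_string) == 1:  # plain field, decimal point assumed
        new_number = '0.' + split_string[0]
    else:
        raise TypeError('Input is not in the TLE float format')
    return float(new_number)

print(parse_tle_number('-11606-4'))  # -0.11606
print(parse_tle_number('0006703'))   # 0.0006703
```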

Simple UDF
        import tleparser

        @outputSchema("params:map[]")
        def parseTle(name, line1, line2):
            params = tleparser.parse_tle(name, line1, line2)
            return params




Extract Parameters
       grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
          AS (name:chararray, line1:chararray, line2:chararray);
       grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
       grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);

       ([bstar#,arg_of_perigee#333.0924,mean_motion#2.00559335,element_number#72,
       epoch_year#2013,inclination#54.9673,mean_anomaly#26.8787,rev_at_epoch#210,
       mean_motion_ddot#0.0,eccentricity#5.354E-4,two_digit_year#13,
       international_designator#12053A,classification#U,epoch_day#17.78040066,
       satellite_number#38833,name#GPS BIIF-3 (PRN 24),mean_motion_dot#-1.8E-6,
       ra_of_asc_node#344.5315])




Storing Results

       grunt> parsed = FOREACH gps GENERATE myfuncs.parseTle(*);
       grunt> STORE parsed INTO 'propagated-csv' using PigStorage(',','-schema');




UDF with Java Import
  from jsattrak.objects import SatelliteTleSGP4

  @outputSchema("propagated:bag{positions:tuple(time:double, x:double, y:double, z:double)}")
  def propagateTleECEF(name, line1, line2, start_time, end_time, number_of_points):
      satellite = SatelliteTleSGP4(name, line1, line2)
      ecef_positions = []
      increment = (float(end_time) - float(start_time)) / float(number_of_points)
      current_time = start_time

      while current_time <= end_time:
          positions = [current_time]
          positions.extend(list(satellite.calculateJ2KPositionFromUT(current_time)))
          ecef_positions.append(tuple(positions))
          current_time += increment

      return ecef_positions




Propagate Positions
    grunt> REGISTER 'tleUDFs.py' USING jython AS myfuncs;
    grunt> gps = LOAD 'gps-ops.tsv' USING PigStorage()
      AS (name:chararray, line1:chararray, line2:chararray);
    grunt> propagated = FOREACH gps GENERATE myfuncs.parseTle(name, line1, line2),
      myfuncs.propagateTleECEF(name, line1, line2, 2454992.0, 2454993.0, 100);
    grunt> DESCRIBE propagated;
    propagated: {params: map[],propagated: {positions:
      (time: double,x: double,y: double,z: double)}}
    grunt> flattened = FOREACH propagated
      GENERATE params#'satellite_number', FLATTEN(propagated);
    grunt> DESCRIBE flattened;
    flattened: {bytearray,propagated::time: double,propagated::x: double,
      propagated::y: double,propagated::z: double}




Result

  (38833,2454992.9599999785,2.278136816721697E7,7970303.195970464,-1.1066153998664627E7)
  (38833,2454992.9699999783,2.2929498370345607E7,1.0245812732430315E7,-8617450.742994161)
  (38833,2454992.979999978,2.2713614118860725E7,1.2358665040019082E7,-6031915.392826946)
  (38833,2454992.989999978,2.213715624812226E7,1.4275325605036272E7,-3350605.7983842064)
  (38833,2454992.9999999776,2.1209296863515433E7,1.5965381866069315E7,-616098.4598421039)




Pig on Amazon EMR



Pig with EMR

                       • SSH in to box to run interactive Pig session
                       • Load data to/from S3
                       • Run standalone Pig scripts on demand


Conclusion



Other Useful Tools
                       • Python-dateutil : Super-duper date parser
                       • Oozie : Hadoop workflow engine
                       • Piggybank and Elephant Bird : 3rd party Pig
                         libraries
                       • Chardet: Character detection library for
                         Python



Parting Thoughts
                       •   Great ETL tool/language

                       •   Flexible enough to write general purpose
                           MapReduce jobs

                       •   Limited, but emerging 3rd party libraries

                       •   Jython for UDFs is extremely limiting (Spark?)

       Twitter: @shawnhermans
       Email: shawnhermans@gmail.com


