Dachis Group Pig Hackday: Pig 202
Slides for Pig 202 tutorial presented by Timothy Potter at DG Pig Hackday, May 11, 2012


  • UFO Sightings from Infochimps: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada
    US Cities / States with Geo-codes from census.gov: http://www.census.gov/geo/www/gazetteer/files/Gaz_places_national.txt
    Started out as a new hire programming challenge.
  • ufo_sightings = LOAD '/Users/timpotter/dev/data/ufo_awesome.tsv' AS (
        sighted_at: chararray, reported_at: chararray, location: chararray,
        shape: chararray, duration: chararray, description: chararray
    );
    ufo_sightings_split_loc = FOREACH (
        FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
    ) {
        split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
        split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
        city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
        state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
        GENERATE city_lc AS city, state_lc AS state, sighted_at,
                 SUBSTRING(sighted_at,0,4) AS sighted_year, reported_at,
                 TRIM(shape) AS shape, duration, description;
    };
    ufo_sightings_by_city = FILTER ufo_sightings_split_loc BY city IS NOT NULL AND state IS NOT NULL;
  • us_cities = LOAD '/Users/timpotter/dev/data/usa_cities_and_towns.tsv' AS (
        state: chararray, geoid: chararray, ansicode: chararray, name: chararray,
        aland: long, awater: long, aland_sqmi: double, awater_sqmi: double,
        latitude: double, longitude: double
    );
    us_cities_w_geo = FOREACH us_cities {
        city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
        GENERATE TRIM(city_name) AS city, TRIM(LOWER(state)) AS state, latitude, longitude;
    };
  • Need to join our sightings data with US cities data to A) filter out non-US cities and B) attach the lat / lng to the sighting.
    Realize that after the JOIN, your data won't be sorted if you use REPLICATED; it will be sorted if you don't use replicated.
    To quote the Pig book: Pig's optimizer is between your chair and your keyboard.
  • Seattle Image from http://mylocalhealthguide.com/north-seattle-group-targeting-underage-drinking-meets-dec-15/
  • When using CROSS, minimize the size of the relations you're crossing. That is why I'm grouping by state + city + lat + lng and just flattening the group-by key.
    When joining, list the largest relation on the left and the smallest on the right.
  • Image from: http://7fny.com/sub/m/m_pig_rider_xdRkS0SY.jpg
  • image: http://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Gradient_ascent_%28surface%29.png/450px-Gradient_ascent_%28surface%29.png

Dachis Group Pig Hackday: Pig 202 Presentation Transcript

  • dachisgroup.com
    Dachis Group
    Las Vegas 2012
    Intermediate Pig Know How
    Timothy Potter (Twitter: thelabdude)
    Pigout Hackday, Austin TX
    May 11, 2012
    ® 2011 Dachis Group.
  • Agenda
    UFO Sightings Data Set
      1. Which US city has the most UFO sightings overall?
      2. What is the most common UFO shape within a 100 mile radius of your answer for #1?
    Pig Mahout Example: Training 20 Newsgroups Classifier
      • Loading messages using a custom loader
      • Hashed Feature Vectors
      • Train Logistic Regression Model
      • Evaluate Model on held-out Data
  • UFO Sightings
      1. What US city has the most UFO sightings overall?
      2. What is the most common UFO shape within a 100 mile radius of your answer for #1?
    Using Two Data Sets:
      • UFO sightings data set available from Infochimps
      • US city / states with geo-codes available from US Census
  • Load Sightings Data
    Sample rows:
        19930809 19990816 Westminster, CO triangle 1 minute A white puffy cottonball appeared and then a triangle ...
        20010111 20010113 Pueblo, CO fireball 30 sec Blue fireball lights up the skies of colorado and nebraska ...
        20001026 20030920 Aurora, CO triangle 10 Minutes Triangular craft (two footbal fields in size)As reported to Art Bell ...
    Pig provides functions for doing basic text munging tasks, or use a UDF:
        ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
            sighted_at: chararray, reported_at: chararray, location: chararray,
            shape: chararray, duration: chararray, description: chararray
        );
        ...
        ufo_sightings_split_loc = FOREACH (
            FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
        ) {
            split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
            split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
            city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
            state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
            GENERATE city_lc AS city, state_lc AS state, ...
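To see what the location regex on this slide captures, here is a minimal Java sketch that applies the same pattern outside Pig. The class and method names are hypothetical, and the use of `Matcher.find()` is an assumption about matching behavior rather than a statement about how Pig's REGEX_EXTRACT is implemented.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: mirrors the city/state split from the Pig script. */
final class LocationSplitSketch {
    // Same pattern as the slide: group 1 = city, group 3 = two-letter state.
    private static final Pattern CITY_STATE =
        Pattern.compile("([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})");

    /** Returns {city, state} lower-cased, or null when the location does not match. */
    static String[] split(String location) {
        if (location == null) {
            return null;
        }
        Matcher m = CITY_STATE.matcher(location.trim());
        if (!m.find()) { // assumption: first match anywhere in the string
            return null;
        }
        return new String[] {
            m.group(1).toLowerCase(Locale.ROOT),
            m.group(3).toLowerCase(Locale.ROOT)
        };
    }

    public static void main(String[] args) {
        String[] parts = split("Westminster, CO");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Locations that do not look like "City, ST" come back as null, which is why the script filters on city and state being non-null afterwards.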
  • Load US Cities Data with geo-codes
    Sample rows:
        CO 0862000 02411501 Pueblo city 138930097 2034229 53.641 0.785 38.273147 -104.612378
        CO 0883835 02412237 Westminster city 81715203 5954681 31.550 2.299 39.882190 -105.064426
        CO 0804000 02409757 Aurora city 400759192 1806832 154.734 0.698 39.688002 -104.689740
    Use projection to select only the fields you want to work with: city, state, latitude, longitude
        us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
            state: chararray, geoid: chararray, ansicode: chararray, name: chararray,
            ....
            latitude: double, longitude: double
        );
        us_cities_w_geo = FOREACH us_cities {
            city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name, ' '));
            GENERATE TRIM(city_name) AS city, TRIM(LOWER(state)) AS state, latitude, longitude;
        };
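The SUBSTRING / LAST_INDEX_OF step drops the trailing place type ("city") from the census name so that it can be joined against the sightings data. A rough Java equivalent, with a hypothetical class name and an added guard for names without a space (an assumption, since the slide does not cover that case), might look like this:

```java
import java.util.Locale;

/** Illustrative only: normalizes a census place name such as "Pueblo city". */
final class CityNameSketch {
    /** Lower-cases the name and removes the last space-separated word. */
    static String stripPlaceType(String name) {
        if (name == null) {
            return null;
        }
        String trimmed = name.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            // Assumption: a single-word name is kept as-is rather than dropped.
            return trimmed.toLowerCase(Locale.ROOT);
        }
        return trimmed.substring(0, lastSpace).trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripPlaceType("Westminster city"));
    }
}
```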
  • What US city has the most UFO sightings overall?
    Things to consider ...
    1. Need to select only sightings from US cities
       Join sightings data with US city data
    2. Need to count sightings for each city
       Group results from step 1 by state/city and count
    3. Need to do a TOP to get the city with the most sightings
       Descending sort on count and choose the top.
  • What US city has the most UFO sightings overall?
        -- Inner JOIN by (state,city) to attach geo-codes to sightings
        ufo_sightings_with_geo = FOREACH (
            JOIN ufo_sightings_by_city BY (state,city),
                 us_cities_w_geo BY (state,city) USING 'replicated'
        ) GENERATE
            ufo_sightings_by_city::state AS state,
            ufo_sightings_by_city::city AS city,
            ufo_sightings_by_city::sighted_at AS sighted_at,
            ufo_sightings_by_city::sighted_year AS sighted_year,
            ufo_sightings_by_city::shape AS shape,
            us_cities_w_geo::latitude AS latitude,
            us_cities_w_geo::longitude AS longitude;

        -- Group by (state,city) to get number of sightings for each city
        grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
            GENERATE FLATTEN($0) AS (state,city,latitude,longitude), COUNT($1) AS the_count;

        -- Poor man's TOP
        most_freq = ORDER grp_by_state_city BY the_count DESC;
        top_city_state = LIMIT most_freq 1;
        DUMP top_city_state;
  • What US city has the most UFO sightings overall?
        (seattle,wa,446,light,47.620499,-122.350876)
    Seattle only averages 58 sunny days a year. Coincidence?
    Maybe all the UFOs are coming to look at the Space Needle?
  • Pig Explain: Pull back the covers ...
        pig -x local -e 'explain -script ufo.pig'

        -- Job 1 - Mapper
        ufo_sightings_with_geo = FOREACH (
            JOIN ufo_sightings_by_city BY (state,city),
                 us_cities_w_geo BY (state,city) USING 'replicated'
        ) GENERATE
            ufo_sightings_by_city::state AS state,
            ufo_sightings_by_city::city AS city,
            ufo_sightings_by_city::sighted_at AS sighted_at,
            ufo_sightings_by_city::sighted_year AS sighted_year,
            ufo_sightings_by_city::shape AS shape,
            us_cities_w_geo::latitude AS latitude,
            us_cities_w_geo::longitude AS longitude;

        -- Job 1 - Reducer
        grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
            GENERATE FLATTEN($0) AS (state,city,latitude,longitude), COUNT($1) AS the_count;

        -- Job 2 - Full Map/Reduce
        most_freq = ORDER grp_by_state_city BY the_count DESC;
        top_city_state = LIMIT most_freq 1;
        DUMP top_city_state;
  • What is the most common UFO shape within a 100 mile radius of your answer for #1?
    Things we need to solve this ...
    1) Some way to calculate geographical distance from a geographical location (lat / lng)
    2) Iterate over all cities that have sightings to get the distance from our centroid
    3) Filter by distance and count shapes
  • UDF: User Defined Function
        REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;
        DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();
        ...
        with_distance = FOREACH calc_dist {
            GENERATE city, state,
                     CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
        };
    Let's build a UDF that uses the Haversine Formula to calculate the distance between two points.
    See: http://en.wikipedia.org/wiki/Haversine_formula
  • UDF: User Defined Function
        import java.io.IOException;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        public class GeoDistance extends EvalFunc<Double> {

            public Double exec(Tuple input) throws IOException {
                // All four coordinates are required; otherwise return null
                if (input == null || input.size() < 4 ||
                    input.isNull(0) || input.isNull(1) ||
                    input.isNull(2) || input.isNull(3)) {
                    return null;
                }
                Double dist = null;
                try {
                    Double fromLat = (Double)input.get(0);
                    Double fromLng = (Double)input.get(1);
                    Double toLat = (Double)input.get(2);
                    Double toLng = (Double)input.get(3);
                    dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
                } catch (Exception exc) {
                    // better to return null than to throw exception
                }
                return dist;
            }

            protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
                // details excluded for brevity; see http://www.movable-type.co.uk/scripts/latlong.html
                return dist;
            }
        }
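The slide leaves out the body of haversineDistanceInMiles. The following is a minimal sketch of the standard haversine formula, not the author's actual implementation. The class name and the Earth radius constant (about 3958.8 miles, a commonly used mean value) are assumptions; the argument order (lat1, lat2, lon1, lon2) follows the signature on the slide.

```java
/** Illustrative only: a stand-alone haversine helper with no Pig dependency. */
final class HaversineSketch {
    // Assumed mean Earth radius in miles.
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between two lat/lng points given in degrees. */
    static double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Seattle (from the result slide) to Pueblo, CO (from the cities sample rows).
        double miles = haversineDistanceInMiles(47.620499, 38.273147, -122.350876, -104.612378);
        System.out.printf("%.1f miles%n", miles);
    }
}
```

Keeping the math in a plain static method like this makes it easy to unit test separately from the EvalFunc wrapper.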
  • What is the most common UFO shape ...
        top_city = FOREACH top_city_state
            GENERATE city, state, latitude AS from_lat, longitude AS from_lng;

        -- Including lat / lng in group by key to help reduce number of records I'm crossing
        sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
            GENERATE FLATTEN($0) AS (state,city,latitude,longitude);

        -- Pig only supports equi-joins so we need to use CROSS to get the lat / lng
        -- of the two points to calculate distance using our UDF
        calc_dist = FOREACH (CROSS sighting_cities, top_city) GENERATE
            sighting_cities::city AS city,
            sighting_cities::state AS state,
            sighting_cities::latitude AS to_lat,
            sighting_cities::longitude AS to_lng,
            CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                            sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;

        near = FILTER calc_dist BY dist_in_miles < 100;

        -- When joining, list largest relation first and smallest last
        -- and optimize if possible, such as using 'replicated'
        shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city),
                               near BY (state,city) USING 'replicated')
            GENERATE ufo_sightings_with_geo::shape AS shape;

        count_shapes = FOREACH (GROUP shapes BY shape)
            GENERATE $0 AS shape, COUNT($1) AS the_count;

        sorted_counts = ORDER count_shapes BY the_count DESC;
  • Visualize Results
    In Pig:
        fs -getmerge sorted_counts sorted_counts.txt
    In R:
        shapes <- read.table("sorted_counts.txt", header=F, sep="\t",
                             col.names=c("shape","occurs"), stringsAsFactors=F)
        barplot(c(shapes$occurs), main="UFO Sightings (Shapes)",
                ylab="Number of Sightings", ylim=c(0,500), cex.names=0.8,
                las=2, names.arg=c(shapes$shape))
  • Set Logic in Pig
    Use Pig's IsEmpty function to isolate records that only occur in one of the relations ... such as sightings in cities not in the US census list:
        city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
                                 us_cities_w_geo BY (state,city);
        outside_us_sightings = FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo))
            GENERATE FLATTEN(ufo_sightings_by_city);
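The COGROUP plus IsEmpty pattern on this slide is effectively an anti-join: keep the rows from one side whose key has no match on the other side. As a rough illustration of that logic outside Pig, the sketch below uses plain Java collections. The class name, the "state|city" key encoding, and the sample keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: anti-join of sighting keys against census keys. */
final class AntiJoinSketch {
    /** Returns the sighting keys that have no matching census key, in input order. */
    static List<String> onlyInLeft(List<String> sightingKeys, Set<String> censusKeys) {
        List<String> unmatched = new ArrayList<>();
        for (String key : sightingKeys) {
            if (!censusKeys.contains(key)) { // plays the role of IsEmpty(us_cities_w_geo)
                unmatched.add(key);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Hypothetical keys in "state|city" form.
        List<String> sightings = Arrays.asList("wa|seattle", "on|toronto", "co|pueblo");
        Set<String> census = new HashSet<>(Arrays.asList("wa|seattle", "co|pueblo"));
        System.out.println(onlyInLeft(sightings, census));
    }
}
```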
  • Mahout and Pig
    Example Integration: Pig-Vector
    GitHub project by Ted Dunning, Mahout Committer
    https://github.com/tdunning/pig-vector
    Use Case: Train Logistic Regression Model from Pig
    Hello World of ML: 20 Newsgroups
  • Mahout and Pig
    Step 1: Load the Training Data
    Load 20-newsgroups messages using custom Pig LoadFunc:
        docs = LOAD '20news-bydate-train/*/*'
            USING org.apache.mahout.pig.MessageLoader()
            AS (newsgroup, id:int, subject, body);
    In Java:
        public class MessageLoader extends LoadFunc {
            public void setLocation(String location, Job job) throws IOException {
                // setup where we're reading data from
            }
            public InputFormat getInputFormat() throws IOException {
                return new TextInputFormat() {
                    // ...
                };
            }
            public Tuple getNext() throws IOException {
                // parse message and build Tuple that matches the schema
            }
        }
  • Mahout and Pig
    Step 2: Vectorize using Pig-Vector UDF
        -- Import UDF, define vectorizing strategy and fixed size of feature vector
        DEFINE encodeVector org.apache.mahout.pig.encoders.EncodeVector(
            '100000', 'subject+body',
            'group:word, article:numeric, subject:text, body:text');
        vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) AS v;
    Result is a hashed feature vector where features are mapped to indexes in a fixed size sparse vector (from Mahout).
    Fixed sized vectors are needed to train Mahout's SGD-based logistic regression model.
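To make "hashed feature vector" concrete, here is a minimal sketch of the hashing trick in plain Java. It is not Mahout's or Pig-Vector's encoder: the class name, the tokenization, the use of String.hashCode, and the dense double[] (where the slide mentions a sparse vector) are all simplifying assumptions.

```java
import java.util.Locale;

/** Illustrative only: maps tokens to slots of a fixed-size vector by hashing. */
final class HashedFeaturesSketch {
    /** Counts each token into the slot chosen by its hash modulo the vector size. */
    static double[] encode(String text, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("Pig pig Mahout", 16);
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

Because the vector size is fixed up front, no vocabulary pass over the data is needed, at the cost of occasional collisions where two different tokens share a slot.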
  • Mahout and Pig
    Step 3: Train the Model
        DEFINE train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

        /* put the training data in a single bag. We could train multiple models this way */
        grouped = group vectors all;

        /* train the actual model. The key is bogus to satisfy the sequence vector format. */
        model = foreach grouped generate 1 as key, train(vectors) as model;

        store model into 'pv-tmp/news_model' using PigModelStorage();
  • Mahout and Pig
    Step 4: Evaluate the Model
        DEFINE evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');

        test = load '20news-bydate-test/*/*'
            using org.apache.mahout.pig.MessageLoader()
            as (newsgroup, id:int, subject, body);

        testvecs = foreach test generate newsgroup, encodeVector(*) as v;
        evalvecs = foreach testvecs generate evaluate(v);
  • Questions?
    For slides and Pig script, email me at: tim.potter@dachisgroup.com
    Twitter: thelabdude