Apache Pig
on Amazon AWS
Swine Not?
What is Apache Pig?
Pig is an execution framework that interprets
scripts written in a language called Pig Latin
and then runs them on a Hadoop cluster.
(Disturbing logo -->)
Pig is a tool that...
● creates complex jobs that efficiently process
large volumes of data
● supports many relational features, making it
easy to join, group, and aggregate data
● performs ETL tasks quickly, on many
servers simultaneously
What is Pig Latin?
It is a high-level data transformation language
that:
● lets you concentrate on the data
transformations you require
Rather than:
● forcing you to be concerned with individual
map and reduce functions
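To make the contrast concrete, here is a rough sketch in plain Python (not Hadoop code) of the kind of map and reduce functions you would otherwise write by hand to count records per key. The names `map_fn`, `reduce_fn`, and `run_job`, and the sample data, are made up for illustration.

```python
from collections import defaultdict

def map_fn(record):
    # Emit one (key, 1) pair per record; here the key is the record itself.
    yield record, 1

def reduce_fn(key, values):
    # Combine all values that were emitted for one key.
    return key, sum(values)

def run_job(records):
    # Stand-in for the framework's shuffle step: group mapped values by key.
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            shuffled[key].append(value)
    return dict(reduce_fn(k, v) for k, v in shuffled.items())

# Hypothetical sample data: three referrer values.
sample = ["-", "http://example.org/", "-"]
counts = run_job(sample)
print(counts)
```

In Pig Latin the same idea is expressed as a GROUP followed by a COUNT, and Pig works out the map and reduce steps for you.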
Walkthrough - Create a Job Flow
* Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
And now we wait...
SSH into master instance
$ ssh -i ~/keys/crocs.pem -l hadoop ec2-54-215-107-197.us-west-1.compute.amazonaws.com
Type "pig" to enter the grunt shell
$ pig
grunt> _
It's a freakin' shell!
grunt> pwd
hdfs://10.174.115.214:9000/
You can enter the HDFS file system:
grunt> cd hdfs:///
grunt> ls
hdfs://10.174.115.214:9000/mnt <dir>
Even enter an S3 bucket:
grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
grunt> ls
s3://elasticmapreduce/samples/pig-apache/input/access_log_1<r 1> 8754118
s3://elasticmapreduce/samples/pig-apache/input/access_log_2<r 1> 8902171
Load Piggybank - Open source library, user
contributed functions
grunt> register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE the EXTRACT alias from piggybank
grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
LOAD
Use TextLoader (a built-in Pig function) to load
each line of the source file:
grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader as (line:chararray);
ILLUSTRATE
Shows, step by step, how Pig would
transform a small sample of data
grunt> illustrate RAW_LOGS;
Connecting to hadoop file system at: hdfs://10.174.115.214:9000
Connecting to map-reduce job tracker at: 10.174.115.214:9001
...
---------------------------------------------------------------
| RAW_LOGS | line:chararray |
---------------------------------------------------------------
| | 65.55.106.160 - - [21/Jul/2009:02:29:56 -0700]
"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-"
"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
---------------------------------------------------------------
Now let's:
● split each line into fields
● store everything in a bag
grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
FLATTEN(
EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
)
as (
remoteAddr: chararray,
remoteLogname: chararray,
user: chararray,
time: chararray,
request: chararray,
status: int,
bytes_string: chararray,
referrer: chararray,
browser: chararray
);
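If you want to see what that regular expression captures without spinning up a cluster, here is a minimal Python sketch that applies the same pattern to the sample line from the ILLUSTRATE output. It assumes the pattern uses the usual regex escapes (`\S`, `\w`, `\s`, `\d`); Python's raw string needs only single backslashes, whereas a Pig string literal doubles them. `parse_line` and `FIELDS` are names invented for this sketch.

```python
import re

# Same pattern as the EXTRACT call, written as a Python raw string.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Field names copied from the "as (...)" clause of the Pig statement.
FIELDS = ("remoteAddr", "remoteLogname", "user", "time", "request",
          "status", "bytes_string", "referrer", "browser")

def parse_line(line):
    # Return a dict of the nine captured fields, or None if no match.
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))

sample = ('65.55.106.160 - - [21/Jul/2009:02:29:56 -0700] '
          '"GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" '
          '"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"')
record = parse_line(sample)
print(record["time"], record["status"], record["referrer"])
```

Every value comes back as a string here; in the Pig statement, the `status: int` declaration is what turns the status into a number.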
ILLUSTRATE an example of our work
grunt> illustrate LOGS_BASE;
...
| LOGS_BASE |
| remoteAddr:chararray | 74.125.74.193
| remoteLogname:chararray | -
| user:chararray | -
| time:chararray | 20/Jul/2009:20:30:55 -0700
| request:chararray | GET /gwidgets/alexa.xml HTTP/1.1
| status:int | 200
| bytes_string:chararray | 2969
| referrer:chararray | -
| browser:chararray | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
Create a bag containing tuples with just the
referrer element (limit 10 items):
grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
grunt> TEMP = LIMIT REFERRER_ONLY 10;
Output the contents of the bag:
grunt> DUMP TEMP;
Pig features used in the script: LIMIT
File concatenation threshold: 100 optimistic? false
MR plan size before optimization: 1
MR plan size after optimization: 1
Pig script settings are added to the job
creating jar file Job5394669249002614476.jar
Setting up single store job
1 map-reduce job(s) waiting for submission.
...
More log output before we get our results (cleaned
up here)
...
Input(s):
Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
Output(s):
Successfully stored 10 records (126 bytes) in: "hdfs://10.174.115.214:9000/tmp/temp948493830/tmp76754790"
Counters:
Total records written : 10
...
Voila! Our exciting results:
(-)
(-)
(-)
(-)
(-)
(-)
(http://example.org/)
(http://example.org/)
(-)
(-)
First 10 referrers (the dashes represent no
referrer)
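For intuition, the FOREACH ... GENERATE and LIMIT steps behave roughly like a projection followed by "take the first N". The Python sketch below mimics that on a few made-up rows (the `logs_base` list is hypothetical sample data, not taken from the real log file).

```python
from itertools import islice

# Hypothetical stand-in for the LOGS_BASE relation.
logs_base = [
    {"remoteAddr": "74.125.74.193", "referrer": "-"},
    {"remoteAddr": "65.55.106.160", "referrer": "http://example.org/"},
    {"remoteAddr": "65.55.106.160", "referrer": "-"},
]

# FOREACH ... GENERATE referrer: keep one field, as a single-element tuple.
referrer_only = ((row["referrer"],) for row in logs_base)

# LIMIT 2: stop after the first two tuples.
temp = list(islice(referrer_only, 2))
print(temp)
```

One difference worth knowing: on a real cluster, LIMIT without a preceding ORDER BY makes no promise about which tuples you get back.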
Now let's keep only the referrals from bing.com*
grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
grunt> TEMP = LIMIT FILTERED 9;
grunt> DUMP TEMP;
(http://www.bing.com/search?q=login)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=value)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=views)
(http://www.bing.com/search?q=search)
(http://www.bing.com/search?q=philmont)
* We all use Bing, am I right?
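Why the `.*` on both sides of the pattern? Pig's `matches` operator has to match the whole string, not just part of it. Here is a small Python sketch of that behavior, using `re.fullmatch` as the closest equivalent; `pig_matches` is a helper name invented for this sketch, and the referrer list is sample data in the style of the output above.

```python
import re

def pig_matches(value, pattern):
    # Pig's `matches` must match the entire string, like re.fullmatch.
    return re.fullmatch(pattern, value) is not None

referrers = ["-", "http://example.org/", "http://www.bing.com/search?q=login"]

# Equivalent of: FILTER REFERRER_ONLY BY referrer matches '.*bing.*'
filtered = [r for r in referrers if pig_matches(r, r".*bing.*")]
print(filtered)

# Without the leading and trailing .*, the same URL is rejected.
bare = pig_matches("http://www.bing.com/search?q=login", r"bing")
print(bare)
```

Note that `.*bing.*` accepts any referrer with "bing" anywhere in it, so a stricter pattern would be needed to match only bing.com hosts.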
Don't forget to terminate your Job Flow
Amazon will charge you even if it's idle!
