Pig 0.8 New Features
Alan F. Gates

Who am I?
  • Pig committer and PMC member
  • Architect in the grid team at Yahoo
Photo credit: Steven Guarnaccia, The Three Little Pigs

Focus of Pig 0.8
  • Usability
  • Integration
  • Performance
  • Backwards compatibility with 0.7

UDFs in Scripting Languages
  • Evaluation functions can now be written in scripting languages that compile down to the JVM
  • Reference implementation provided in Jython
  • JRuby and others could be added with minimal code
  • JavaScript implementation in progress
  • Jython sold separately

Example Python UDF
test.py:
    @outputSchema("sqr:long")
    def square(num):
        return ((num)*(num))
test.pig:
    register 'test.py' using jython as myfuncs;
    A = load 'input' as (i:int);
    B = foreach A generate myfuncs.square(i);
    dump B;

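A minimal sketch of how test.py could be made importable outside Pig for quick local checks. It assumes that Pig supplies the `outputSchema` decorator when it loads the script through Jython; the fallback decorator and the `None` guard below are illustrative additions, not part of the slide.

```python
# Stand-in for the decorator that Pig is assumed to provide at load time.
# Outside Pig the name is undefined, so define a fallback that only
# records the schema string on the function.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("sqr:long")
def square(num):
    # Hypothetical guard: pass missing values through instead of raising.
    if num is None:
        return None
    return num * num


result = square(4)
```

With the fallback in place the function can be exercised with plain Python before the script is registered from Pig Latin.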
Better statistics
  • Statistics printed out at end of job run
  • Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage
  • Loader for reading job history files included in Piggybank
  • New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
  • Can also pass a listener to track Pig jobs as they run
  • Done for Oozie so it can show users Pig statistics

Sample stats info
Job Stats (time in seconds):
JobId  Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
job_0  2     1        15    3     9    27    27    27   a,b,c,d,e
job_1  1     1        3     3     3    12    12    12   g,h
job_2  1     1        3     3     3    12    12    12   i
job_3  1     1        3     3     3    12    12    12   i
Input(s):
Successfully read 10000 records from: "studenttab10k"
Successfully read 10000 records from: "votertab10k"
Output(s):
Successfully stored 6 records (150 bytes) in: "outfile"
Counters:
Total records written : 6
Total bytes written : 150

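As a rough illustration of mining these statistics, the sketch below parses a job stats table laid out like the sample on the slide. It assumes whitespace-separated columns with MxMT/MnMT/AMT as max/min/average map time and MxRT/MnRT/ART as the reduce-side equivalents; the real output of a given Pig version may be laid out differently, and `parse_job_stats` is a hypothetical helper.

```python
# Two rows copied from the slide, with the fused header columns separated.
SAMPLE = """\
JobId Maps Reduces MxMT MnMT AMT MxRT MnRT ART Alias
job_0 2 1 15 3 9 27 27 27 a,b,c,d,e
job_1 1 1 3 3 3 12 12 12 g,h
"""


def parse_job_stats(text):
    """Turn the whitespace-separated table into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Every column between JobId and Alias is numeric.
        for key in header[1:-1]:
            row[key] = int(row[key])
        row["Alias"] = row["Alias"].split(",")
        rows.append(row)
    return rows


jobs = parse_job_stats(SAMPLE)
# Find the job with the longest-running reduce task.
slowest = max(jobs, key=lambda job: job["MxRT"])
```

Keeping the aliases as a list makes it straightforward to trace a slow job back to the relations in the script that produced it.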
Invoke Static Java Functions as UDFs
  • Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs
    define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    A = load 'encoded.txt' as (e:chararray);
    B = foreach A generate UrlDecode(e, 'UTF-8');
  • Currently only works with simple types and static functions

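The idea of binding a UDF to an existing function by its qualified name can be sketched in Python. This is an analogy under stated assumptions, not Pig's implementation: `invoker` is a hypothetical helper, and `urllib.parse.unquote_plus` stands in for Java's URLDecoder.decode because both decode form-encoded strings.

```python
import importlib


def invoker(qualified_name):
    """Resolve a dotted 'module.function' name to a callable at runtime."""
    module_name, func_name = qualified_name.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Bind a name to an existing library function, much as the define
# statement binds UrlDecode to a static Java method.
url_decode = invoker("urllib.parse.unquote_plus")
decoded = url_decode("pig%20latin+rocks%21", encoding="utf-8")
```

The function is looked up once and then called like any other function, which is why this approach suits static functions over simple types.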
Improved HBase Integration
  • Can now read records as bytes instead of auto converting to strings
  • Filters can be pushed down
  • Can store data in HBase as well as load from it
  • Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.

Casting Relations to Scalars
  • Say you want to calculate the percentage of page views per browser type (e.g. IE, Firefox, etc.)
    views = load 'views' as (url, browser);
    gv = group views all;
    numviews = foreach gv generate COUNT(views) as total;
    gb = group views by browser;
    perbrowser = foreach gb generate group, COUNT(views) / (long)numviews.total;
  • Now it is possible to cast the relation numviews to a scalar value for use in later calculations
  • Pig handles storing the results in a file and retrieving it when needed
  • Only works for single row results

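The computation behind this example can be sketched in plain Python: the total is computed once as a scalar and then reused for every browser group. The sample rows are made up for illustration, and floating-point division is used here so the shares come out as fractions.

```python
from collections import Counter

# Hypothetical (url, browser) rows standing in for the 'views' input.
views = [
    ("/a", "IE"),
    ("/b", "Firefox"),
    ("/c", "IE"),
    ("/d", "Chrome"),
]

# Equivalent of numviews.total: a single scalar for the whole data set.
total = len(views)

# Equivalent of grouping by browser and counting each group.
per_browser = Counter(browser for _, browser in views)

# The scalar is reused in every group's calculation.
share = {browser: count / total for browser, count in per_browser.items()}
```

In Pig the scalar has to come from a relation with exactly one row, which is the role `total` plays here.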
Integrating MapReduce Jobs
  • Sometimes you need to integrate MR and Pig jobs
      • Legacy code
      • Algorithm that’s hard to implement in Pig
    A = load 'WordcountInput.txt';
    B = mapreduce 'wordcount.jar'
        store A into 'inputDir'
        load 'outputDir' as (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;
    C = foreach B …

Plus a Whole Lot More
  • Custom Partitioners
    B = group A by $0 partition by YourPartitioner parallel 2;
  • Greatly expanded string and math built-in UDFs
  • Performance Improvements
      • Automatic merging of small files
      • Compression of intermediate results
  • Safety Features
      • Parallel set automatically when not specified
      • Monitor your UDF by annotating it with @MonitoredUDF. If it takes too long to return, Pig will kill it and return a default value instead.
  • PigUnit for unit testing your Pig Latin scripts

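The behaviour described for @MonitoredUDF, a time limit with a default value on expiry, can be approximated in Python as follows. This is only a sketch of the idea: `monitored` is a hypothetical decorator, and unlike the behaviour described on the slide it cannot kill the slow call, it just stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def monitored(timeout_s, default):
    """Return `default` if the wrapped function exceeds `timeout_s` seconds."""
    def wrap(func):
        def guarded(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=timeout_s)
            except TimeoutError:
                # The worker thread keeps running; only the wait is abandoned.
                return default
            finally:
                pool.shutdown(wait=False)
        return guarded
    return wrap


@monitored(timeout_s=0.05, default=-1)
def slow_square(num):
    time.sleep(0.3)  # simulate a UDF that takes too long
    return num * num


@monitored(timeout_s=1.0, default=-1)
def fast_square(num):
    return num * num


slow_result = slow_square(3)
fast_result = fast_square(3)
```

Returning a default keeps one pathological record from stalling a whole job, at the cost of silently substituting a value, so the default should be something downstream steps can recognise.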
Plus Even More I Probably Don’t Have Time to Talk About
  • New option for UNION to merge schemas
  • Map side COGROUP
  • DESCRIBE now works in nested FOREACH
  • Local shell commands can now be run from Grunt
  • Support for jars and scripts stored on dfs
  • Arbitrary jobconf key-value pairs can be set inside Pig Latin script using SET
  • Merge join extended
      • Support for more than two tables for inner join
      • Support for left, right, or full outer join for 2 tables
  • Significant memory improvements

What’s Next?
  • Preview of Pig 0.9
      • Integrate Pig with scripting languages for control flow
      • Add macros to Pig Latin
      • Revive ILLUSTRATE
      • Fix most runtime type errors
      • Rewrite parser to give useful error messages
  • Programming Pig from O’Reilly Press

Acknowledgements
  • Much of the content of this talk was taken from Dmitriy Ryaboy’s very nice summary of features in Pig 0.8: http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/
  • The Pig team, for writing and testing all this code, including the many contributors outside Yahoo who contributed significantly to this release

January 2011 HUG: Pig Presentation


Editor's Notes

  1. You can’t yet inline Python functions in a Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.
  2. Before 0.8 this was hard in Pig because you could not reuse the result of a Pig Latin operation in another operation without joining them, even if the result was a scalar value.