This slide deck is used as an introduction to the Apache Pig system and the Pig Latin high-level programming language, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
2. Apache Pig
Apache Pig
See also the 4 segments on Pig on coursera:
https://www.coursera.org/course/datasci
3. Apache Pig Introduction
Introduction
Collection and analysis of enormous datasets is at the heart
of innovation in many organizations
E.g.: web crawls, search logs, click streams
Manual inspection before batch processing
Very often engineers look for exploitable trends in their data to drive
the design of more sophisticated techniques
This is difficult to do in practice, given the sheer size of the datasets
The MapReduce model has its own limitations
One input
Two-stage, two operators
Rigid data-flow
4. Apache Pig Introduction
MapReduce limitations
Very often, tricky workarounds are required [1]
This is very often exemplified by the difficulty in performing JOIN
operations
Custom code required even for basic operations
Projection and Filtering need to be “rewritten” for each job (illustrated in the sketch below)
→ Code is difficult to reuse and maintain
→ Semantics of the analysis task are obscured
→ Optimizations are difficult due to opacity of Map and Reduce
[1] The term workaround should not be understood as purely negative.
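For contrast, a minimal Pig Latin sketch of projection and filtering as built-in operators rather than per-job custom code; the input file and schema are illustrative assumptions, not part of the slides:
-- Illustrative input; the file name and schema are assumptions
records = LOAD 'records.txt' AS (url, category, pagerank:float);
-- Filtering: a single built-in operator instead of custom map code
good = FILTER records BY pagerank > 0.2;
-- Projection: keep only the fields of interest
projected = FOREACH good GENERATE url, category;
DUMP projected;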
5. Apache Pig Introduction
Use Cases
Rollup aggregates
Compute aggregates against user activity logs, web crawls,
etc.
Example: compute the frequency of search terms aggregated over
days, weeks, and months (see the sketch below)
Example: compute frequency of search terms aggregated over
geographical location, based on IP addresses
Requirements
Successive aggregations
Joins followed by aggregations
Pig vs. OLAP systems
Datasets are too big
Data curation is too costly
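A minimal Pig Latin sketch of the rollup example above; the query-log file and its schema are hypothetical, assumed only to make the example concrete:
-- Hypothetical query log: one (userid, query, day) record per search
queries = LOAD 'query_log.txt' AS (userid, query, day);
-- Frequency of each search term, aggregated per day
grouped = GROUP queries BY (query, day);
freq = FOREACH grouped GENERATE FLATTEN(group) AS (query, day), COUNT(queries) AS cnt;
-- A further GROUP over freq would give weekly/monthly rollups (successive aggregations)
STORE freq INTO 'term_frequency_by_day';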
6. Apache Pig Introduction
Use Cases
Temporal Analysis
Study how search query distributions change over time
Correlation of search queries from two distinct time periods (groups)
Custom processing of the queries in each correlation group
Pig supports operators that minimize the memory footprint (see the sketch below)
Instead, in an RDBMS such operations typically involve JOINs over
very large datasets that do not fit in memory and thus become slow
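A minimal sketch of such a correlation using COGROUP, which gathers the two groups side by side without materializing a full join result; the input files and schemas are assumptions:
-- Hypothetical query logs for two time periods; names are assumptions
q1 = LOAD 'queries_period1.txt' AS (userid, query);
q2 = LOAD 'queries_period2.txt' AS (userid, query);
-- One bag per input relation for each query, instead of a flat join result
correlated = COGROUP q1 BY query, q2 BY query;
-- Custom per-group processing, e.g. comparing frequencies across periods
counts = FOREACH correlated GENERATE group AS query, COUNT(q1) AS period1, COUNT(q2) AS period2;
STORE counts INTO 'query_correlation';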
7. Apache Pig Introduction
Use Cases
Session Analysis
Study sequences of page views and clicks
Examples of typical aggregates
Average length of user session
Number of links clicked by a user before leaving a website
Click pattern variations in time
Pig supports advanced data structures and UDFs (see the sketch below)
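A minimal sketch of a session-style aggregate that exploits nested bags and a user-defined function; the click log, its schema, and the SessionLength UDF are hypothetical:
-- Hypothetical click log and UDF jar; both are assumptions
REGISTER 'sessionudfs.jar';
clicks = LOAD 'click_log.txt' AS (userid, url, ts:long);
by_user = GROUP clicks BY userid;
-- Each group carries a nested bag of that user's clicks;
-- the hypothetical UDF reduces it to a session length
sessions = FOREACH by_user GENERATE group AS userid, sessionudfs.SessionLength(clicks) AS len;
all_sessions = GROUP sessions ALL;
avg_len = FOREACH all_sessions GENERATE AVG(sessions.len);
STORE avg_len INTO 'avg_session_length';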
8. Apache Pig Overview
Pig Latin
Pig Latin, a high-level programming language initially
developed at Yahoo!, now at Hortonworks
Combines the best of both declarative and imperative worlds
High-level declarative querying in the spirit of SQL
Low-level, procedural programming à la MapReduce
Pig Latin features
Multi-valued, nested data structures instead of flat tables
Powerful data transformation primitives, including joins
Pig Latin program
Made up of a series of operations (or transformations)
Each operation is applied to input data and produces output data
→ A Pig Latin program describes a data flow
9. Apache Pig Overview
Example 1
Pig Latin premiere
Assume we have the following table:
urls: (url, category, pagerank)
Where:
url: is the url of a web page
category: corresponds to a pre-defined category for the web page
pagerank: is the numerical value of the PageRank associated with a
web page
→ Find, for each sufficiently large category, the average page rank of
high-pagerank urls in that category
10. Apache Pig Overview
Example 1
SQL
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
11. Apache Pig Overview
Example 1
Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE
category, AVG(good_urls.pagerank);
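The snippet above assumes urls is already defined; a minimal sketch of a complete, runnable script under that assumption (the file names are illustrative, and the last alias is renamed because output is a reserved word in Pig Latin):
-- Illustrative file names; 10^6 is written out since exponent notation is not Pig syntax
urls = LOAD 'urls.txt' AS (url, category, pagerank:float);
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
-- After GROUP, the grouping key is referred to as 'group'
result = FOREACH big_groups GENERATE group AS category, AVG(good_urls.pagerank);
STORE result INTO 'avg_pagerank_by_category';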
12. Apache Pig Overview
Example 2
User data in one file, website data in another
Find the top 5 most visited sites
Group by users aged in the range (18, 25)
The corresponding data flow:
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
13. Apache Pig Overview
Example 2: in MapReduce
In MapReduce
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;
public class MRExample {
public static class LoadPages extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String key = line.substring(0, firstComma);
String value = line.substring(firstComma + 1);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("1" + value);
oc.collect(outKey, outVal);
}
}
public static class LoadAndFilterUsers extends MapReduceBase
implements Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable k, Text val,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// Pull the key out
String line = val.toString();
int firstComma = line.indexOf(',');
String value = line.substring(firstComma + 1);
int age = Integer.parseInt(value);
if (age < 18 || age > 25) return;
String key = line.substring(0, firstComma);
Text outKey = new Text(key);
// Prepend an index to the value so we know which file
// it came from.
Text outVal = new Text("2" + value);
oc.collect(outKey, outVal);
}
}
public static class Join extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text key,
Iterator<Text> iter,
OutputCollector<Text, Text> oc,
Reporter reporter) throws IOException {
// For each value, figure out which file it's from and
// store it accordingly.
List<String> first = new ArrayList<String>();
List<String> second = new ArrayList<String>();
while (iter.hasNext()) {
Text t = iter.next();
String value = t.toString();
if (value.charAt(0) == '1')
first.add(value.substring(1));
else second.add(value.substring(1));
reporter.setStatus("OK");
}
// Do the cross product and collect the values
for (String s1 : first) {
for (String s2 : second) {
String outval = key + "," + s1 + "," + s2;
oc.collect(null, new Text(outval));
reporter.setStatus("OK");
}
}
}
}
public static class LoadJoined extends MapReduceBase
implements Mapper<Text, Text, Text, LongWritable> {
public void map(
Text k,
Text val,
OutputCollector<Text, LongWritable> oc,
Reporter reporter) throws IOException {
// Find the url
String line = val.toString();
int firstComma = line.indexOf(',');
int secondComma = line.indexOf(',', firstComma + 1);
String key = line.substring(firstComma + 1, secondComma);
// drop the rest of the record, I don't need it anymore,
// just pass a 1 for the combiner/reducer to sum instead.
Text outKey = new Text(key);
oc.collect(outKey, new LongWritable(1L));
}
}
public static class ReduceUrls extends MapReduceBase
implements Reducer<Text, LongWritable, WritableComparable,
Writable> {
public void reduce(
Text key,
Iterator<LongWritable> iter,
OutputCollector<WritableComparable, Writable> oc,
Reporter reporter) throws IOException {
// Add up all the values we see
long sum = 0;
while (iter.hasNext()) {
sum += iter.next().get();
reporter.setStatus("OK");
}
oc.collect(key, new LongWritable(sum));
}
}
public static class LoadClicks extends MapReduceBase
implements Mapper<WritableComparable, Writable, LongWritable,
Text> {
public void map(
WritableComparable key,
Writable val,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
oc.collect((LongWritable)val, (Text)key);
}
}
public static class LimitClicks extends MapReduceBase
implements Reducer<LongWritable, Text, LongWritable, Text> {
int count = 0;
public void reduce(
LongWritable key,
Iterator<Text> iter,
OutputCollector<LongWritable, Text> oc,
Reporter reporter) throws IOException {
// Only output the first 100 records
while (count < 100 && iter.hasNext()) {
oc.collect(key, iter.next());
count++;
}
}
}
public static void main(String[] args) throws IOException {
JobConf lp = new JobConf(MRExample.class);
lp.setJobName("Load Pages");
lp.setInputFormat(TextInputFormat.class);
lp.setOutputKeyClass(Text.class);
lp.setOutputValueClass(Text.class);
lp.setMapperClass(LoadPages.class);
FileInputFormat.addInputPath(lp, new
Path("/user/gates/pages"));
FileOutputFormat.setOutputPath(lp,
new Path("/user/gates/tmp/indexed_pages"));
lp.setNumReduceTasks(0);
Job loadPages = new Job(lp);
JobConf lfu = new JobConf(MRExample.class);
lfu.setJobName("Load and Filter Users");
lfu.setInputFormat(TextInputFormat.class);
lfu.setOutputKeyClass(Text.class);
lfu.setOutputValueClass(Text.class);
lfu.setMapperClass(LoadAndFilterUsers.class);
FileInputFormat.addInputPath(lfu, new
Path("/user/gates/users"));
FileOutputFormat.setOutputPath(lfu,
new Path("/user/gates/tmp/filtered_users"));
lfu.setNumReduceTasks(0);
Job loadUsers = new Job(lfu);
JobConf join = new JobConf(MRExample.class);
join.setJobName("Join Users and Pages");
join.setInputFormat(KeyValueTextInputFormat.class);
join.setOutputKeyClass(Text.class);
join.setOutputValueClass(Text.class);
join.setMapperClass(IdentityMapper.class);
join.setReducerClass(Join.class);
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/indexed_pages"));
FileInputFormat.addInputPath(join, new
Path("/user/gates/tmp/filtered_users"));
FileOutputFormat.setOutputPath(join, new
Path("/user/gates/tmp/joined"));
join.setNumReduceTasks(50);
Job joinJob = new Job(join);
joinJob.addDependingJob(loadPages);
joinJob.addDependingJob(loadUsers);
JobConf group = new JobConf(MRExample.class);
group.setJobName("Group URLs");
group.setInputFormat(KeyValueTextInputFormat.class);
group.setOutputKeyClass(Text.class);
group.setOutputValueClass(LongWritable.class);
group.setOutputFormat(SequenceFileOutputFormat.class);
group.setMapperClass(LoadJoined.class);
group.setCombinerClass(ReduceUrls.class);
group.setReducerClass(ReduceUrls.class);
FileInputFormat.addInputPath(group, new
Path("/user/gates/tmp/joined"));
FileOutputFormat.setOutputPath(group, new
Path("/user/gates/tmp/grouped"));
group.setNumReduceTasks(50);
Job groupJob = new Job(group);
groupJob.addDependingJob(joinJob);
JobConf top100 = new JobConf(MRExample.class);
top100.setJobName("Top 100 sites");
top100.setInputFormat(SequenceFileInputFormat.class);
top100.setOutputKeyClass(LongWritable.class);
top100.setOutputValueClass(Text.class);
top100.setOutputFormat(SequenceFileOutputFormat.class);
top100.setMapperClass(LoadClicks.class);
top100.setCombinerClass(LimitClicks.class);
top100.setReducerClass(LimitClicks.class);
FileInputFormat.addInputPath(top100, new
Path("/user/gates/tmp/grouped"));
FileOutputFormat.setOutputPath(top100, new
Path("/user/gates/top100sitesforusers18to25"));
top100.setNumReduceTasks(1);
Job limit = new Job(top100);
limit.addDependingJob(groupJob);
JobControl jc = new JobControl("Find top 100 sites for users
18 to 25");
jc.addJob(loadPages);
jc.addJob(loadUsers);
jc.addJob(joinJob);
jc.addJob(groupJob);
jc.addJob(limit);
jc.run();
}
}
170 lines of code, 4 hours to write
Hundreds of lines of code; hours to write
Pietro Michiardi (Eurecom) High-level Programming Languages 13 / 78
14. Apache Pig Overview
Example 2: in Pig
Users = load ’users’ as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load ’pages’ as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into ’top5sites’;
Few lines of code; few minutes to write
Pietro Michiardi (Eurecom) High-level Programming Languages 14 / 78
15. Apache Pig Overview
Pig Execution environment
How do we go from Pig Latin to MapReduce?
The Pig system is in charge of this
Complex execution environment that interacts with Hadoop
MapReduce
→ The programmer focuses on the data and analysis
Pig Compiler
Pig Latin operators are translated into MapReduce code
NOTE: in some cases, hand-written MapReduce code performs
better
Pig Optimizer2
Pig Latin data flows undergo an (automatic) optimization phase3
These optimizations are borrowed from the RDBMS community
2
Currently, rule-based optimization only.
3
Optimizations can be selectively disabled.
Pietro Michiardi (Eurecom) High-level Programming Languages 15 / 78
16. Apache Pig Overview
Pig and Pig Latin
Pig is not an RDBMS!
This means it is not suitable for all data processing tasks
Designed for batch processing
Of course, since it compiles to MapReduce
Of course, since data is materialized as files on HDFS
NOT designed for random access
Query selectivity does not match that of an RDBMS
Full-scans oriented!
Pietro Michiardi (Eurecom) High-level Programming Languages 16 / 78
17. Apache Pig Overview
Comparison with RDBMS
It may seem that Pig Latin is similar to SQL
We’ll see several examples, operators, etc. that resemble SQL
statements
Data-flow vs. declarative programming language
Data-flow:
Step-by-step set of operations
Each operation is a single transformation
Declarative:
Set of constraints
Applied together to an input to generate output
→ With Pig Latin it’s like working at the query planner
Pietro Michiardi (Eurecom) High-level Programming Languages 17 / 78
18. Apache Pig Overview
Comparison with RDBMS
RDBMS store data in tables
Schema are predefined and strict
Tables are flat
Pig and Pig Latin work on more complex data structures
Schema can be defined at run-time for readability
Pigs eat anything!
UDF and streaming together with nested data structures make Pig
and Pig Latin more flexible
Pietro Michiardi (Eurecom) High-level Programming Languages 18 / 78
19. Apache Pig Features and Motivations
Dataflow Language
A Pig Latin program specifies a series of steps
Each step is a single, high level data transformation
Stylistically different from SQL
With reference to Example 1
The programmer supplies the order in which each operation will
be done
Consider the following snippet
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Pietro Michiardi (Eurecom) High-level Programming Languages 19 / 78
20. Apache Pig Features and Motivations
Dataflow Language
Data flow optimizations
Explicit sequences of operations can be overridden
Use of high-level, relational-algebra-style primitives (GROUP,
FILTER,...) allows using traditional RDBMS optimization
techniques
→ NOTE: it is necessary to check by hand whether such
optimizations are actually beneficial
Pig Latin allows Pig to perform optimizations that would
otherwise be a tedious manual exercise if done at the
MapReduce level
Pietro Michiardi (Eurecom) High-level Programming Languages 20 / 78
21. Apache Pig Features and Motivations
Quick Start and Interoperability
Data I/O is greatly simplified in Pig
No need to curate, bulk import, parse, apply schema, create
indexes that traditional RDBMS require
Standard and ad-hoc “readers” and “writers” facilitate the task of
ingesting and producing data in arbitrary formats
Pig can work with a wide range of other tools
Why do RDBMSs have stringent requirements?
To enable transactional consistency guarantees
To enable efficient point lookup (using physical indexes)
To enable data curation on behalf of the user
To enable other users to figure out what the data is, by studying the
schema
Pietro Michiardi (Eurecom) High-level Programming Languages 21 / 78
22. Apache Pig Features and Motivations
Quick Start and Interoperability
Why is Pig so flexible?
Supports read-only workloads
Supports scan-only workloads (no lookups)
→ No need for transactions nor indexes
Why is data curation not required?
Very often, Pig is used for ad-hoc data analysis
Work on temporary datasets, then throw them out!
→ Curation is overkill
Schemas are optional
Can apply one on the fly, at runtime
Can refer to fields using positional notation
E.g.: good_urls = FILTER urls BY $2 > 0.2
Pietro Michiardi (Eurecom) High-level Programming Languages 22 / 78
23. Apache Pig Features and Motivations
Nested Data Model
Easier for “programmers” to think of nested data structures
E.g.: capture information about positional occurrences of terms in a
collection of documents
Map<documentId, Set<positions>>
Instead, an RDBMS allows only flat tables
Only atomic fields as columns
Require normalization
From the example above: need to create two tables
term_info: (termId, termString, ...)
position_info: (termId, documentId, position)
→ Occurrence information obtained by joining on termId, and
grouping on termId, documentId
Pietro Michiardi (Eurecom) High-level Programming Languages 23 / 78
24. Apache Pig Features and Motivations
Nested Data Model
Fully nested data model (see also later in the presentation)
Allows complex, non-atomic data types
E.g.: set, map, tuple
Advantages of a nested data model
More natural than normalization
Data is often already stored in a nested fashion on disk
E.g.: a web crawler outputs for each crawled url, the set of outlinks
Separating this into normalized form implies the use of joins, which is
overkill for web-scale data
Nested data allows for an algebraic language
E.g.: each tuple output by GROUP has one non-atomic field, a nested
set of tuples from the same group
Nested data makes life easy when writing UDFs
Pietro Michiardi (Eurecom) High-level Programming Languages 24 / 78
25. Apache Pig Features and Motivations
User Defined Functions
Custom processing is often predominant
E.g.: users may be interested in performing natural language
stemming of a search term, or tagging urls as spam
All commands of Pig Latin can be customized
Grouping, filtering, joining, per-tuple processing
UDFs support the nested data model
Input and output can be non-atomic
Pietro Michiardi (Eurecom) High-level Programming Languages 25 / 78
26. Apache Pig Features and Motivations
Example 3
Continues from Example 1
Assume we want to find for each category, the top 10 urls according
to pagerank
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category,
top10(urls);
top10() is a UDF that accepts a set of urls (for each group at a
time)
it outputs a set containing the top 10 urls by pagerank for that
group
final output contains non-atomic fields
Pietro Michiardi (Eurecom) High-level Programming Languages 26 / 78
27. Apache Pig Features and Motivations
User Defined Functions
UDFs can be used in all Pig Latin constructs
Instead, in SQL, there are restrictions
Only scalar functions can be used in SELECT clauses
Only set-valued functions can appear in the FROM clause
Aggregation functions can only be applied to GROUP BY or
PARTITION BY
UDFs can be written in Java, Python and Javascript
With streaming, we can use also C/C++, Python, ...
Pietro Michiardi (Eurecom) High-level Programming Languages 27 / 78
28. Apache Pig Features and Motivations
Handling parallel execution
Pig and Pig Latin are geared towards parallel processing
Of course, the underlying execution engine is MapReduce
SPORK = Pig on Spark → the execution engine need not be
MapReduce
Pig Latin primitives are chosen such that they can be easily
parallelized
Non-equi joins, correlated sub-queries,... are not directly supported
Users may specify parallelization parameters at run time
Question: Can you specify the number of maps?
Question: Can you specify the number of reducers?
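As a hint for the questions above: the PARALLEL clause sets the number of reduce tasks for an operator, while the number of map tasks follows from the input splits and cannot be set from Pig Latin. A minimal sketch, reusing the relations of Example 3:
grouped = GROUP urls BY category PARALLEL 20;  -- 20 reduce tasks for this GROUP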
Pietro Michiardi (Eurecom) High-level Programming Languages 28 / 78
29. Apache Pig Features and Motivations
A note on Performance
But can it fly?
(Performance comparison figure not reproduced; src: Olston)
Pietro Michiardi (Eurecom) High-level Programming Languages 29 / 78
30. Apache Pig Pig Latin
Pig Latin
Pietro Michiardi (Eurecom) High-level Programming Languages 30 / 78
31. Apache Pig Pig Latin
Introduction
Not a complete reference to the Pig Latin language: refer to [1]
Here we cover some interesting/useful aspects
The focus here is on some language primitives
Optimizations are treated separately
How they can be implemented (in the underlying engine) is not
covered
Examples are taken from [2, 3]
Pietro Michiardi (Eurecom) High-level Programming Languages 31 / 78
32. Apache Pig Pig Latin
Data Model
Supports four types
Atom: contains a simple atomic value such as a string or a number,
e.g., ‘alice’
Tuple: sequence of fields, each can be of any data type, e.g.,
(‘alice’, ‘lakers’)
Bag: collection of tuples with possible duplicates. Flexible schema,
no need to have the same number and type of fields
{ (‘alice’, ‘lakers’), (‘alice’, (‘iPod’, ‘apple’)) }
The example shows that tuples can be nested
Pietro Michiardi (Eurecom) High-level Programming Languages 32 / 78
33. Apache Pig Pig Latin
Data Model
Supports four types
Map: collection of data items, where each item has an associated
key for lookup. The schema, as with bags, is flexible.
NOTE: keys are required to be data atoms, for efficient lookup.
[ ‘fan of’ → { (‘lakers’), (‘iPod’) },
‘age’ → 20 ]
The key ‘fan of’ is mapped to a bag containing two tuples
The key ‘age’ is mapped to an atom
Maps are useful to model datasets in which schema may be
dynamic (over time)
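In a Pig Latin script, a map field can be declared in the schema and looked up with the # operator. A minimal sketch with hypothetical file and field names:
users = LOAD 'users.dat' AS (name:chararray, info:map[]);
ages  = FOREACH users GENERATE name, info#'age';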
Pietro Michiardi (Eurecom) High-level Programming Languages 33 / 78
34. Apache Pig Pig Latin
Structure
Pig Latin programs are a sequence of steps
Can use an interactive shell (called grunt)
Can feed them as a “script”
Comments
In line: with a double hyphen (--)
C-style for longer comments (/* ... */)
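A minimal sketch of both comment styles (file and relation names are hypothetical):
/* Load the raw query log and
   keep only non-bot traffic */
queries = LOAD 'query_log.txt' AS (userId, queryString, timestamp);
real_queries = FILTER queries BY userId != 'bot'; -- in-line comment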
Reserved keywords
List of keywords that can’t be used as identifiers
Same old story as for any language
Pietro Michiardi (Eurecom) High-level Programming Languages 34 / 78
35. Apache Pig Pig Latin
Statements
As a Pig Latin program is executed, each statement is parsed
The interpreter builds a logical plan for every relational operation
The logical plan of each statement is added to that of the program
so far
Then the interpreter moves on to the next statement
IMPORTANT: No data processing takes place during
construction of logical plan → Lazy Evaluation
When the interpreter sees the first line of a program, it confirms that
it is syntactically and semantically correct
Then it adds it to the logical plan
It does not even check the existence of files, for data load
operations
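As a small illustration (hypothetical file name): the lines below only extend the logical plan; nothing is read, and a missing input file is only reported once a DUMP or STORE triggers execution.
A = LOAD 'does_not_exist.txt' AS (f1, f2); -- accepted, no file check yet
B = FILTER A BY f1 > 0;                    -- still only builds the plan
DUMP B;                                    -- only now is the plan compiled and run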
Pietro Michiardi (Eurecom) High-level Programming Languages 35 / 78
36. Apache Pig Pig Latin
Statements
→ It makes no sense to start any processing until the whole
flow is defined
Indeed, there are several optimizations that could make a program
more efficient (e.g., avoiding operating on data that is later
going to be filtered out)
The trigger for Pig to start execution are the DUMP and STORE
statements
It is only at this point that the logical plan is compiled into a physical
plan
How the physical plan is built
Pig prepares a series of MapReduce jobs
In Local mode, these are run locally on the JVM
In MapReduce mode, the jobs are sent to the Hadoop Cluster
IMPORTANT: The command EXPLAIN can be used to show the
MapReduce plan
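For example, assuming a relation named Smmd as in Example 2, the plan can be inspected from the grunt shell:
EXPLAIN Smmd;   -- shows the logical, physical and MapReduce plans for Smmd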
Pietro Michiardi (Eurecom) High-level Programming Languages 36 / 78
37. Apache Pig Pig Latin
Statements
Multi-query execution
There is a difference between DUMP and STORE
Apart from diagnostics and interactive mode, in batch mode STORE
allows for program/job optimizations
Main optimization objective: minimize I/O
Consider the following example:
A = LOAD ’input/pig/multiquery/A’;
B = FILTER A BY $1 == ’banana’;
C = FILTER A BY $1 != ’banana’;
STORE B INTO ’output/b’;
STORE C INTO ’output/c’;
Pietro Michiardi (Eurecom) High-level Programming Languages 37 / 78
38. Apache Pig Pig Latin
Statements
Multi-query execution
In the example, relations B and C are both derived from A
Naively, this means that at the first STORE operator the input should
be read
Then, at the second STORE operator, the input should be read again
Pig will run this as a single MapReduce job
Relation A is going to be read only once
Then, each relation B and C will be written to the output
Pietro Michiardi (Eurecom) High-level Programming Languages 38 / 78
39. Apache Pig Pig Latin
Expressions
An expression is something that is evaluated to yield a value
Lookup on [3] for documentation
Example tuple for illustration (from [2]):
t = ( ‘alice’, { (‘lakers’, 1), (‘iPod’, 2) }, [ ‘age’ → 20 ] )
Let the fields of tuple t be called f1, f2, f3
Expression Type         Example                          Value for t
Constant                ‘bob’                            Independent of t
Field by position       $0                               ‘alice’
Field by name           f3                               [ ‘age’ → 20 ]
Projection              f2.$0                            { (‘lakers’), (‘iPod’) }
Map Lookup              f3#‘age’                         20
Function Evaluation     SUM(f2.$1)                       1 + 2 = 3
Conditional Expression  f3#‘age’>18 ? ‘adult’ : ‘minor’  ‘adult’
Flattening              FLATTEN(f2)                      ‘lakers’, 1, ‘iPod’, 2
Table 1: Expressions in Pig Latin.
Pietro Michiardi (Eurecom) High-level Programming Languages 39 / 78
40. Apache Pig Pig Latin
Schemas
A relation in Pig may have an associated schema
This is optional
A schema gives the fields in the relations names and types
Use the command DESCRIBE to reveal the schema in use for a
relation
Schema declaration is flexible but reuse is awkward4
A set of queries over the same input data will often have the same
schema
This is sometimes hard to maintain (unlike Hive) as there is no
external component to maintain this association
HINT: you can write a custom load UDF that encapsulates the
schema
4
Current developments solve this problem: HCatalog. We will not cover this in this
course.
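A small sketch of declaring a schema at load time and inspecting it (file name and field types are illustrative; DESCRIBE output shown as a comment):
records = LOAD 'sample.txt' AS (year:int, temperature:int, quality:int);
DESCRIBE records;
-- records: {year: int, temperature: int, quality: int}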
Pietro Michiardi (Eurecom) High-level Programming Languages 40 / 78
41. Apache Pig Pig Latin
Validation and nulls
Pig does not have the same power to enforce constraints on
schemas at load time as an RDBMS
If a value cannot be cast to a type declared in the schema, then it
will be set to a null value
This also happens for corrupt files
A useful technique to partition input data to discern good and
bad records
Use the SPLIT operator
SPLIT records INTO good_records IF temperature is not null,
bad_records IF temperature is null;
Pietro Michiardi (Eurecom) High-level Programming Languages 41 / 78
42. Apache Pig Pig Latin
Other relevant information
Schema propagation and merging
How are schemas propagated to new relations?
Advanced, but important topic
User-Defined Functions
Use [3] for an introduction to designing UDFs
Pietro Michiardi (Eurecom) High-level Programming Languages 42 / 78
43. Apache Pig Pig Latin
Data Processing Operators
Loading and storing data
The first step in a Pig Latin program is to load data
Specifies what the input files are (e.g., CSV files)
Specifies how the file contents are to be deserialized
An input file is assumed to contain a sequence of tuples
Data loading is done with the LOAD command
queries = LOAD ‘query_log.txt’
USING myLoad()
AS (userId, queryString, timestamp);
Pietro Michiardi (Eurecom) High-level Programming Languages 43 / 78
44. Apache Pig Pig Latin
Data Processing Operators
Loading and storing data
The example above specifies the following:
The input file is query_log.txt
The input file should be converted into tuples using the custom
myLoad deserializer
The loaded tuples have three fields, specified by the schema
Optional parts
USING clause is optional: if not specified, the input file is assumed
to be plain text, tab-delimited
AS clause is optional: if not specified, fields must be referred to by
position instead of by name
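A sketch of the fully defaulted form (hypothetical file; tab-delimited plain text assumed, fields referenced by position):
queries = LOAD 'query_log.txt';
ids = FOREACH queries GENERATE $0;  -- first field; no field names available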
Pietro Michiardi (Eurecom) High-level Programming Languages 44 / 78
45. Apache Pig Pig Latin
Data Processing Operators
Loading and storing data
Return value of the LOAD command
Handle to a bag
This can be used by subsequent commands
→ bag handles are only logical
→ no file is actually read!
The command to write output to disk is STORE
It has similar semantics to the LOAD command
Pietro Michiardi (Eurecom) High-level Programming Languages 45 / 78
46. Apache Pig Pig Latin
Data Processing Operators
Loading and storing data: Example
A = LOAD ’myfile.txt’ USING PigStorage(’,’) AS
(f1,f2,f3);
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
Pietro Michiardi (Eurecom) High-level Programming Languages 46 / 78
47. Apache Pig Pig Latin
Data Processing Operators
Per-tuple processing
Once you have some data loaded into a relation, a possible
next step is, e.g., to filter it
This is done, e.g., to remove unwanted data
HINT: By filtering early in the processing pipeline, you minimize the
amount of data flowing through the system
A basic operation is to apply some processing over every
tuple of a data set
This is achieved with the FOREACH command
expanded_queries = FOREACH queries GENERATE
userId, expandQuery(queryString);
Pietro Michiardi (Eurecom) High-level Programming Languages 47 / 78
48. Apache Pig Pig Latin
Data Processing Operators
Per-tuple processing
Comments on the example above:
Each tuple of the bag queries should be processed independently
The second field of the output is the result of a UDF
Semantics of the FOREACH command
There can be no dependence between the processing of different
input tuples
→ This allows for an efficient parallel implementation
Semantics of the GENERATE clause
Followed by a list of expressions
Also flattening is allowed
This is done to eliminate nesting in data
→ This makes output data independent for further parallel
processing
→ Useful to store data on disk
Pietro Michiardi (Eurecom) High-level Programming Languages 48 / 78
49. Apache Pig Pig Latin
Data Processing Operators
Per-tuple processing: example
X = FOREACH A GENERATE f1, f2+f3;
Y = GROUP A BY f1;
Z = FOREACH Y GENERATE group, A.($1, $2);
A=
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
X=
<1, 5>
<4, 3>
<8, 7>
<4, 6>
<7, 7>
<8, 7>
Z=
<1, {<2, 3>}>
<4, {<2, 1>, <3, 3>}>
<7, {<2, 5>}>
<8, {<3, 4>, <4, 3>}>
Pietro Michiardi (Eurecom) High-level Programming Languages 49 / 78
50. Apache Pig Pig Latin
Data Processing Operators
Per-tuple processing: Discarding unwanted data
A common operation is to retain a portion of the input data
This is done with the FILTER command
real_queries = FILTER queries BY userId neq
‘bot’;
Filtering conditions involve a combination of expressions
Comparison operators
Logical connectors
UDF
Pietro Michiardi (Eurecom) High-level Programming Languages 50 / 78
51. Apache Pig Pig Latin
Data Processing Operators
Filtering: example
Y = FILTER A BY f1 == ’8’;
A=
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
Y=
<8, 3, 4>
<8, 4, 3>
Pietro Michiardi (Eurecom) High-level Programming Languages 51 / 78
52. Apache Pig Pig Latin
Data Processing Operators
Per-tuple processing: Streaming data
The STREAM operator allows transforming data in a relation
using an external program or script
This is possible because Hadoop MapReduce supports “streaming”
Example:
C = STREAM A THROUGH ‘cut -f 2’;
which uses the Unix cut command to extract the second field of
each tuple in A
The STREAM operator uses PigStorage to serialize and
deserialize relations to and from stdin/stdout
Can also provide a custom serializer/deserializer
Works well with python
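A custom command can also be defined first and shipped to the cluster nodes; a sketch assuming a hypothetical executable Python script my_script.py:
DEFINE my_filter `my_script.py` SHIP('my_script.py');
C = STREAM A THROUGH my_filter;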
Pietro Michiardi (Eurecom) High-level Programming Languages 52 / 78
53. Apache Pig Pig Latin
Data Processing Operators
Getting related data together
It is often necessary to group together tuples from one or
more data sets
We will explore several nuances of “grouping”
Pietro Michiardi (Eurecom) High-level Programming Languages 53 / 78
54. Apache Pig Pig Latin
Data Processing Operators
The GROUP operator
Sometimes, we want to operate on a single dataset
This is when you use the GROUP operator
Let’s continue from Example 3:
Assume we want to find the total revenue for each query string.
This writes as:
grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE
queryString, SUM(revenue.amount) AS totalRevenue;
Note that revenue.amount refers to a projection of the nested
bag in the tuples of grouped_revenue
Pietro Michiardi (Eurecom) High-level Programming Languages 54 / 78
55. Apache Pig Pig Latin
Data Processing Operators
GROUP ... BY ...: Example
X = GROUP A BY f1;
A=
<1, 2, 3>
<4, 2, 1>
<8, 3, 4>
<4, 3, 3>
<7, 2, 5>
<8, 4, 3>
X=
<1, {<1, 2, 3>}>
<4, {<4, 2, 1>, <4, 3, 3>}>
<7, {<7, 2, 5>}>
<8, {<8, 3, 4>, <8, 4, 3>}>
Pietro Michiardi (Eurecom) High-level Programming Languages 55 / 78
56. Apache Pig Pig Latin
Data Processing Operators
Getting related data together
Suppose we want to group together all search results data
and revenue data for the same query string
grouped_data = COGROUP results BY queryString,
revenue BY queryString;
Figure 2: COGROUP versus JOIN.
Pietro Michiardi (Eurecom) High-level Programming Languages 56 / 78
57. Apache Pig Pig Latin
Data Processing Operators
The COGROUP command
Output of a COGROUP contains one tuple for each group
First field (group) is the group identifier (the value of the
queryString)
Each of the next fields is a bag, one for each group being
co-grouped
Grouping can be performed according to UDFs
Next: a clarifying example
Pietro Michiardi (Eurecom) High-level Programming Languages 57 / 78
59. Apache Pig Pig Latin
Data Processing Operators
COGROUP vs JOIN
JOIN vs. COGROUP
They are equivalent: JOIN == COGROUP followed by a cross
product of the tuples in the nested bags
Example 3: Suppose we try to attribute search revenue to
search-results urls → compute monetary worth of each url
grouped_data = COGROUP results BY queryString,
revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE
FLATTEN(distributeRevenue(results, revenue));
Where distributeRevenue is a UDF that accepts search results
and revenue information for each query string, and outputs a bag of
urls and revenue attributed to them
Pietro Michiardi (Eurecom) High-level Programming Languages 59 / 78
60. Apache Pig Pig Latin
Data Processing Operators
COGROUP vs JOIN
More details on the UDF distributeRevenue
Attributes revenue from the top slot entirely to the first search result
The revenue from the side slot may be equally split among all
results
Let’s see how to do the same with a JOIN
JOIN the tables results and revenues by queryString
GROUP BY queryString
Apply a custom aggregation function
What happens behind the scenes
During the JOIN, the system computes the cross product of the
search and revenue information
Then the custom aggregation needs to undo this cross product,
because the UDF specifically requires so
Pietro Michiardi (Eurecom) High-level Programming Languages 60 / 78
61. Apache Pig Pig Latin
Data Processing Operators
COGROUP in details
The COGROUP statement conforms to an algebraic language
The operator carries out only the operation of grouping together
tuples into nested bags
The user can then decide whether to apply a (custom) aggregation
on those tuples or to cross-product them and obtain a JOIN
It is thanks to the nested data model that COGROUP is an
independent operation
Implementation details are tricky
Groups can be very large (and are redundant)
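As a sketch of the two options above, using the relations of Example 3:
grouped_data = COGROUP results BY queryString, revenue BY queryString;
-- Option 1: cross-product the nested bags, i.e. an equi-join
joined = FOREACH grouped_data GENERATE FLATTEN(results), FLATTEN(revenue);
-- Option 2: apply a (custom) aggregation per group instead
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));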
Pietro Michiardi (Eurecom) High-level Programming Languages 61 / 78
62. Apache Pig Pig Latin
Data Processing Operators
JOIN in Pig Latin
In many cases, the typical operation on two or more datasets
amounts to an equi-join
IMPORTANT NOTE: large datasets that are suitable to be analyzed
with Pig (and MapReduce) are generally not normalized
→ JOINs are used more infrequently in Pig Latin than they are in SQL
The syntax of a JOIN
join_result = JOIN results BY queryString,
revenue BY queryString;
This is a classic inner join (actually an equi-join), where each match
between the two relations corresponds to a row in the
join_result
Pietro Michiardi (Eurecom) High-level Programming Languages 62 / 78
63. Apache Pig Pig Latin
Data Processing Operators
JOIN in Pig Latin
JOINs lend themselves to optimization opportunities
Active development of several join flavors is on-going
Assume we join two datasets, one of which is considerably
smaller than the other
For instance, suppose a dataset fits in memory
Fragment replicate join
Syntax: append the clause USING ‘replicated’ to a JOIN
statement
Uses a distributed cache available in Hadoop
All mappers will have a copy of the small input
→ This is a Map-side join
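A sketch of the syntax, reusing the relation names of Example 2 and assuming the filtered users relation is the small one that fits in memory (the last relation listed is the one replicated to all mappers):
Jnd = JOIN Pages BY user, Fltrd BY name USING 'replicated';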
Pietro Michiardi (Eurecom) High-level Programming Languages 63 / 78
64. Apache Pig Pig Latin
Data Processing Operators
MapReduce in Pig Latin
It is trivial to express MapReduce programs in Pig Latin
This is achieved using GROUP and FOREACH statements
A map function operates on one input tuple at a time and outputs a
bag of key-value pairs
The reduce function operates on all values for a key at a time to
produce the final result
Example
map_result = FOREACH input GENERATE
FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
where map() and reduce() are UDFs
Pietro Michiardi (Eurecom) High-level Programming Languages 64 / 78
65. Apache Pig Pig Execution Engine
The Pig Execution
Engine
Pietro Michiardi (Eurecom) High-level Programming Languages 65 / 78
66. Apache Pig Pig Execution Engine
Pig Execution Engine
Pig Latin Programs are compiled into MapReduce jobs, and
executed using Hadoop5
Overview
How to build a logical plan for a Pig Latin program
How to compile the logical plan into a physical plan of MapReduce
jobs
Optimizations
5
Other execution engines are allowed, but require a lot of implementation effort.
Pietro Michiardi (Eurecom) High-level Programming Languages 66 / 78
67. Apache Pig Pig Execution Engine
Building a Logical Plan
As clients issue Pig Latin commands (interactive or batch
mode)
The Pig interpreter parses the commands
Then it verifies the validity of input files and bags (variables)
E.g.: if the command is c = COGROUP a BY ..., b BY ...;, it
verifies if a and b have already been defined
Pig builds a logical plan for every bag
When a new bag is defined by a command, the new logical plan is a
combination of the plans for the input and that of the current
command
Pietro Michiardi (Eurecom) High-level Programming Languages 67 / 78
68. Apache Pig Pig Execution Engine
Building a Logical Plan
No processing is carried out when constructing the logical
plans
Processing is triggered only by STORE or DUMP
At that point, the logical plan is compiled to a physical plan
Lazy execution model
Allows in-memory pipelining
File reordering
Various optimizations from the traditional RDBMS world
Pig is (potentially) platform independent
Parsing and logical plan construction are platform oblivious
Only the compiler is specific to Hadoop
Pietro Michiardi (Eurecom) High-level Programming Languages 68 / 78
69. Apache Pig Pig Execution Engine
Building the Physical Plan
Compilation of a logical plan into a physical plan is “simple”
MapReduce primitives allow a parallel GROUP BY
Map assigns keys for grouping
Reduce processes one group at a time (groups are processed in parallel)
How the compiler works
Converts each (CO)GROUP command in the logical plan into
distinct MapReduce jobs
Map function for (CO)GROUP command C initially assigns keys to
tuples based on the BY clause(s) of C
Reduce function is initially a no-op
Pietro Michiardi (Eurecom) High-level Programming Languages 69 / 78
70. Apache Pig Pig Execution Engine
Building the Physical Plan
Figure 3: Map-Reduce compilation of Pig Latin (figure not reproduced here).
MapReduce boundary is the COGROUP command
The sequence of FILTER and FOREACH operations from the LOAD to the
first COGROUP C1 is pushed into the Map function
The commands between two subsequent COGROUP commands Ci and Ci+1
can be pushed into:
the Reduce function of Ci
the Map function of Ci+1
Pietro Michiardi (Eurecom) High-level Programming Languages 70 / 78
71. Apache Pig Pig Execution Engine
Building the Physical Plan
Pig optimization for the physical plan
Among the two options outlined above, the first is preferred
Indeed, grouping is often followed by aggregation
→ reduces the amount of data to be materialized between jobs
COGROUP command with more than one input dataset
Map function appends an extra field to each tuple to identify the
dataset
Reduce function decodes this information and inserts tuple in the
appropriate nested bags for each group
Pietro Michiardi (Eurecom) High-level Programming Languages 71 / 78
72. Apache Pig Pig Execution Engine
Building the Physical Plan
How parallelism is achieved
For LOAD, this is inherited from operating over HDFS
For FILTER and FOREACH, this is automatic thanks to the MapReduce
framework
For (CO)GROUP, the SHUFFLE phase is used
A note on the ORDER command
Translated in two MapReduce jobs
First job: Samples the input to determine quantiles of the sort key
Second job: Range partitions the input according to quantiles,
followed by sorting in the reduce phase
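For reference, an ORDER statement that would trigger this two-job translation (reusing relation names from Example 2):
Srtd = ORDER Smmd BY clicks DESC;  -- sampling job + range-partitioned sort job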
Known overheads due to MapReduce inflexibility
Data materialization between jobs
Multiple inputs are not supported well
Pietro Michiardi (Eurecom) High-level Programming Languages 72 / 78
73. Apache Pig Pig Execution Engine
Summary
Pig Latin Program
→ Parser → Logical Plan
→ Query Plan Compiler → Cross-Job Optimizer
→ MapReduce Compiler → Physical Plan (MapReduce Program)
→ CLUSTER
(The original figure illustrates this on a dataflow over relations A(x,y)
and B(x,y) with FILTER, JOIN and a UDF producing the output.)
Pietro Michiardi (Eurecom) High-level Programming Languages 73 / 78
74. Apache Pig Pig Execution Engine
Single-program Optimizations
Logical optimizations: query plan
Early projection
Early filtering
Operator rewrites
Physical optimization: execution plan
Mapping of logical operations to MapReduce
Splitting logical operations in multiple physical ones
Join execution strategies
Pietro Michiardi (Eurecom) High-level Programming Languages 74 / 78
75. Apache Pig Pig Execution Engine
Efficiency measures
(CO)GROUP command places tuples of the same group in
nested bags
Bag materialization (I/O) can be avoided
This is also important due to memory constraints
Distributive or algebraic aggregation functions facilitate this task
What is an algebraic function?
Function that can be structured as a tree of sub-functions
Each leaf sub-function operates over a subset of the input data
→ If nodes in the tree achieve data reduction, then the system can
reduce materialization
Examples: COUNT, SUM, MIN, MAX, AVERAGE, ...
Pietro Michiardi (Eurecom) High-level Programming Languages 75 / 78
76. Apache Pig Pig Execution Engine
Efficiency measures
Pig compiler uses the combiner function of Hadoop
A special API for algebraic UDF is available
There are cases in which (CO)GROUP is inefficient
This happens with non-algebraic functions
Nested bags can be spilled to disk
Pig provides a disk-resident bag implementation
Features external sort algorithms
Features duplicate elimination
Pietro Michiardi (Eurecom) High-level Programming Languages 76 / 78
78. References
References I
[1] Pig wiki.
http://wiki.apache.org/pig/.
[2] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins.
Pig latin: A not-so-foreign language for data processing.
In Proc. of ACM SIGMOD, 2008.
[3] Tom White.
Hadoop: The Definitive Guide.
O’Reilly Media / Yahoo! Press, 2010.
Pietro Michiardi (Eurecom) High-level Programming Languages 78 / 78