The document provides information about Pig Latin, the data processing language used with Apache Pig. It discusses Pig Latin basics like the data model, relations, tuples, and fields. It also covers Pig Latin statements, loading and storing data, data types, relational operations like group, join, cross, union, and diagnostic operators like dump and describe.
2. Pig Latin - Basics
• Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
• Pig Latin – Data Model
• The data model of Pig is fully nested.
• A Relation is the outermost structure of the Pig Latin data model. It is a bag, where −
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
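As a quick illustration of these terms (with made-up values), a field is a single value, a tuple is one record, and a bag is a collection of such records:
• Field: 'Rajiv'
• Tuple: (1,Rajiv,Hyderabad)
• Bag: { (1,Rajiv,Hyderabad), (2,Siddarth,Kolkata) }
• Relation: a bag of tuples, such as the result of a LOAD statement.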
3. Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
These statements work with relations. They include expressions and schemas.
Every statement ends with a semicolon (;).
We will perform various operations using operators provided by Pig Latin, through statements.
Except for LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.
As soon as you enter a LOAD statement in the Grunt shell, only its semantic checking is carried out. To see the contents of the relation, you need to use the Dump operator. Only after performing the dump operation is the MapReduce job that reads the data from the file system actually carried out.
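For instance, in the following hypothetical sequence (a sketch assuming a comma-separated file sample.txt), each statement takes a relation and produces a new one, and nothing is executed until Dump is issued:
grunt> sample = LOAD 'sample.txt' USING PigStorage(',') as (id:int, name:chararray, city:chararray);
grunt> hyd = FILTER sample BY city == 'Hyderabad';
grunt> Dump hyd;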
4. Loading Data using the LOAD Statement
• Example
• Given below is a Pig Latin statement, which loads data into Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
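Here student_data.txt is assumed to be a comma-separated text file whose records match the schema above, for example (illustrative values only):
001,Rajiv,Reddy,9848022337,Hyderabad
002,Siddarth,Battacharya,9848022338,Kolkata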
6. Pig Latin – Data types
Null Values
• Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does.
• A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
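As a small sketch (reusing the Student_data relation loaded earlier, and assuming some phone values are missing), nulls can be tested with IS NULL / IS NOT NULL:
grunt> missing_phone = FILTER Student_data BY phone IS NULL;
grunt> valid_phone = FILTER Student_data BY phone IS NOT NULL;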
11. Apache Pig - Reading Data
• In general, Apache Pig works on top of Hadoop.
• It is an analytical tool that analyzes large datasets that exist in the Hadoop File System.
• To analyze data using Apache Pig, we have to initially load the data into Apache Pig.
• Steps to load data into Pig using the LOAD function:
• Copy the local text file into HDFS, in a directory named pig_data (one way to do this is shown below).
• The input file contains data separated by ',' (comma).
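One way to perform this copy (assuming a standard Hadoop installation and a local file student_data.txt) is with the hdfs dfs shell commands:
$ hdfs dfs -mkdir /pig_data
$ hdfs dfs -put student_data.txt /pig_data/
$ hdfs dfs -cat /pig_data/student_data.txt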
12. Loading Data using LOAD Operator
• You can load data into Apache Pig from the file system (HDFS/Local) using the LOAD operator of Pig Latin.
Syntax
• The load statement consists of two parts divided by the "=" operator.
• On the left-hand side, we mention the name of the relation where we want to store the data, and on the right-hand side, we define how the data is loaded.
Given below is the syntax of the Load operator.
• Relation_name = LOAD 'Input file path' USING function as schema;
• Where,
• relation_name − We have to mention the relation in which we want to store the data.
• Input file path − We have to mention the HDFS directory where the file is stored (in MapReduce mode).
• function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
• Schema − We have to define the schema of the data. We can define the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
13. Loading Data using LOAD Operator
Start the Pig Grunt Shell
• First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.
$ pig -x mapreduce
Execute the Load Statement
• Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
15. Apache Pig - Storing Data
• We have learnt how to load data into Apache Pig. You can store the loaded data in the file system using the STORE operator. This chapter explains how to store data in Apache Pig using the STORE operator.
Syntax
• Given below is the syntax of the Store statement.
STORE Relation_name INTO 'required_directory_path' [USING function];
Example:
Now, let us store the relation student in the HDFS directory "/pig_Output/" as shown below.
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
We can verify the contents of the output file in HDFS using the cat command.
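For example (assuming the part file name written by a map-only job; the actual name may differ), the stored output can be listed and read back from HDFS:
$ hdfs dfs -ls hdfs://localhost:9000/pig_Output/
$ hdfs dfs -cat hdfs://localhost:9000/pig_Output/part-m-00000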
16. Apache Pig - Diagnostic Operators
• The LOAD statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the LOAD statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −
Dump operator
Describe operator
Explain operator
Illustrate operator
Dump Operator
• The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.
Syntax:
grunt> Dump Relation_Name
Example
• Assume we have a file student_data.txt in HDFS.
• And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
grunt> Dump student
• Once you execute the above Pig Latin statement, it will start a MapReduce job to read the data from HDFS.
• Note: Running the LOAD statement alone does not load the data into the relation student; the data is actually read only when the Dump statement is executed.
17. Apache Pig - Describe Operator
• The describe operator is used to view the schema of a relation.
Syntax:
grunt> Describe Relation_name
Example:
grunt> describe student;
• Where student is the relation name.
Output
• Once you execute the above Pig Latin statement, it will produce the following output.
grunt> student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }
18. Apache Pig - Explain Operator
• The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.
Syntax:
grunt> explain Relation_name;
Example:
grunt> explain student;
Output:
• It will produce the output shown in the attachment.
19. Apache Pig - Illustrate Operator
• The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax:
grunt> illustrate Relation_name;
Example:
• Assume we have a relation student.
grunt> illustrate student;
Output:
• On executing the above statement, you will get the sample output shown on the slide.
20. Apache Pig - Group Operator
• The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax:
grunt> Group_data = GROUP Relation_name BY key_column;
Example:
• Assume we have a relation named student_details with student details like id, name, age etc.
• Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Verification:
• Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output:
• You will get output displaying the contents of the relation named group_data. Here you can observe that the resulting schema has two columns −
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples, i.e. the student records with the respective age.
You can see the schema of the relation after grouping the data using the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int, student_details: {(id: int, firstname: chararray, lastname: chararray, age: int, phone: chararray, city: chararray)}}
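As a rough illustration (with made-up student records), each tuple of group_data pairs an age with a bag of all student_details tuples having that age:
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,Siddarth,Battacharya,22,9848022338,Kolkata)})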
21. Group Multiple columns
• In the same way, you can get a sample illustration of the schema using the illustrate command as shown below.
grunt> illustrate group_data;
Output:
Grouping by Multiple Columns:
We can also group the data using multiple columns, as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
Group All:
• We can also group a relation as a whole, placing all its records into a single group, as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;
22. Apache Pig - Cogroup Operator
• The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup:
• Assume that we have two relations, namely student_details and employee_details, in Pig.
• Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
• Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
23. COGROUP Operator
Output
• It will produce the following output, displaying the contents of the relation named cogroup_data.
• The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.
• For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
• In case a relation doesn't have tuples with the age value 21, it returns an empty bag.
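A purely hypothetical sketch of such output tuples (the employee_details fields here are assumed, not taken from the slides): one group key followed by one bag per input relation, with an empty bag where a relation has no matching tuples:
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{(4,Preethi,21,Chennai)})
(25,{},{(6,Maggy,25,Chennai)})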
24. Apache Pig - Join Operator
• The JOIN operator is used to combine records from two or more relations.
• While performing a join operation, we declare one (or a group of) field(s) from each relation as keys.
• When these keys match, the two particular tuples are matched, else the records are dropped.
• Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
• Joins in Pig are similar to SQL joins. In Pig, we join relations where in SQL we join tables.
• Here we see only the syntaxes of the different joins in Pig.
Self-join:
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Example
• Let us perform a self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verification:
grunt> Dump customers3;
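Pig performs a self-join by loading the same data under two different aliases; customers1 and customers2 are assumed here to come from one hypothetical file customers.txt, for example:
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);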
25. JOINS
Inner Join:
Syntax:
• grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
• Let us perform an inner join operation on the two relations customers and orders as shown below.
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
Verification:
grunt> Dump customer_orders;
Left Outer Join:
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
• Let us perform a left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification:
grunt> Dump outer_left;
Right Outer Join:
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
• Let us perform a right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification:
grunt> Dump outer_right;
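These join examples assume two relations sharing a key: customers keyed by id and orders carrying a customer_id foreign key. A hypothetical setup (the schemas are illustrative, not from the slides) could be:
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);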
26. JOINS
Full Outer Join:
• grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example
• Let us perform a full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification:
grunt> Dump outer_full;
Using Multiple Keys:
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);
Example:
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Verification
grunt> Dump emp;
27. Apache Pig - Cross Operator
• The CROSS operator computes the cross-product of two or more relations.
Syntax:
• grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example:
• Assume that we have two Pig relations, namely customers and orders.
• Let us now get the cross-product of these two relations using the CROSS operator as shown below.
grunt> cross_data = CROSS customers, orders;
Verification:
grunt> Dump cross_data;
Output:
• It will produce the following output, displaying the contents of the relation cross_data.
• The output is the cross product: each row in the relation customers is paired with each row of orders.
28. Apache Pig - Union Operator
• The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.
Syntax:
• grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example:
• Assume that we have two relations, namely student1 and student2, containing the same number and same type of columns. The data in the relations is different.
• Let us now merge the contents of these two relations using the UNION operator as shown below.
grunt> student = UNION student1, student2;
Verification:
grunt> Dump student;
Output:
• It combines the records of both relations student1 and student2 into the relation student.
29. Apache Pig - Split Operator
• The SPLIT operator is used to split a relation into two or more relations.
Syntax:
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example:
• Assume that we have a relation named student_details.
• Let us now split the relation into two: one listing the students whose age is less than 23, and the other listing the students whose age lies between 22 and 25.
grunt> SPLIT student_details INTO student_details1 IF (age < 23), student_details2 IF (age > 22 AND age < 25);
Verification:
grunt> Dump student_details1;
grunt> Dump student_details2;
Output:
• The two Dump statements display the contents of the relations student_details1 and student_details2 respectively.
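Conceptually, SPLIT behaves like running several FILTER statements over the same input; the two relations above could also be produced as in this sketch:
grunt> student_details1 = FILTER student_details BY age < 23;
grunt> student_details2 = FILTER student_details BY (age > 22 AND age < 25);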
30. Apache Pig - Filter Operator
• The FILTER operator is used to select the required tuples from a
relation based on a condition.
Syntax:
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example:
• Assume that we have a relation named student_details.
• Let us now use the Filter operator to get the details of the students
who belong to the city Chennai.
filter_data = FILTER student_details BY city == 'Chennai';
Verification:
• grunt> Dump filter_data;
Output:
• It will produce the following output, displaying the contents of the relation filter_data.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
31. Apache Pig - Distinct
Operator
• The DISTINCT operator is used to remove redundant (duplicate)
tuples from a relation.
Syntax:
grunt> Relation_name2 = DISTINCT Relation_name1;
Example:
• Assume that we have a relation named student_details.
• Let us now remove the redundant (duplicate) tuples from the
relation named student_details using the DISTINCT operator, and
store it as another relation named distinct_data as shown below.
grunt> distinct_data = DISTINCT student_details;
Verification:
grunt> Dump distinct_data;
Output:
• The Dump operator displays distinct_data, which contains the distinct tuples of the relation student_details.
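Note that DISTINCT compares entire tuples. To remove duplicates with respect to only some columns, project those columns first with FOREACH and then apply DISTINCT, as in this sketch:
grunt> cities = FOREACH student_details GENERATE city;
grunt> distinct_cities = DISTINCT cities;
grunt> Dump distinct_cities;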
32. Apache Pig - Foreach
Operator
• The FOREACH operator is used to generate specified data transformations based on the column data.
• The name itself indicates that, for each element of a data bag, the respective action will be performed.
Syntax:
• grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
• Assume that we have a relation named student_details.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
• Let us now get the id, age, and city values of each student from the relation student_details and store it into
another relation named foreach_data using the foreach operator as shown below.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
Verification:
grunt> Dump foreach_data;
Output:
Dump operator displays the below data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
33. Apache Pig - Order By
• The ORDER BY operator is used to display the contents of a relation in a sorted
order based on one or more fields.
Syntax:
• grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
Example:
• Assume that we have a relation named student_details.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
Let us now sort the relation in descending order based on the age of the student and
store it into another relation named order_by_data using the ORDER BY operator as
shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
Verification:
grunt> Dump order_by_data;
Output:
Dump operator will produce the student_details data sorted by age in descending
order.
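ORDER BY can also sort on more than one field; a minimal sketch (the secondary sort on id is an assumption for illustration):
grunt> order_by_data = ORDER student_details BY age DESC, id ASC;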
34. Apache Pig - Limit Operator
• The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax:
• grunt> Result = LIMIT Relation_name number_of_tuples;
Example
• Assume that we have a relation named student_details, loaded as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray,
city:chararray);
grunt> limit_data = LIMIT student_details 4;
Verification:
grunt> Dump limit_data;
Output:
• The Dump operator produces the data of student_details limited to 4 tuples:
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
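LIMIT on its own does not guarantee which tuples are returned; to obtain a deterministic "top-N" result, it is commonly combined with ORDER BY, as in this sketch:
grunt> oldest_first = ORDER student_details BY age DESC;
grunt> top_four = LIMIT oldest_first 4;
grunt> Dump top_four;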
35. Apache Pig - Load & Store
Functions
• The Load and Store functions in Apache Pig determine how data comes into and goes out of Pig. These
functions are used with the LOAD and STORE operators. Commonly used functions include PigStorage
(delimited text files, the default used throughout these examples), TextLoader (plain text lines), and
BinStorage (Pig's internal binary format).
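A brief sketch of loading and storing with PigStorage; the paths and schema below are hypothetical:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
grunt> STORE student INTO 'hdfs://localhost:9000/pig_output/student_out' USING PigStorage('\t');
-- TextLoader could be used instead of PigStorage to load each line as a single chararray field.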
44. Apache Pig - User Defined
Functions
• In addition to the built-in functions, Apache Pig provides
extensive support for User Defined Functions (UDF’s). Using
these UDF’s, we can define our own functions and use them.
The UDF support is provided in six programming languages,
namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
• For writing UDF’s, complete support is provided in Java and
limited support is provided in all the remaining languages.
Using Java, you can write UDF’s involving all parts of the
processing like data load/store, column transformation, and
aggregation. Since Apache Pig has been written in Java, the
UDF’s written using Java language work efficiently compared
to other languages.
• In Apache Pig, we also have a Java repository for UDF’s
named Piggybank. Using Piggybank, we can access Java
UDF’s written by other users, and contribute our own UDF’s.
45. Types of UDF’s in Java
• While writing UDF’s using Java, we can create and use
the following three types of functions −
• Filter Functions − The filter functions are used as
conditions in filter statements. These functions accept a
Pig value as input and return a Boolean value.
• Eval Functions − The Eval functions are used in
FOREACH-GENERATE statements. These functions
accept a Pig value as input and return a Pig result.
• Algebraic Functions − The Algebraic functions act on
inner bags in a FOREACH-GENERATE statement. These
functions are used to perform full MapReduce
operations on an inner bag.
• Eval-style UDFs (Filter, Eval, Algebraic) extend "org.apache.pig.EvalFunc".
• Each such function must override the "exec" method.
46. UDF Example
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
  // Returns the first field of the input tuple converted to upper case.
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw new IOException("Caught exception processing input row", e);
    }
  }
}
Create a jar of the above code as myudfs.jar.
Now write the script in a file and save it as .pig. Here I am using script.pig.
• -- script.pig
• REGISTER myudfs.jar;
• A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
• B = FOREACH A GENERATE myudfs.UPPER(name);
• DUMP B;
Finally run the script in the terminal to get the output.
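Optionally, the DEFINE statement can give the UDF a shorter alias; a sketch of the same script rewritten that way (the file name is hypothetical):
• -- script_with_define.pig
• REGISTER myudfs.jar;
• DEFINE Upper myudfs.UPPER();
• A = LOAD 'data' AS (name: chararray, age: int, gpa: float);
• B = FOREACH A GENERATE Upper(name);
• DUMP B;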
47. Apache Pig - Running Scripts
• We will see how to run Apache Pig scripts in batch
mode.
Comments in Pig Script
• While writing a script in a file, we can include
comments in it as shown below.
Multi-line comments
• We will begin the multi-line comments with '/*', end
them with '*/'.
/* These are multi-line comments
in the pig script */
Single-line comments
• We will begin the single-line comments with '--'.
--we can write single line comments like this.
48. Executing Pig Script in Batch
mode
Executing Pig Script in Batch mode:
• While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
• Write all the required Pig Latin statements in a single file. We can write all the Pig Latin
statements and commands in a single file and save it as .pig file.
Step 2
• Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown
below.
Local mode:
$ pig -x local Sample_script.pig
MapReduce mode:
$ pig -x mapreduce Sample_script.pig
• You can execute it from the Grunt shell as well using the exec command as shown below.
grunt> exec /sample_script.pig
• Executing a Pig Script from HDFS
• We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with
the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as
shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
49. Executing pig script from HDFS
Example:
• We have a sample script with the name sample_script.pig, in
the same HDFS directory. This file contains statements
performing operations and transformations on the student
relation, as shown below.
student = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
student_order = ORDER student BY age DESC;
student_limit = LIMIT student_order 4;
Dump student_limit;
• Let us now execute the sample_script.pig as shown below.
$ ./pig -x mapreduce
hdfs://localhost:9000/pig_data/sample_script.pig
• Apache Pig gets executed and gives you the output.
50. WORD COUNT EXAMPLE - PIG
SCRIPT
• How to find the number of occurrences of the words in a file using
the pig script?
Word Count Example Using Pig Script:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as
word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
The above Pig script first splits each line into words using the TOKENIZE function,
which produces a bag of words. The FLATTEN operator then un-nests that bag so that
each word becomes a separate tuple in the relation words. In the third statement the
words are grouped by word, and in the fourth statement the count of each group is
computed with COUNT. With just five lines of Pig, the word count problem is solved.
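To save the result instead of printing it, the final DUMP can be replaced with a STORE statement; a sketch with a hypothetical output directory:
STORE wordcount INTO '/user/hadoop/wordcount_output' USING PigStorage('\t');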