Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Apache Pig: Introduction, Description, Installation, Pig Latin Commands, Use, Examples, Usefulness are demonstrated in this presentation.
Tushar B. Kute
Researcher,
http://tusharkute.com
Apache HBase™ is the Hadoop database: a distributed, scalable big data store. It is a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open-source NoSQL database that provides real-time read/write access to large data sets. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
The presentation covers the following topics: 1) Hadoop Introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop best features 5) Hadoop characteristics. For further knowledge of Hadoop, refer to the link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. You will learn what Hadoop is, the components of Hadoop, what HDFS is, HDFS architecture, Hadoop MapReduce, a Hadoop MapReduce example, Hadoop YARN and, finally, see a demo on MapReduce. Apache Hadoop offers a versatile, adaptable and reliable distributed computing big data framework for a group of systems with capacity limit and local computing power. You will also come to understand the Hadoop Distributed File System and its features, along with a practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had... (Simplilearn)
This presentation about Hive will help you understand the history of Hive, what Hive is, Hive architecture, data flow in Hive, Hive data modeling, Hive data types, the different modes in which Hive can run, the differences between Hive and RDBMS, features of Hive, and a demo of HiveQL commands. Hive is a data warehouse system used for querying and analyzing large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL. Hive provides an SQL abstraction so that SQL-like queries (HiveQL) can be run without implementing them in the low-level Java API. Now, let us get started and understand Hadoop Hive in detail.
Below topics are explained in this Hive presentation:
1. History of Hive
2. What is Hive?
3. Architecture of Hive
4. Data flow in Hive
5. Hive data modeling
6. Hive data types
7. Different modes of Hive
8. Difference between Hive and RDBMS
9. Features of Hive
10. Demo on HiveQL
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
This Presentation is about NoSQL which means Not Only SQL. This presentation covers the aspects of using NoSQL for Big Data and the differences from RDBMS.
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LF3pBA
This CloudxLab Introduction to Pig & Pig Latin tutorial helps you to understand Pig and Pig Latin in detail. Below are the topics covered in this tutorial:
1) Introduction to Pig
2) Why Do We Need Pig?
3) Pig - Usecases
4) Pig - Philosophy
5) Pig Latin - Data Flow Language
6) Pig - Local and MapReduce Mode
7) Pig Data Types
8) Load, Store, and Dump in Pig
9) Lazy Evaluation in Pig
10) Pig - Relational Operators - FOREACH, GROUP and FILTER
11) Hands-on on Pig - Calculate Average Dividend of NYSE
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was originally developed by Facebook.
MongoDB is an open-source document database and a leading NoSQL database, written in C++.
MongoDB has official drivers for a variety of popular programming languages and development environments. There are also a large number of unofficial or community-supported drivers for other programming languages and frameworks.
2. PIG Latin
• Pig Latin is a data flow language used for exploring large data sets.
• Rapid development
• No Java is required.
• It is a high-level platform for creating MapReduce programs used with Hadoop.
• Pig was originally developed at Yahoo Research around 2006 to give researchers an ad-hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.
• Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name!
3. What Pig Does
Pig was designed for performing a long series of data operations,
making it ideal for three categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
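To make the ETL use case concrete, here is a minimal Pig Latin sketch; the file paths and field names are illustrative assumptions, not taken from the slides.
-- load raw log data, keep valid rows, aggregate, and store the result
raw    = LOAD '/data/raw_logs' USING PigStorage(',') AS (user:chararray, url:chararray, bytes:int);
valid  = FILTER raw BY user IS NOT NULL;
by_url = GROUP valid BY url;
stats  = FOREACH by_url GENERATE group AS url, SUM(valid.bytes) AS total_bytes;
STORE stats INTO '/data/log_summary';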
4. Features of PIG
• Provides support for data types – long, float, chararray, schemas
and functions
• Is extensible and supports User Defined Functions
• Schema not mandatory, but used when available
• Provides common operations like JOIN, GROUP, FILTER, SORT
5. When not to use PIG
• Really nasty data formats or completely unstructured data.
– Video Files
– Audio Files
– Image Files
– Raw human readable text
• PIG is slow compared to Map-Reduce
• When you need more power to optimize code.
8. Install PIG
• To install Pig:
• Untar the .gz file using tar -xvzf pig-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
• export PIG_HADOOP_VERSION=20
(Specifies the version of Hadoop that is running)
• export HADOOP_HOME=/home/(user-name)/hadoop-0.20.2
(Specifies the installation directory of Hadoop to the environment variable HADOOP_HOME. Typically defined as /home/user-name/hadoop-version)
• export PIG_CLASSPATH=$HADOOP_HOME/conf
(Specifies the class path for Pig)
• export PATH=$PATH:/home/user-name/pig-0.13.0/bin
(For setting the PATH variable)
• export JAVA_HOME=/usr
(Specifies the Java home to the environment variable.)
9. PIG Modes
• Pig in local mode
– No HDFS is required; all files run on the local file system.
– Command: pig -x local
• Pig in MapReduce (Hadoop) mode
– To run Pig scripts in MapReduce mode, ensure you have access to HDFS. By default, Pig starts in MapReduce mode.
– Command: pig -x mapreduce, or simply pig
10. PIG Program Structure
• Grunt shell or interactive mode
– Grunt is an interactive shell for running Pig commands.
• Pig scripts or batch mode
– Pig can run a script file that contains Pig commands.
– E.g. pig script.pig
11. Introducing data types
• Data type is a data storage format that can contain a specific type or
range of values.
– Scalar types
• Sample: int, long, double, chararray, bytearray
– Complex types
• Sample: Atom, Tuple, Bag, Map
12. • Users can declare data types at load time as below:
– A = LOAD 'test.data' USING PigStorage(',') AS (sno:chararray, name:chararray, marks:long);
• If a data type is not declared but the script treats a value as a certain type, Pig will assume it is of that type and cast it:
– A = LOAD 'test.data' USING PigStorage(',') AS (sno, name, marks);
– B = FOREACH A GENERATE marks * 100; -- marks cast to long
14. Data types continued…
A relation can be defined as follows:
• A field/atom is a piece of data.
Ex: 12.5 or hello world
• A tuple is an ordered set of fields.
Ex: (12.5, hello world, -2)
It is most often used as a row in a relation.
It is represented by fields separated by commas, enclosed in parentheses.
15. • A bag is a collection of tuples.
Bag: {(12.5, hello world, -2), (2.87, bye world, 10)}
A bag is an unordered collection of tuples.
A bag is represented by tuples separated by commas, all enclosed in curly braces.
• Map [key#value]
A map is a set of key/value pairs.
Keys must be unique and must be a string (chararray).
The value can be any type. (The literal forms of these complex types are sketched below.)
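As a quick reference, the literal forms of these complex types and a hypothetical schema declaration (the file name and field names are assumptions for illustration) look roughly like this:
-- tuple literal:  (12.5, hello world, -2)
-- bag literal:    {(12.5, hello world, -2), (2.87, bye world, 10)}
-- map literal:    [name#John, age#18]
A = LOAD 'complex.data' AS (t:tuple(f1:double, f2:chararray, f3:int),
                            b:bag{row:tuple(f1:double, f2:chararray, f3:int)},
                            m:map[]);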
16. In short…
Relations, Bags, Tuples, Fields
Pig Latin statements work with relations, A relation can be defined as
follows:
• A relation is a bag (more specifically, an outer bag).
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
17. PIG Latin Statements
• A Pig Latin statement is an operator that takes a relation as input and produces another relation as output.
• This definition applies to all Pig Latin operators except the LOAD and STORE commands, which read data from and write data to the file system.
• In Pig, when a data element is null it means it is unknown. Data of any type can be null. (See the example below.)
• Pig Latin statements can span multiple lines and must end with a semicolon ( ; ).
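For example, nulls are usually handled with an IS NULL / IS NOT NULL test; a small sketch, assuming a relation A with a gpa field as in the later student example:
-- keep only tuples where gpa is known; null means unknown
known = FILTER A BY gpa IS NOT NULL;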
18. PIG The programming language
• Pig Latin statements are generally organized in the following
manner:
– A LOAD statement reads data from the file system.
– A series of "transformation" statements process the data.
– A STORE statement writes output to the file system;
OR
– A DUMP statement displays output to the screen
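Putting these kinds of statements together, a complete (hypothetical) script follows that pattern:
A = LOAD 'input_data' USING PigStorage(',') AS (name:chararray, score:int);  -- read from the file system
B = FILTER A BY score > 50;                                                  -- transformation
C = GROUP B BY name;                                                         -- transformation
D = FOREACH C GENERATE group, COUNT(B);                                      -- transformation
STORE D INTO 'output_data';  -- write output to the file system
-- or: DUMP D;               -- display output on the screen instead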
19. MULTIQUERY EXECUTION
•Because DUMP is a diagnostic tool, it will always trigger execution.
However, the STORE command is different.
• In interactive mode, STORE acts like DUMP and will always trigger
execution (this includes the run command), but in batch mode it will not
(this includes the exec command).
•The reason for this is efficiency. In batch mode, Pig will parse the
whole script to see whether there are any optimizations that could be
made to limit the amount of data to be written to or read from disk.
20. Consider the following simple example:
• A = LOAD 'input/pig/multiquery/A';
• B = FILTER A BY $1 == 'banana';
• C = FILTER A BY $1 != 'banana';
• STORE B INTO 'output/b';
• STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice,
Pig can run this script as a single MapReduce job by reading A once
and writing two output files from the job, one for each of B and C. This
feature is called multiquery execution.
24. Logical vs. Physical Plan
When the Pig Latin interpreter sees the first line containing the LOAD
statement, it confirms that it is syntactically and semantically correct
and adds it to the logical plan, but it does not load the data from the file
(or even check whether the file exists).
The point is that it makes no sense to start any processing until the
whole flow is defined. Similarly, Pig validates the GROUP and
FOREACH…GENERATE statements, and adds them to the logical
plan without executing them. The trigger for Pig to start execution is the
DUMP statement. At that point, the logical plan is compiled into a
physical plan and executed.
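To inspect these plans without triggering execution, Pig's EXPLAIN operator can be applied to any defined relation; for instance, for a relation C such as the one in the multiquery example above:
grunt> EXPLAIN C;   -- prints the logical, physical, and MapReduce plans for C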
26. Create a sample file
John,18,4.0
Mary,19,3.8
Bill,20,3.9
Joe,18,3.8
Save it as “student.txt”
Move it to HDFS by using the command below.
hadoop fs -put <local path/filename> <hdfs path>
27. LOAD/DUMP/STORE
A = load 'student.txt' using PigStorage(',') AS (name:chararray, age:int, gpa:float);
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}
DUMP A;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
store A into '/hdfspath';
28. Group
Groups the data in one relation.
B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0),(Joe,18,3.8)})
(19,{(Mary,19,3.8)})
(20,{(Bill,20,3.9)})
29. Foreach…Generate
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
(18,2)
(19,1)
(20,1)
C = FOREACH B GENERATE $0, $1.name;
DUMP C;
(18,{(John),(Joe)})
(19,{(Mary)})
(20,{(Bill)})
30. Create Sample File
FileA.txt
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
Move it to HDFS by using the command below.
hadoop fs -put <localpath> <hdfspath>
31. Create another Sample File
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
Move it to HDFS by using the command below.
hadoop fs -put <localpath> <hdfspath>
32. Filter
Definition: Selects tuples from a relation based on some condition.
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
Examples
A = LOAD 'data' using PigStorage(',') AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
33. Co-Group
Definition: The GROUP and COGROUP operators are identical. For
readability GROUP is used in statements involving one relation and
COGROUP is used in statements involving two or more relations.
X = COGROUP A BY $0, B BY $0;
(1, {(1, 2, 3)}, {(1, 3)})
(2, {}, {(2, 4), (2, 7), (2, 9)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6),(4, 9)})
(7, {(7, 2, 5)}, {})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
•To see groups for which inputs have at least one tuple:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
FileA
1 2 3
4 2 1
8 3 4
4 3 3
7 2 5
8 4 3
FileB.txt
2 4
8 9
1 3
2 7
2 9
4 6
4 9
34. Flatten Operator
• Flatten un-nests tuples as well as bags.
• For tuples, flatten substitutes the fields of a tuple in place of the tuple.
• For example, consider a relation (a, (b, c)).
• GENERATE $0, flatten($1)
– (a, b, c).
• For bags, flatten substitutes bags with new tuples.
• For Example, consider a bag ({(b,c),(d,e)}).
• GENERATE flatten($0),
– will end up with two tuples (b,c) and (d,e).
• When we remove a level of nesting in a bag, sometimes we cause a cross product to
happen.
• For example, consider a relation (a, {(b,c), (d,e)})
• GENERATE $0, flatten($1),
– it will create new tuples: (a, b, c) and (a, d, e).
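As a concrete, runnable illustration (a sketch reusing the student relations A and B from slide 28), flattening the bag of names inside each group produces one row per student:
D = FOREACH B GENERATE group, FLATTEN(A.name);
DUMP D;
(18,John)
(18,Joe)
(19,Mary)
(20,Bill)
(The order of rows within a group is not guaranteed.)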
35. JOIN
Definition: Performs join of two or more relations based on common field values
Syntax:
X= JOIN A BY $0, B BY $0;
which is equivalent to:
X = COGROUP A BY $0 INNER, B BY $0 INNER;
Y = FOREACH X GENERATE FLATTEN(A), FLATTEN(B);
The result is:
(1, 2, 3, 1, 3)
(4, 2, 1, 4, 6)
(4, 3, 3, 4, 6)
(4, 2, 1, 4, 9)
(4, 3, 3, 4, 9)
(8, 3, 4, 8, 9)
(8, 4, 3, 8, 9)
(1, {(1, 2, 3)}, {(1, 3)})
(4, {(4, 2, 1), (4, 3, 3)}, {(4, 6), (4, 9)})
(8, {(8, 3, 4), (8, 4, 3)}, {(8, 9)})
37. CROSS
•Computes the cross product of two or more relations.
Example: X = CROSS A, B;
(1, 2, 3, 2, 4)
(1, 2, 3, 8, 9)
(1, 2, 3, 1, 3)
(1, 2, 3, 2, 7)
(1, 2, 3, 2, 9)
(1, 2, 3, 4, 6)
(1, 2, 3, 4, 9)
(4, 2, 1, 2, 4)
(4, 2, 1, 8, 9)
38. SPLIT
Partitions a relation into two or more relations.
Example: A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A; (1,2,3) (4,5,6) (7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3) (7,8,9)
39. Some more commands
• To select a few columns from one dataset
– S1 = foreach a generate a1, a2;
• Simple calculation on dataset
– K = foreach A generate $1, $2, $1*$2;
• To display only 100 records
– B = limit a 100;
• To see the structure/Schema
– Describe A;
• To Union two datasets
– C = UNION A,B;
40. Word Count Program
Create a basic wordsample.txt file and move it to HDFS.
x = load '/home/pgupta5/prashant/data.txt';
y = foreach x generate flatten(TOKENIZE((chararray)$0)) as word;
z = group y by word;
counter = foreach z generate group, COUNT(y);
store counter into '/NewPigData/WordCount';
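For instance, if the input file contained just the line "hello world hello", the stored result would be the per-word counts (in some order):
(hello,2)
(world,1)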
42. Another Example
i/p: webcount
en google.com 70 2012
en yahoo.com 60 2013
us google.com 80 2012
en google.com 40 2014
us google.com 80 2012
records = LOAD 'webcount' USING PigStorage('\t') AS (country:chararray, name:chararray, pagecount:int, year:int);
filtered_records = filter records by country == 'en';
grouped_records = group filtered_records by name;
results = foreach grouped_records generate group, SUM(filtered_records.pagecount);
sorted_result = order results by $1 desc;
store sorted_result into '/some_external_HDFS_location/data'; -- Hive external table path
43. Find Maximum Score
i/p: CricketScore.txt
a = load '/user/cloudera/SampleDataFile/CricketScore.txt' using PigStorage('\t');
b = foreach a generate $0, $1;
c = group b by $0;
d = foreach c generate group, MAX(b.$1);
dump d;
44. Sorting Data
Relations are unordered in Pig.
Consider a relation A:
• grunt> DUMP A;
• (2,3)
• (1,2)
• (2,4)
There is no guarantee which order the rows will be processed in. In particular, when
retrieving the contents of A using DUMP or STORE, the rows may be written in any
order. If you want to impose an order on the output, you can use the ORDER
operator to sort a relation by one or more fields.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
• (1,2)
• (2,4)
• (2,3)
Any further processing on a sorted relation is not guaranteed to retain its order.
45. Using Hive tables with HCatalog
• HCatalog (which is a component of Hive) provides access to Hive's metastore, so that Pig queries can reference Hive table schemas by name.
• For example, after loading data into a Hive table called records, Pig can access the table's schema and data as follows:
• pig -useHCatalog
• grunt> records = LOAD 'School_db.student_tbl' USING org.apache.hcatalog.pig.HCatLoader();
• grunt> DESCRIBE records;
• grunt> DUMP records;
46. PIG UDFs
Pig provides extensive support for user-defined functions (UDFs) to specify custom processing.
REGISTER - Registers the JAR file with the Pig runtime.
REGISTER myudfs.jar;
-- The JAR file should be available on the local Linux file system.
A = LOAD 'student_data' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;
47. UDF Sample Program
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
// A simple eval UDF that upper-cases its chararray input.
public class UPPER extends EvalFunc<String>
{
    public String exec(Tuple input) throws IOException {
        // Return null for empty or missing input tuples.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        }
        catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
• (Pig's Java UDF extends the functionality of EvalFunc.)
48. Diagnostic operator
• DESCRIBE: Prints a relation’s schema.
• EXPLAIN: Prints the logical and physical plans.
• ILLUSTRATE: Shows a sample execution of the logical
plan, using a generated subset of the input.
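A quick usage sketch, assuming the student relation A defined earlier:
grunt> DESCRIBE A;    -- A: {name: chararray,age: int,gpa: float}
grunt> EXPLAIN A;     -- prints the logical and physical plans for A
grunt> ILLUSTRATE A;  -- runs the plan on a small generated sample of the input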
49. Performance Tuning
Pig does not (yet) determine when a field is no longer needed and drop the field from the
row. For example, say you have a query like:
• Project Early and Often
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
• There is no need for v, y, or z to participate in this query. And there is no need to
carry both t and x past the join, just one will suffice. Changing the query above to the
query below will greatly reduce the amount of data being carried through the map and
reduce phases by pig.
– A = load 'myfile' as (t, u, v);
– A1 = foreach A generate t, u;
– B = load 'myotherfile' as (x, y, z);
– B1 = foreach B generate x;
– C = join A1 by t, B1 by x;
– C1 = foreach C generate t, u;
– D = group C1 by u;
– E = foreach D generate group, COUNT($1);
50. Performance Tuning
As with early projection, in most cases it is beneficial to apply filters as early as possible
to reduce the amount of data flowing through the pipeline.
• Filter Early and Often
-- Query 1
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = filter A by t == 1;
– D = join C by t, B by x;
– E = group D by u;
– F = foreach E generate group, COUNT($1);
-- Query 2
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
– D = group C by u;
– E = foreach D generate group, COUNT($1);
– F = filter E by C.t == 1;
• The first query is clearly more efficient than the second one because
it reduces the amount of data going into the join.
51. Performance Tuning
Often you are not interested in the entire output but rather a
sample or top results. In such cases, LIMIT can yield a
much better performance as we push the limit as high as
possible to minimize the amount of data travelling through
the pipeline.
• Use the LIMIT Operator
– A = load 'myfile' as (t, u, v);
– B = order A by t;
– C = limit B 500;
52. Performance Tuning
If types are not specified in the load statement, Pig assumes the type of double for numeric computations. A lot of the time your data would fit in a much smaller type, such as integer or long. Specifying the real type will help with the speed of arithmetic computation.
• Use Types
– --Query 1
• A = load 'myfile' as (t, u, v);
• B = foreach A generate t + u;
– --Query 2
• A = load 'myfile' as (t: int, u: int, v);
• B = foreach A generate t + u;
• The second query will run more efficiently than the first. In some of our queries we have seen a 2x speedup.
53. Performance Tuning
• Use Joins appropriately.
– Understand Skewed Vs. Replicated vs. Merge join.
• Remove null values before join.
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C = join A by t, B by x;
• is rewritten by Pig to
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– C1 = cogroup A by t INNER, B by x INNER;
– C = foreach C1 generate flatten(A), flatten(B);
Since the nulls from A and B won't be collected together,
when the nulls are flattened we're guaranteed to have an
empty bag, which will result in no output. But they will not
be dropped until the last possible moment.
54. Performance Tuning
• Hence the previous query should be rewritten as
– A = load 'myfile' as (t, u, v);
– B = load 'myotherfile' as (x, y, z);
– A1 = filter A by t is not null;
– B1 = filter B by x is not null;
– C = join A1 by t, B1 by x;
Now nulls will be dropped before the join. Since all null
keys go to a single reducer, if your key is null even a small
percentage of the time the gain can be significant.
55. Performance Tuning
• You can set the number of reduce tasks for the
MapReduce jobs generated by Pig using parallel
reducer feature.
– set default parallel command is used at the script
level.
• In this example, all the MapReduce jobs that get launched use 20 reducers.
– SET default_parallel 20;
– A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
– B = GROUP A BY t;
– C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
– D = ORDER C BY mycount;
– PARALLEL clause can be used with any operator like
group, cogroup, join, order by, distinct that starts
reduce phase.
56. Replicated Join
• One of the datasets is small enough that it fits in memory.
• A replicated join copies the small dataset to the distributed cache - space that is available on every cluster machine - and loads it into memory.
• Because the data is available in memory (the distributed cache), and is processed on the map side of MapReduce, this operation works much faster than a default join.
57. • Limitations
It isn’t clear how small the dataset needs to be for using replicated join.
According to the Pig documentation, a relation of up to 100 MB can
be used when the process has 1 GB of memory. A run-time error will
be generated if not enough memory is available for loading the data.
58. • transactions = load 'customer_transactions' as ( fname, lname, city,
state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district, manager);
Regular join
• sales = join transactions by (state, country), geography by (state,
country);
Replicated join
• sales = join transactions by (state, country), geography by (state, country) using 'replicated';
59. Skewed Join
• One of the keys is much more common than others, and the data for
it is too large to fit in the memory.
• Standard joins run in parallel across different reducers by splitting
key values across processes. If there is a lot of data for a certain
key, the data will not be distributed evenly across the reducers, and
one of them will be ‘stuck’ processing the majority of data.
• Skewed join handles this case. It calculates a histogram to check
which key is the most prevalent and then splits its data across
different reducers for optimal performance.
60. • transactions = load 'customer_transactions' as ( fname,
lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country, district,
manager);
• sales = join transactions by (state, country), geography
by (state, country) using 'skewed';
61. Merge Join
• The two datasets are both sorted in ascending order by the join key.
• Datasets may already be sorted by the join key if that’s the order in
which data was entered or they have undergone sorting before the
join operation for other needs.
• When merge join receives the pre-sorted datasets, they are read
and compared on the map side, and as a result they run faster. Both
inner and outer join are available.
62. • transactions = load 'customer_transactions' as (
fname, lname, city, state, country, amount, tax);
• geography = load 'geo_data' as (state, country,
district, manager);
• sales = join transactions by (state, country),
geography by (state, country) using 'merge';
Pig is made up of two components: the first is the language itself, which is called Pig Latin (yes, people naming various Hadoop projects do tend to have a sense of humor associated with their naming conventions), and the second is a runtime environment where Pig Latin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application.
As the example is written, this job requires both a Map and a Reduce phase to make the join work, which becomes increasingly inefficient as the customer data set grows in size. This is the exact scenario that is optimized by using a replicated join.
The replicated join tells Pig to distribute the geography set to each node, where it can be joined directly in the Map job, eliminating the need for the Reduce job altogether.
Skewed join supports both inner and outer join, though only with two inputs - joins between additional tables should be broken up into further joins. Also, there is a pig.skewedjoin.reduce.memusage Java parameter that specifies the heap fraction available to reducers in order to perform this join. Setting a low value means more reducers will be used, yet the cost of copying the data across them will increase. Pig's developers claim to see good performance when setting it between 0.1 and 0.4. A sketch of setting this property from a script is shown below.
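As a sketch (the 0.3 value is only an illustrative assumption; the right fraction depends on reducer heap size), the property can be set from a Pig script before the skewed join:
set pig.skewedjoin.reduce.memusage 0.3;   -- heap fraction reducers may use for the skewed join
sales = join transactions by (state, country), geography by (state, country) using 'skewed';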