Slide 1© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com
Big Data Analytics using Pig
Slide 2
Scope of PPT – BIG Data Analytics via PIG
ᗍ Introduction to Big Data and Hadoop
ᗍ Introduction to Pig
ᗍ Hadoop Pig Architecture
ᗍ BIG Data Analytics via Pig
ᗍ BIG Data & Hadoop Job Trends
ᗍ BIG Data & Hadoop Course Syllabus
Get Started with BIG Data & Hadoop
Slide 3
Big Data and its Challenges
Slide 4
Big Data and its Challenges
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
Managing data at this scale is very difficult.
Slide 5
Who Generates Big Data?
Have you ever wondered how Google, Facebook or LinkedIn manage to store and use such huge volumes of data?
Today, managing such big data is becoming a problem for all of us.
Slide 6
Hadoop makes it easier to process data at this scale.
We will see how.
Before that, let's understand what Hadoop is.
Slide 7
Hadoop and its Characteristics
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management technology with scale-out storage and distributed processing.
Hadoop Characteristics
ᗍ Flexible
ᗍ Reliable
ᗍ Economical
ᗍ Scalable
Slide 8
Hadoop Ecosystem
[Diagram: components of the Hadoop ecosystem]
ᗍ Import or export: Flume (unstructured or semi-structured data), Sqoop (structured data)
ᗍ Workflow: Apache Oozie
ᗍ Data analysis: Pig Latin
ᗍ DW system: Hive
ᗍ HBase
ᗍ MapReduce framework and other YARN frameworks (MPI, Giraph)
ᗍ Cluster resource management: YARN
ᗍ HDFS (Hadoop Distributed File System)
Slide 9
Need for Pig
ᗍ Java is not a preferred language for many data analysts
ᗍ About 200 lines of Java code correspond to roughly 10 lines of Pig code
ᗍ Many built-in operators are available for common data operations such as join, grouping and filtering
Slide 10
Where to use Pig?
Pig is a data flow language, which makes it most suitable for:
ᗍ Quickly changing data processing requirements
ᗍ Processing data from multiple channels
ᗍ Quick hypothesis testing
ᗍ Time-sensitive data refreshes
ᗍ Data profiling using sampling
Slide 11
What is Pig?
ᗍ Pig is an open-source data flow platform whose language is Pig Latin
ᗍ Pig Latin is used to express queries and data manipulation operations in simple scripts
ᗍ Pig converts the scripts into a sequence of underlying MapReduce jobs
Slide 12
Let’s internalize Pig
Let's find the users who, overall, visit highly ranked pages.

Visits
User    URL                  Time
John    www.cbn.com          7:00
John    www.trap.com         7:05
John    www.myblog.com       9:00
John    www.flickr.com       9:05
Linda   cnn.com/index.htm    11:00

Pages
Page URL          Page Rank
www.cbn.com       0.9
www.flickr.com    0.9
www.myblog.com    0.6
www.trap.com      0.3
Slide 13
Internalizing Pig
[Diagram: data flow for the query]
1. Load Visits (user, url, time)
2. Load Pages (url, pagerank)
3. Join Visits and Pages on url = url
4. Group by user
5. Compute the average pagerank for each user
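The load, join, group and average steps of this data flow can be sketched in plain Python to make the logic concrete. This is only an illustration of what the flow computes, not Pig Latin and not how Pig executes it; the function name `average_pagerank_by_user` is hypothetical, and the rows are the sample Visits and Pages data from the previous slide.

```python
from collections import defaultdict

# Sample rows copied from the Visits and Pages tables on the previous slide.
visits = [
    ("John", "www.cbn.com", "7:00"),
    ("John", "www.trap.com", "7:05"),
    ("John", "www.myblog.com", "9:00"),
    ("John", "www.flickr.com", "9:05"),
    ("Linda", "cnn.com/index.htm", "11:00"),
]
pages = [
    ("www.cbn.com", 0.9),
    ("www.flickr.com", 0.9),
    ("www.myblog.com", 0.6),
    ("www.trap.com", 0.3),
]


def average_pagerank_by_user(visits, pages):
    """Join visits to pages on url, group by user, average the pagerank."""
    rank_by_url = dict(pages)              # lookup table for the join
    ranks_by_user = defaultdict(list)      # group by user
    for user, url, _time in visits:
        if url in rank_by_url:             # inner join: unmatched urls are dropped
            ranks_by_user[user].append(rank_by_url[url])
    return {user: sum(ranks) / len(ranks) for user, ranks in ranks_by_user.items()}


averages = average_pagerank_by_user(visits, pages)
for user, avg in averages.items():
    print(user, round(avg, 3))
```

With this sample data only John appears in the result, because Linda's URL has no matching row in Pages and an inner join drops it.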
Slide 14
Pig in Industry
Since Pig is a data flow language, it is a natural fit for:
ᗍ Data factory operations
ᗍ Typically, data is brought from multiple servers into HDFS
ᗍ Pig is used for cleaning and preprocessing the data
ᗍ It helps data analysts and researchers quickly prototype their theories
ᗍ Since Pig is extensible, it becomes much easier for data analysts to run their scripting language programs (such as Ruby or Python programs) effectively against large data sets
Slide 15
Ways to Handle Pig
ᗍ Grunt Mode:
• The interactive mode of Pig
• Very useful for syntax checking and ad hoc data exploration
ᗍ Script Mode:
• Runs a set of instructions from a file
• Similar to a SQL script file
ᗍ Embedded Mode:
• Executes Pig programs from a Java program
• Suitable for creating Pig scripts on the fly
Slide 16
Modes of Pig
Each of the Pig invocation styles can run in either of the following modes:
Local
ᗍ In this mode, the entire Pig job runs as a single JVM process
ᗍ Reads and stores data on the local file system
MapReduce
ᗍ In this mode, the Pig job runs as a series of MapReduce jobs
ᗍ Input and output paths are assumed to be HDFS paths
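As a rough illustration of how the execution mode is selected when Pig is launched, the sketch below builds the `pig` command line for each mode. The `-x local` and `-x mapreduce` options are Pig's execution-type flags; the helper `pig_command` and the script name `visits.pig` are hypothetical, and the code only constructs the argument list rather than running Pig.

```python
# Execution modes named on the slide; other modes are deliberately left out.
VALID_MODES = ("local", "mapreduce")


def pig_command(mode, script=None):
    """Return the argument list for launching Pig in the given mode.

    With a script path the command runs that script (script mode);
    without one it starts the interactive Grunt shell.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    command = ["pig", "-x", mode]
    if script is not None:
        command.append(script)  # hypothetical script name supplied by the caller
    return command


# Script mode against the local file system.
print(" ".join(pig_command("local", "visits.pig")))
# Grunt shell against a Hadoop cluster (paths resolve to HDFS).
print(" ".join(pig_command("mapreduce")))
```

Passing the resulting list to a process launcher would start Pig, provided that Pig is installed and on the PATH.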
Slide 17
Pig Components
1. Pig Data Flows
ᗍ Pig Latin is used to express data flows
2. Execution Environments
ᗍ Distributed execution on a Hadoop cluster
ᗍ Local execution in a single JVM
Slide 18
Pig Programs Execution
ᗍ Pig is a wrapper on top of the MapReduce layer
ᗍ It parses, optimizes and converts the Pig script into a series of MapReduce jobs
[Diagram: Pig turns the transformations into a series of MapReduce jobs]
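To give a feel for what one of those generated MapReduce jobs does, here is a minimal in-memory simulation of the map, shuffle and reduce phases for a group-by-user average. It is a teaching sketch under stated assumptions, not Pig's actual planner or Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the input is assumed to be (user, pagerank) pairs already produced by an earlier join.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, value) pair per record; here the key is the user."""
    for user, pagerank in records:
        yield user, pagerank


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce each key's values to a single result: the average pagerank."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Assumed output of an earlier join step: (user, pagerank) pairs.
joined = [("John", 0.9), ("John", 0.3), ("John", 0.6), ("John", 0.9)]
result = reduce_phase(shuffle(map_phase(joined)))
print({user: round(avg, 3) for user, avg in result.items()})
```

A script with several grouping or joining steps would typically need more than one such job, which is why the slide speaks of a series of MapReduce jobs.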
Slide 19
Job Trends – Hadoop
Slide 20
Why SkillSpeed?
ᗍ Course curriculum from industry experts
ᗍ Instructor-led live virtual sessions
ᗍ Lifetime access to course content via LMS
ᗍ 100% placement assistance
ᗍ 24x7 support
Slide 21
Course Topics
Module 1: Introduction to Big Data and Hadoop
Module 2: HDFS Internals, Hadoop Configurations and Data Loading
Module 3: Introduction to MapReduce
Module 4: Advanced MapReduce Concepts
Module 5: Introduction to Pig
Module 6: Advanced Pig and Introduction to Hive
Module 7: Advanced Hive Concepts
Module 8: Extending Hive and HBase Introduction
Module 9: Advanced HBase and Oozie Introduction
Module 10: Project Set-up Discussion
Slide 22
Corporate Partners
Slide 23
Contact Us
Lines open 24/7
To know more about the course, please contact:
IND +91-90660-20904
USA 1866-607-6547 (Toll Free)
Or reach us at sales@skillspeed.com
Introduction to Pig | Pig Architecture | Pig Fundamentals


Editor's Notes

  • #21 SkillSpeed offers virtual instructor-led courses designed to bridge the time-to-competency gap experienced by technology companies. The USP of SkillSpeed is the subject matter expert (SME). SMEs are industry experts with a good understanding of the technology and hands-on industry experience. These industry experts design, develop and deliver the course. SkillSpeed provides you: course curriculum from industry experts; instructor-led live virtual sessions; real-life industry case studies; live virtual interactions with industry experts; lifetime access to all course content via the LMS; 24x7 support; 100% placement assistance.