This document provides an overview of Hadoop Distributed File System (HDFS), MapReduce, and Apache Pig. It describes how HDFS stores and replicates large files across clusters of machines for high throughput access. MapReduce is introduced as a programming model for processing large datasets in parallel. Word count is used as an example MapReduce job. Apache Pig is presented as a framework for analyzing large datasets with a higher level of abstraction than MapReduce. Finally, common HDFS commands and a sample Pig script are shown.
3. HDFS
What Problem Does HDFS Solve?
Storing large data: a single file can be as large as a petabyte
Store a single file across many machines in a cluster
Tolerate the failure of one or more nodes
Provide high-throughput parallel access to data
4. HDFS
HDFS does not work well for
Low-latency random access
Files that must be modified in place after they are written (HDFS is write-once, read-many)
5. HDFS
How does HDFS work?
Consider a large file, multiple GB in size
The file is divided into blocks
Each block is given an identifier
Block size is typically 64 MB
Blocks are kept on different machines
• This leads to higher throughput when reading the data
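You can see how a file has been split into blocks, and where each block lives, with the filesystem checker (a hedged example; the path is a placeholder):
• hdfs fsck <hdfs file location> -files -blocks -locations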
6. HDFS
How does HDFS work?
Replicate Blocks
• Fault tolerance, guard against data loss and corruption
• Default is 3-fold replication, but configurable per file
• Individual blocks are replicated
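Since replication is configurable per file, the replication factor of an existing file can be changed from the command line; a minimal example, assuming the file is already in HDFS (the path is a placeholder):
• hdfs dfs -setrep -w 2 <hdfs file location>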
7. HDFS Architecture
Namenode and Datanodes
One namenode, many datanodes
“Master-slave” architecture
Namenode stores metadata: file names, the directory tree, and the blocks that make up each file
Datanodes store the actual data blocks
8. MapReduce
Programming model for large-scale data processing
• First used in the context of “big data” in a system from Google: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
• The programmer describes the computation in two steps, Map and Reduce
9. MapReduce Example: Word Count
Problem: Count the number of occurrences of each word within a text corpus, and output the counts to a file
Input: a text corpus (say, all words from the New York Times archives), stored as a file in HDFS
Output: for each unique word in the corpus, the number of occurrences of the word
Idea: the Map step emits a (word, 1) pair for every word it reads; the framework groups these pairs by word, and the Reduce step sums the counts for each word
16. MapReduce Parallelization
Different Map steps can run in parallel
All Map steps must complete before any Reduce step begins, since a Reduce step needs every value produced for its key
Different Reduce steps can run in parallel
Parallelization of a MapReduce program is automatic
17. Apache Pig
Framework for large-scale data processing, at a higher level of abstraction than MapReduce.
Programs for processing large datasets can be written much faster than in raw MapReduce, as the sketch below shows
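For instance, the word-count job from the MapReduce example above fits in a few lines of Pig Latin. A minimal sketch, assuming the corpus is a plain-text file in HDFS (the input and output paths are placeholders):

-- Load the corpus; each record is one line of text
lines = LOAD '/user/demo/corpus.txt' AS (line:chararray);
-- "Map": split every line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- "Shuffle": group identical words together
grouped = GROUP words BY word;
-- "Reduce": count the occurrences of each word
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO '/user/demo/wordcount_out';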
20. Common HDFS commands
Commands start with hdfs dfs -… or hadoop fs -…
• See the contents of a folder:
• hdfs dfs -ls <location>
• Copy from the local machine to HDFS:
• First copy the required file to the local machine via WinSCP
• hdfs dfs -copyFromLocal <local machine location> <location in HDFS>
21. Common HDFS commands
• Copy to the local machine from HDFS (note: the HDFS source comes first):
• hdfs dfs -copyToLocal <location in HDFS> <local machine location>
• Then copy the required file from the local machine to your machine via WinSCP
22. Common HDFS commands
• Make a new directory in HDFS:
• hdfs dfs -mkdir <hdfs directory location>
• See the tail of a file in HDFS:
• hdfs dfs -tail <hdfs file location>
• See the top of a file in HDFS:
• hdfs dfs -cat <hdfs file location> | head -10
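Putting these together, a short end-to-end session (all paths and file names below are placeholder assumptions):
hdfs dfs -mkdir /user/demo/data
hdfs dfs -copyFromLocal ratings.csv /user/demo/data
hdfs dfs -ls /user/demo/data
hdfs dfs -tail /user/demo/data/ratings.csv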
23. Pig Script
A sample script on INRIX XD Data
INRIX XD data schema:
segment id, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
24. Pig Script
Problem: Count the number of occurrences of confidence score = 30 for any 10 segments in the June 23rd, 2016 INRIX XD data, and output the counts to a file
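A minimal sketch of such a script, assuming the data is a comma-separated file in HDFS with the schema above (the paths, file name, and field types are assumptions):

-- Load one day of INRIX XD data (the path is a placeholder)
data = LOAD '/user/demo/inrix_xd_2016-06-23.csv' USING PigStorage(',')
    AS (segment_id:chararray, speed:double, time:chararray, confidence_score:int,
        cvalue:double, avg_speed:double, reference_speed:double, traveltime:double);
-- Keep only the records with confidence score = 30
conf30 = FILTER data BY confidence_score == 30;
-- Group the matching records by segment and count them
by_segment = GROUP conf30 BY segment_id;
counts = FOREACH by_segment GENERATE group AS segment_id, COUNT(conf30) AS occurrences;
-- Keep any 10 segments and write the result to HDFS
first10 = LIMIT counts 10;
STORE first10 INTO '/user/demo/conf30_counts';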
25. Pig Script
Run the script:
pig -x tez <script location in local machine>
Store the output on the local machine:
hdfs dfs -getmerge <hdfs location> <local machine location>
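For example, with the sketch above saved as conf30.pig (the names below are placeholder assumptions):
pig -x tez /home/demo/conf30.pig
hdfs dfs -getmerge /user/demo/conf30_counts conf30_counts.txt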