HDFS, MapReduce and Apache Pig Tutorial
Pranamesh Chakraborty
Resources: CPRE-419 course at ISU (Large Scale Data Analysis)
Hadoop
Job                   Software available
Data storage          Hadoop Distributed File System (HDFS)
Parallel processing   MapReduce
Scripting, SQL        Pig, Hive
HDFS
What Problem Does HDFS Solve?
Storing Large Data
• A single file can be as large as a petabyte
• A single file is stored across many machines in a cluster
• Tolerant to the failure of one or more nodes
• High-throughput parallel access to data
HDFS
HDFS is not suited to
• Low-latency random access
• Files that need to be modified in place (HDFS files are written once and can only be appended to)
HDFS
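How does HDFS work?
• Consider a large file of multiple GB
• The file is divided into blocks
• Each block is given an identifier
• Block size is typically 64 MB (the default in Hadoop 1; Hadoop 2 and later default to 128 MB)
• Blocks are kept on different machines
  • This leads to higher data throughput, because blocks can be read in parallel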
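The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of how a file size maps to block offsets and lengths, assuming the 64 MB block size quoted on the slide; the function name `split_into_blocks` and the 200 MB file are hypothetical, and real HDFS assigns its own block identifiers.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering file_size bytes."""
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    blocks = []
    for block_id in range(num_blocks):
        offset = block_id * block_size
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
    return blocks


# A hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                     # number of blocks
print(blocks[-1][2] // (1024 * 1024))  # size of the last block in MB
```

Note that the last block only occupies as much space as the data it holds; it is not padded to the full block size.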
HDFS
How does HDFS work?
• Blocks are replicated
  • Provides fault tolerance and guards against data loss and corruption
  • Default is 3-fold replication, but this is configurable per file
  • Individual blocks are replicated, not whole files
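To see why replicating individual blocks gives fault tolerance, here is a minimal sketch that places each block on three distinct datanodes and then checks which blocks survive node failures. The round-robin placement and the datanode names are assumptions made for illustration; HDFS itself uses a rack-aware placement policy that this sketch does not model.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [b for b, nodes in placement.items() if set(nodes) - failed]


nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical datanode names
placement = place_replicas(4, nodes)
print(placement[0])
print(len(surviving_blocks(placement, ["dn1", "dn2"])))
```

With three replicas on distinct nodes, any two simultaneous node failures still leave at least one copy of every block.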
HDFS Architecture
Namenode and Datanodes
• One namenode, many datanodes
• “Master-slave” architecture
• The namenode stores metadata
• Datanodes store the actual data blocks
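The division of labour between namenode and datanodes can be illustrated with a toy in-memory model. Everything here (the class names, the `put` and `get` helpers, the 4-byte block size, the single replica per block) is a simplifying assumption for readability and is not the HDFS API; the point is only that the namenode holds the file-to-block mapping while the datanodes hold the bytes.

```python
class DataNode:
    """Stores the actual bytes of blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Stores only metadata, never file contents."""

    def __init__(self):
        self.files = {}      # file name -> ordered list of block ids
        self.locations = {}  # block id -> name of the datanode holding it


def put(namenode, datanodes, filename, data, block_size=4):
    """Client-side write: blocks go to datanodes, metadata to the namenode."""
    names = sorted(datanodes)
    block_ids = []
    for offset in range(0, len(data), block_size):
        block_id = len(namenode.locations)
        target = names[block_id % len(names)]
        datanodes[target].blocks[block_id] = data[offset:offset + block_size]
        namenode.locations[block_id] = target
        block_ids.append(block_id)
    namenode.files[filename] = block_ids


def get(namenode, datanodes, filename):
    """Client-side read: look up block locations, then fetch from datanodes."""
    return b"".join(
        datanodes[namenode.locations[b]].blocks[b]
        for b in namenode.files[filename]
    )


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode()
put(namenode, datanodes, "/demo.txt", b"hello hdfs!")
print(namenode.files["/demo.txt"])
print(get(namenode, datanodes, "/demo.txt"))
```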
MapReduce
Programming model for large-scale data processing
• Introduced for “big data” workloads by Google in the paper “MapReduce: Simplified Data Processing on Large Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google Inc.
• The programmer describes the computation in two steps, Map and Reduce
MapReduce Example: Word Count
• Problem: Count the number of occurrences of each word within a text corpus, and write the counts to a file
• Input: a text corpus (say, all words from the New York Times archives), stored as a file in HDFS
• Output: for each unique word in the corpus, the number of occurrences of that word
MapReduce Example: Word Count
Always think in terms of (key, value) pairs
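As a rough, single-machine sketch of the (key, value) view, word count can be written as a map step that emits (word, 1) pairs, a shuffle that groups values by key, and a reduce step that sums each group. The function names and the three-line stand-in corpus are assumptions for illustration; on a cluster the framework performs the shuffle, and this sketch does not strip punctuation.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(word, counts):
    """Reduce: sum the values collected for one word."""
    return word, sum(counts)


lines = ["the quick brown fox", "the lazy dog", "The fox"]  # stand-in corpus
mapped = [pair for line in lines for pair in map_step(line)]
word_counts = dict(reduce_step(w, c) for w, c in shuffle(mapped).items())
print(word_counts["the"], word_counts["fox"])
```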
MapReduce Parallelization
• Different Map steps can run in parallel
• All Map steps must complete before any Reduce step begins
• Different Reduce steps can run in parallel
• The framework parallelizes a MapReduce program automatically
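The ordering constraints above can be mimicked on one machine with a thread pool. This is a sketch under stated assumptions, not how Hadoop schedules tasks: threads stand in for cluster nodes, each chunk stands in for one input split, and each map task pre-aggregates its own chunk (the role a combiner plays). Collecting all map results into a list before partitioning acts as the barrier between the Map and Reduce phases.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """One map task: count the words in its own chunk of lines."""
    return Counter(word for line in chunk for word in line.split())


def partition(partials, num_reducers):
    """Route each word to exactly one reducer by hashing the key."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for word, count in partial.items():
            buckets[hash(word) % num_reducers].append((word, count))
    return buckets


def reduce_task(bucket):
    """One reduce task: sum the partial counts for the words routed to it."""
    totals = Counter()
    for word, count in bucket:
        totals[word] += count
    return totals


def word_count(chunks, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # list() waits for every map task, so no reduce task starts early.
        partials = list(pool.map(map_task, chunks))
        buckets = partition(partials, num_reducers)
        reduced = list(pool.map(reduce_task, buckets))
    result = Counter()
    for totals in reduced:
        result.update(totals)
    return dict(result)


chunks = [["a b a"], ["b c"], ["a c c"]]  # each chunk is one input split
print(sorted(word_count(chunks).items()))
```

Because every occurrence of a given word is routed to the same reducer, the reduce tasks never need to coordinate with each other, which is what allows them to run in parallel.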
Apache Pig
• Framework for large-scale data processing, at a higher level of abstraction than MapReduce
• Programs that process large datasets can be written faster in Pig than in raw MapReduce
Apache Pig
Resources:
• Reference manual: https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html
• Built-in functions: https://pig.apache.org/docs/r0.11.1/func.html
Common hdfs commands
Commands start with hdfs dfs -… or hadoop fs -…
• See the contents of a folder:
  hdfs dfs -ls <location>
• Copy from the local machine (the machine where you run the hdfs commands) to HDFS:
  • First copy the required file to the local machine via WinSCP
  • hdfs dfs -copyFromLocal <local machine location> <location in HDFS>
Common hdfs commands
• Copy from HDFS to the local machine (the source comes first, then the destination):
  • hdfs dfs -copyToLocal <location in HDFS> <local machine location>
  • Then copy the required file from the local machine to your own machine via WinSCP
Common hdfs commands
• Make a new directory in HDFS:
  hdfs dfs -mkdir <hdfs directory location>
• See the tail of a file in HDFS (the last kilobyte):
  hdfs dfs -tail <hdfs file location>
• See the top of a file in HDFS:
  hdfs dfs -cat <hdfs file name> | head -10
Pig Script
A sample script on INRIX XD data
INRIX XD data schema:
code, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
Pig Script
INRIX XD data schema:
segment id, speed, time, confidence score, cvalue, avg speed, reference speed, traveltime
• Problem: For any 10 segments, count the records with confidence score = 30 in the INRIX XD data for June 23, 2016, and write the counts to a file
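The logic of this problem can be sketched in plain Python before writing it in Pig: filter the records to the chosen segments and to confidence score 30, then group by segment and count, which corresponds to Pig's FILTER, GROUP and COUNT. The comma-separated layout, the field names, the segment ids and the sample rows below are all assumptions made for illustration; they are not real INRIX data, and the input is assumed to contain only the records for the day of interest.

```python
import csv
import io

# Field names follow the schema on the slide; the exact file layout is assumed.
FIELDS = ["segment_id", "speed", "time", "confidence_score",
          "cvalue", "avg_speed", "reference_speed", "traveltime"]


def count_confidence_30(rows, segments):
    """Count, per chosen segment, the records whose confidence score is 30."""
    counts = {segment: 0 for segment in segments}
    for row in csv.DictReader(rows, fieldnames=FIELDS):
        # Keep only the chosen segments with confidence score 30.
        if row["segment_id"] in counts and int(row["confidence_score"]) == 30:
            counts[row["segment_id"]] += 1
    return counts


# Made-up records for illustration only.
sample = io.StringIO(
    "S1,55,2016-06-23 00:00:00,30,90,60,65,1.2\n"
    "S1,50,2016-06-23 00:01:00,20,80,60,65,1.3\n"
    "S2,40,2016-06-23 00:00:00,30,95,45,50,2.0\n"
    "S1,52,2016-06-23 00:02:00,30,85,60,65,1.25\n"
    "S3,70,2016-06-23 00:00:00,30,99,70,70,0.9\n"
)
print(count_confidence_30(sample, ["S1", "S2"]))
```

Segments that are selected but have no matching records still appear in the result with a count of 0, which a plain group-and-count would omit.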
Pig Script
Run the script:
pig -x tez <script location on the local machine>
Copy the output to the local machine:
hdfs dfs -getmerge <hdfs location> <local machine location>
