SlideShare a Scribd company logo
1 of 30
Introduction to Hadoop and
MapReduce
Csaba Toth
Central California .NET User Group
Meeting
Date: March 19th, 2014
Location: Bitwise Industries, Fresno
Agenda
• Big Data
• Hadoop
• Map Reduce
• Demos:
– Exercises with on-premise Hadoop emulator
– Azure HDInsight
Big Data
• Wikipedia: “collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing applications”
• Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
– 100 Terabytes of data is uploaded to Facebook every day
– Facebook Stores, Processes, and Analyzes more than 30 Petabytes of
user generated data
– Twitter generates 12 Terabytes of data every day
– LinkedIn processes and mines Petabytes of user data to power the
"People You May Know" feature
– YouTube users upload 48 hours of new video content every minute of
the day
– Decoding of the human genome used to take 10 years. Now it can be
done in 7 days
Big Data characteristics
• Three Vs: Volume, Velocity, Variety
• Sources:
– Science, Sensors, Social networks, Log files
– Public Data Stores, Data warehouse appliances
– Network and in-stream monitoring technologies
– Legacy documents
• Main problems:
– Storage Problem
– Money Problem
– Consuming and processing the data
A Little History
Two Seminar papers:
• “The Google File System” - October 2003
http://labs.google.com/papers/gfs.html
– describes a scalable, distributed, fault-tolerant file system
tailored for data-intensive applications, running on inexpensive
commodity hardware, delivers high aggregate performance
• “MapReduce: Simplified Data Processing on Large Clusters”
- April 2004 http://queue.acm.org/detail.cfm?id=988408
– Describes a programming model and an implementation for
processing large data sets.
1. map function that processes a key/value pair to generate a set of
intermediate key/value pairs
2. reduce function that merges all intermediate values associated with
the same intermediate key
Hadoop vs RDBMS
Hadoop / MapReduce RDBMS
Size of data Petabytes Gigabytes
Integrity of data Low High (referential, typed)
Data schema Dynamic Static
Access method Interactive and Batch Batch
Scaling Linear Nonlinear (worse than
linear)
Data structure Unstructured Structured
Normalization of data Not Required Required
Query Response Time Has latency (due to batch
processing)
Can be near immediate
Hadoop
• Hadoop is an open-source software
framework that supports data-intensive
distributed applications.
• Has two main pieces:
– Storing large amounts of data: HDFS, Hadoop
Distributed File System
– Processing large amounts of data: implementation
of the MapReduce programming model
Hadoop
• All of this in a cost effective way: Hadoop is
managing a cluster of commodity hardware
computers.
• The cluster is composed of a single master
node and multiple worker nodes
• It is written in Java, utilizes JVMs
Scalable Storage – Hadoop Distributed
File System
• Store large amounts of data economically
• Scaling out instead of scaling up
– Each file is split into large blocks (64MB by default)
– Each block is replicated to multiple machines (nodes)
– A centralized meta-data store has information on the
blocks
– Each node can fail
• Allows storage of large files in fault-tolerant
distributed manner
Name node
HDFS visually
Metadata
Store
Data node Data node Data node
Node 1 Node 2
Block A Block B Block A Block B
Node 3
Block A Block B
Scalable Processing
• Task: word counting program
• Pseudo-code:
1. Open text file for read, read each line
2. Parse each line into words
3. Increment the count of each word in a dictionary
4. Close the file and output the summary
Scalable Processing
• Issues:
– Data storage
– Data distribution
– Parallelizable algorithm
– Fault tolerance
– Aggregation
– Storage of results
Scalable Processing
Issue Considered The Hadoop solution
Data storage HDFS
Data distribution—The system should be able
to distribute data to each of the processing
nodes.
Hadoop keeps data distribution between nodes
to a minimum. It instead moves processing
code to each node and processes the data
where it is already available on disk.
Parallelizable algorithm Hadoop uses the MapReduce programming
model to enable a scalable processing model.
Fault tolerance Hadoop monitors data storage nodes and will
add replicas as nodes become unavailable.
Hadoop also monitors tasks assigned to nodes
and will reassign if a node becomes
unavailable.
Aggregation This is accounted for in a distributed manner
through the Reduce stage
Storage of results HDFS
MapReduce
• Hadoop leverages the functional programming model
of map/reduce.
• Moves away from shared resources and related
synchronization and contention issues
• Thus inherently scalable and suitable for processing
large data sets, distributed computing on clusters of
computers/nodes.
• The goal of map reduce is to break huge data sets into
smaller pieces, distribute those pieces to various
worker nodes, and process the data in parallel.
• Hadoop leverages a distributed file system to store the
data on various nodes.
Name node
Heart beat signals and
communication
Job / task management
Jobtracker
Data node Data node Data node
Tasktracker Tasktracker
Map 1 Reduce 1 Map 2 Reduce 2
Tasktracker
Map 3 Reduce 3
MapReduce
• It is about two functions: map and reduce
1. Map Step:
– Processes a key/value pairs and generate a set of
intermediate key/value pairs form that
2. Shuffle step:
– Groups all intermediate values associated with the
same intermediate key into one set
3. Reduce Step:
– Processes the intermediate values associated with the
same intermediate key and produces a set of values
based on the groups (usually some kind of aggregate)
MapReduce – Map step, word count
• Input
– Key: simply the starting index of the text chunk within
the block being processed – ignored
– Value: Single line of text – the unit we want to process
• Output (implemented by user)
– Key: A word (of a line)
– Value: Multiplicity of the word – this will be 1 for each
encountered word (we want to SUM() these 1s
together for each word to get the total count later)
– For each input (line) we output a whole set of words,
each with the multiplicity of 1.
Map step, word count example
• Input to mapper
• Output by Mapper
Key Value
Some ignored offset number “Twinkle, Twinkle Little Star”
Key Value
Twinkle 1
Twinkle 1
Little 1
Star 1
MapReduce – Shuffle step, word count
• Once the Map stage is over, data collected
from the Mappers
• During the Shuffle stage, all values that have
the same key are collected and stored as a
conceptual list tied to the key under which
they were registered.
• Guarantees that data under a specific key will
be sent to exactly one reducer
Shuffle step, word count example
• Output by Shuffle
Key Value
Twinkle 1,1
Little 1
Star 1
MapRed – Reduce step, word count
• Input (inferred from map output)
– Key: A word
– Value: List of multiplicities
• Output (implemented by user)
– Key: A word
– Value: SUM() of the multiplicities, thus the
aggregate count
Reduce step, word count example
• Output by Reduce
Key Value
Twinkle 2
Little 1
Star 1
MapReduce – Reduce step
• Aggregates the answers and creates the
needed output, which is the answer to the
problem it was originally trying to solve.
• That is how the system can process petabytes
in a matter of hours:
– Mappers can preform the map phase in parallel
– Reducers can preform the reduction phase in
parallel
– Hadoop optimizes data and processing locality
Word count
http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
Map, Shuffle, and Reduce
https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Demo
• Simple implementation of word count
• Hortonworks HDInsight on-premise (single
node, pseudo cluster)
• Hadoop SDK, building from source or from
NuGet
• Mapper source
• Reducer source
Demo
• HDInsight on Azure
Hadoop architecture
Log Data RDBMS
Data Integration Layer
Flume Sqoop
Storage Layer (HDFS)
Computing Layer (MapReduce)
Advanced Query Engine (Hive, Pig)
Data Mining
(Pegasus,
Mahout)
Index,
Searches
(Lucene)
DB drivers
(Hive driver)
Web Browser (JS)Presentation
Layer
References
• Daniel Jebaraj: Ignore HDInsight at Your Own
Peril: Everything You Need to Know
• Tom White: Hadoop: The Definitive Guide, 3rd
Edition, Yahoo Press
• Bruno Terkaly’s presentations (for example
Hadoop on Azure: Introduction)
• Lynn Langit’s various presentations and
YouTube videos
Thanks for your attention!
• Next time:
– Pig and Hive
– Recommendation engine
– Data mining with Hadoop

More Related Content

What's hot

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtechJakir Hossain
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorialAamir Ameen
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataSharad Pandey
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)Xavier Constant
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoopch adnan
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basicChirag Ahuja
 

What's hot (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 

Similar to Hadoop and Mapreduce for .NET User Group

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
ENAR short course
ENAR short courseENAR short course
ENAR short courseDeepak Agarwal
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce ParadigmTarjMehta1
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptxSakthiVinoth78
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentalsLokesh Ramaswamy
 

Similar to Hadoop and Mapreduce for .NET User Group (20)

Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce Paradigm
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 

More from Csaba Toth

Git, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesGit, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesCsaba Toth
 
Eclipse RCP Demo
Eclipse RCP DemoEclipse RCP Demo
Eclipse RCP DemoCsaba Toth
 
The Health of Networks
The Health of NetworksThe Health of Networks
The Health of NetworksCsaba Toth
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQueryCsaba Toth
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Windows 10 preview
Windows 10 previewWindows 10 preview
Windows 10 previewCsaba Toth
 
Developing Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkDeveloping Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkCsaba Toth
 
Trends and future of java
Trends and future of javaTrends and future of java
Trends and future of javaCsaba Toth
 
Google Compute Engine
Google Compute EngineGoogle Compute Engine
Google Compute EngineCsaba Toth
 
Google App Engine
Google App EngineGoogle App Engine
Google App EngineCsaba Toth
 
Setting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteSetting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteCsaba Toth
 
CCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCsaba Toth
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineCsaba Toth
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User GroupCsaba Toth
 
Introduction into windows 8 application development
Introduction into windows 8 application developmentIntroduction into windows 8 application development
Introduction into windows 8 application developmentCsaba Toth
 
Ups and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingUps and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingCsaba Toth
 
Adopt a JSR NJUG edition
Adopt a JSR NJUG editionAdopt a JSR NJUG edition
Adopt a JSR NJUG editionCsaba Toth
 

More from Csaba Toth (17)

Git, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesGit, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websites
 
Eclipse RCP Demo
Eclipse RCP DemoEclipse RCP Demo
Eclipse RCP Demo
 
The Health of Networks
The Health of NetworksThe Health of Networks
The Health of Networks
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Windows 10 preview
Windows 10 previewWindows 10 preview
Windows 10 preview
 
Developing Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkDeveloping Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay Framework
 
Trends and future of java
Trends and future of javaTrends and future of java
Trends and future of java
 
Google Compute Engine
Google Compute EngineGoogle Compute Engine
Google Compute Engine
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Setting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteSetting up a free open source java e-commerce website
Setting up a free open source java e-commerce website
 
CCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSR
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App Engine
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Introduction into windows 8 application development
Introduction into windows 8 application developmentIntroduction into windows 8 application development
Introduction into windows 8 application development
 
Ups and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingUps and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research setting
 
Adopt a JSR NJUG edition
Adopt a JSR NJUG editionAdopt a JSR NJUG edition
Adopt a JSR NJUG edition
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Hadoop and Mapreduce for .NET User Group

  • 1. Introduction to Hadoop and MapReduce Csaba Toth Central California .NET User Group Meeting Date: March 19th, 2014 Location: Bitwise Industries, Fresno
  • 2. Agenda • Big Data • Hadoop • Map Reduce • Demos: – Exercises with on-premise Hadoop emulator – Azure HDInsight
  • 3. Big Data • Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” • Examples: (Wikibon - A Comprehensive List of Big Data Statistics) – 100 Terabytes of data is uploaded to Facebook every day – Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user generated data – Twitter generates 12 Terabytes of data every day – LinkedIn processes and mines Petabytes of user data to power the "People You May Know" feature – YouTube users upload 48 hours of new video content every minute of the day – Decoding of the human genome used to take 10 years. Now it can be done in 7 days
  • 4. Big Data characteristics • Three Vs: Volume, Velocity, Variety • Sources: – Science, Sensors, Social networks, Log files – Public Data Stores, Data warehouse appliances – Network and in-stream monitoring technologies – Legacy documents • Main problems: – Storage Problem – Money Problem – Consuming and processing the data
  • 5. A Little History Two Seminar papers: • “The Google File System” - October 2003 http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware, delivers high aggregate performance • “MapReduce: Simplified Data Processing on Large Clusters” - April 2004 http://queue.acm.org/detail.cfm?id=988408 – Describes a programming model and an implementation for processing large data sets. 1. map function that processes a key/value pair to generate a set of intermediate key/value pairs 2. reduce function that merges all intermediate values associated with the same intermediate key
  • 6. Hadoop vs RDBMS Hadoop / MapReduce RDBMS Size of data Petabytes Gigabytes Integrity of data Low High (referential, typed) Data schema Dynamic Static Access method Interactive and Batch Batch Scaling Linear Nonlinear (worse than linear) Data structure Unstructured Structured Normalization of data Not Required Required Query Response Time Has latency (due to batch processing) Can be near immediate
  • 7. Hadoop • Hadoop is an open-source software framework that supports data-intensive distributed applications. • Has two main pieces: – Storing large amounts of data: HDFS, Hadoop Distributed File System – Processing large amounts of data: implementation of the MapReduce programming model
  • 8. Hadoop • All of this in a cost effective way: Hadoop is managing a cluster of commodity hardware computers. • The cluster is composed of a single master node and multiple worker nodes • It is written in Java, utilizes JVMs
  • 9. Scalable Storage – Hadoop Distributed File System • Store large amounts of data economically • Scaling out instead of scaling up – Each file is split into large blocks (64MB by default) – Each block is replicated to multiple machines (nodes) – A centralized meta-data store has information on the blocks – Each node can fail • Allows storage of large files in fault-tolerant distributed manner
  • 10. Name node HDFS visually Metadata Store Data node Data node Data node Node 1 Node 2 Block A Block B Block A Block B Node 3 Block A Block B
  • 11. Scalable Processing • Task: word counting program • Pseudo-code: 1. Open text file for read, read each line 2. Parse each line into words 3. Increment the count of each word in a dictionary 4. Close the file and output the summary
  • 12. Scalable Processing • Issues: – Data storage – Data distribution – Parallelizable algorithm – Fault tolerance – Aggregation – Storage of results
  • 13. Scalable Processing Issue Considered The Hadoop solution Data storage HDFS Data distribution—The system should be able to distribute data to each of the processing nodes. Hadoop keeps data distribution between nodes to a minimum. It instead moves processing code to each node and processes the data where it is already available on disk. Parallelizable algorithm Hadoop uses the MapReduce programming model to enable a scalable processing model. Fault tolerance Hadoop monitors data storage nodes and will add replicas as nodes become unavailable. Hadoop also monitors tasks assigned to nodes and will reassign if a node becomes unavailable. Aggregation This is accounted for in a distributed manner through the Reduce stage Storage of results HDFS
  • 14. MapReduce • Hadoop leverages the functional programming model of map/reduce. • Moves away from shared resources and related synchronization and contention issues • Thus inherently scalable and suitable for processing large data sets, distributed computing on clusters of computers/nodes. • The goal of map reduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel. • Hadoop leverages a distributed file system to store the data on various nodes.
  • 15. Name node Heart beat signals and communication Job / task management Jobtracker Data node Data node Data node Tasktracker Tasktracker Map 1 Reduce 1 Map 2 Reduce 2 Tasktracker Map 3 Reduce 3
  • 16. MapReduce • It is about two functions: map and reduce 1. Map Step: – Processes a key/value pairs and generate a set of intermediate key/value pairs form that 2. Shuffle step: – Groups all intermediate values associated with the same intermediate key into one set 3. Reduce Step: – Processes the intermediate values associated with the same intermediate key and produces a set of values based on the groups (usually some kind of aggregate)
  • 17. MapReduce – Map step, word count • Input – Key: simply the starting index of the text chunk within the block being processed – ignored – Value: Single line of text – the unit we want to process • Output (implemented by user) – Key: A word (of a line) – Value: Multiplicity of the word – this will be 1 for each encountered word (we want to SUM() these 1s together for each word to get the total count later) – For each input (line) we output a whole set of words, each with the multiplicity of 1.
  • 18. Map step, word count example • Input to mapper • Output by Mapper Key Value Some ignored offset number “Twinkle, Twinkle Little Star” Key Value Twinkle 1 Twinkle 1 Little 1 Star 1
  • 19. MapReduce – Shuffle step, word count • Once the Map stage is over, data collected from the Mappers • During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list tied to the key under which they were registered. • Guarantees that data under a specific key will be sent to exactly one reducer
  • 20. Shuffle step, word count example • Output by Shuffle Key Value Twinkle 1,1 Little 1 Star 1
  • 21. MapRed – Reduce step, word count • Input (inferred from map output) – Key: A word – Value: List of multiplicities • Output (implemented by user) – Key: A word – Value: SUM() of the multiplicities, thus the aggregate count
  • 22. Reduce step, word count example • Output by Reduce Key Value Twinkle 2 Little 1 Star 1
  • 23. MapReduce – Reduce step • Aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve. • That is how the system can process petabytes in a matter of hours: – Mappers can preform the map phase in parallel – Reducers can preform the reduction phase in parallel – Hadoop optimizes data and processing locality
  • 25. Map, Shuffle, and Reduce https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 26. Demo • Simple implementation of word count • Hortonworks HDInsight on-premise (single node, pseudo cluster) • Hadoop SDK, building from source or from NuGet • Mapper source • Reducer source
  • 28. Hadoop architecture Log Data RDBMS Data Integration Layer Flume Sqoop Storage Layer (HDFS) Computing Layer (MapReduce) Advanced Query Engine (Hive, Pig) Data Mining (Pegasus, Mahout) Index, Searches (Lucene) DB drivers (Hive driver) Web Browser (JS)Presentation Layer
  • 29. References • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press • Bruno Terkaly’s presentations (for example Hadoop on Azure: Introduction) • Lynn Langit’s various presentations and YouTube videos
  • 30. Thanks for your attention! • Next time: – Pig and Hive – Recommendation engine – Data mining with Hadoop

Editor's Notes

  1. consists of very large volumes of heterogeneous data that is being generated, often, at high speeds.  These data sets cannot be managed and processed using traditional data management tools and applications at hand.  Big Data requires the use of a new set of tools, applications and frameworks to process and manage the data.
  2. However, it is not just about the total size of data (volume)It is also about the velocity (how rapidly is the data arriving)What is the structure? Does it have variations?ScienceScientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. SensorsData sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.Social networksI am thinking of Facebook, LinkedIn, Yahoo, GoogleSocial influencersBlog comments, YELP likes, Twitter, Facebook likes, Apple's app store, Amazon, ZDNet, etcLog filesComputer and mobile device log files, web site tracking information, application logs, and sensor data. But there are also sensors from vehicles, video games, cable boxes or, soon, household appliancesPublic Data StoresMicrosoft Azure MarketPlace/DataMarket, The World Bank, SEC/Edgar, Wikipedia, IMDbData warehouse appliancesTeradata, IBM Netezza, EMC Greenplum, which includes internal, transactional data that is already prepared for analysis Network and in-stream monitoring technologiesPackets in TCP/IP, email, etcLegacy documentsArchives of statements, insurance forms, medical record and customer correspondence VolumeVolume refers to the size of data that we are working with. With the advancement of technology and with the invention of social media, the amount of data is growing very rapidly.  This data is spread across different places, in different formats, in large volumes ranging from Gigabytes to Terabytes, Petabytes, and even more. Today, the data is not only generated by humans, but large amounts of data is being generated by machines and it surpasses human generated data. This size aspect of data is referred to as Volume in the Big Data world.VelocityVelocity refers to the speed at which the data is being generated. Different applications have different latency requirements and in today's competitive world, decision makers want the necessary data/information in the least amount of time as possible.  Generally, in near real time or real time in certain scenarios. In different fields and different areas of technology, we see data getting generated at different speeds. A few examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many others. This speed aspect of data generation is referred to as Velocity in the Big Data world.VarietyVariety refers to the different formats in which the data is being generated/stored. Different applications generate/store the data in different formats. In today's world, there are large volumes of unstructured data being generated apart from the structured data getting generated in enterprises. Until the advancements in Big Data technologies, the industry didn't have any powerful and reliable tools/technologies which can work with such voluminous unstructured data that we see today. In today's world, organizations not only need to rely on the structured data from enterprise databases/warehouses, they are also forced to consume lots of data that is being generated both inside and outside of the enterprise like clickstream data, social media, etc. to stay competitive. Apart from the traditional flat files, spreadsheets, relational databases etc., we have a lot of unstructured data stored in the form of images, audio files, video files, web logs, censor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.
  3. The Google File System It is about a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clientshttp://research.google.com/archive/gfs.htmlMapReduce: Simplified Data Processing on Large Clusters MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate keyhttp://research.google.com/archive/mapreduce.html
  4. HDFS is a file system designed to store large amounts of data economically. It does this by storing dataon multiple commodity machines—scaling out instead of scaling up.HDFS, for all its underlying magic, is simple to understand at a conceptual level. Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting isconfigurable). Each block is then stored on multiple machines that are part of the HDFS cluster. A centralizedmetadata store has information on where individual parts of a file are stored. Considering that HDFS is implemented on commodity hardware, machines and disks areexpected to fail. When a node fails, HDFS will ensure that data blocks the node held arereplicated to other systems.This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.
  5. we discuss MapReduce, it will be helpful to carefully consider the issues associated with scalingout the processing of data across multiple machines. We will do this using a simple example. Assume wehave a text file, and we would like an individual count of all words that appear in that text file.This is the pseudo-code for a simple word-counting program that runs on a single machine:ď‚· Open the text file for reading, and read each line.ď‚· Parse each line into words.ď‚· Increment and store the count of each word as it appears in a dictionary or similar structure.ď‚· Close the file and output summary information.Simple enough. Now consider that you have several gigabytes (maybe petabytes) of text files. How willwe modify the simple program described above to process this kind of information by scaling out3 acrossmultiple machines?
  6. Some issues to consider: Data storage—The system should provide a way to store the data being processed. Data distribution—The system should be able to distribute data to each of the processingnodes.Parallelizable algorithm—The processing algorithm should be parallelizable. Each nodeshould be able to run without being held up by another during any given stage. If nodeshave to share data, the difficulties associated with the synchronization of such data are tobe considered. Fault tolerance—Any of the nodes processing data can fail. In order for the system to beresilient given the failure of individual nodes, the failure should be promptly detected andthe work should be delegated to another available node. Aggregation—The system should have a way to aggregate results produced by individualnodes to compute final summaries. Storage of results—The final output can itself be substantial; there should be an effectiveway to store and retrieve it.