SlideShare a Scribd company logo
1 of 25
Introduction to Hadoop and
MapReduce
Csaba Toth
GDG Fresno Meeting
Date: February 6th, 2014
Location: The Hashtag, Fresno
Agenda
•
•
•
•
•

Big Data
A little history
Hadoop
Map Reduce
Demo: Hadoop with Google Compute Engine
and Google Cloud Storage
Big Data
• Wikipedia: “collection of data sets so large and complex that it
becomes difficult to process using on-hand database management
tools or traditional data processing applications”
• Examples: (Wikibon - A Comprehensive List of Big Data Statistics)
– 100 Terabytes of data is uploaded to Facebook every day
– Facebook Stores, Processes, and Analyzes more than 30 Petabytes of
user generated data
– Twitter generates 12 Terabytes of data every day
– LinkedIn processes and mines Petabytes of user data to power the
"People You May Know" feature
– YouTube users upload 48 hours of new video content every minute of
the day
– Decoding of the human genome used to take 10 years. Now it can be
done in 7 days
Big Data characteristics
• Three Vs: Volume, Velocity, Variety
• Sources:
–
–
–
–

Science, Sensors, Social networks, Log files
Public Data Stores, Data warehouse appliances
Network and in-stream monitoring technologies
Legacy documents

• Main problems:
– Storage Problem
– Money Problem
– Consuming and processing the data
A Little History
Two Seminar papers:
• “The Google File System” - October 2003
http://labs.google.com/papers/gfs.html
– describes a scalable, distributed, fault-tolerant file system
tailored for data-intensive applications, running on inexpensive
commodity hardware, delivers high aggregate performance

• “MapReduce: Simplified Data Processing on Large Clusters”
- April 2004 http://queue.acm.org/detail.cfm?id=988408
– Describes a programming model and an implementation for
processing large data sets.
1.

2.

map function that processes a key/value pair to generate a set of
intermediate key/value pairs
reduce function that merges all intermediate values associated with
the same intermediate key
Hadoop
• Hadoop is an open-source software framework
that supports data-intensive distributed
applications.
• It is written in Java, utilizes JVMs
• Named after it’s creator’s (Doug Cutting, Yahoo)
son’s toy elephant
• Hadoop is managing a cluster of commodity
hardware computers. The cluster is composed of
a single master node and multiple worker nodes
Hadoop vs RDBMS
Hadoop / MapReduce

RDBMS

Size of data

Petabytes

Gigabytes

Integrity of data

Low

High (referential, typed)

Data schema

Dynamic

Static

Access method

Interactive and Batch

Batch

Scaling

Linear

Nonlinear (worse than
linear)

Data structure

Unstructured

Structured

Normalization of data

Not Required

Required

Query Response Time

Has latency (due to batch
processing)

Can be near immediate
MapReduce
• Hadoop leverages the programming model of
map/reduce. It is optimized for processing large data
sets.
• MapReduce is an essential technique to do distributed
computing on clusters of computers/nodes.
• The goal of map reduce is to break huge data sets into
smaller pieces, distribute those pieces to various
worker nodes, and process the data in parallel.
• Hadoop leverages a distributed file system to store the
data on various nodes.
MapReduce
• It is about two functions: map and reduce
1. Map Step:
– it is about dividing the problem into smaller subproblems. A master node has the job of distributing
the work to worker nodes. The worker node just does
one thing and returns the work back to the master
node.

2. Reduce Step:
– Once the master gets the work from the worker
nodes, the reduce step takes over and combines all
the work. By combining the work you can form some
answer and ultimately output.
MapReduce – Map step
• There is a master node and many slave nodes.
• The master node takes the input, divides it into
smaller sub-problems, and distributes the input
to worker or slave nodes. worker node may do
this again in turn, leading to a multi-level tree
structure.
• The worker/slave nodes processes the data into a
smaller problem, and passes the answer back to
its master node.
• Each mapping operation is independent of the
others, all maps can be performed in parallel.
MapReduce – Reduce step
• The master node then collects the answers
from the worker or slave nodes. It then
aggregates the answers and creates the
needed output, which is the answer to the
problem it was originally trying to solve.
• Reducers can also preform the reduction
phase in parallel. That is how the system can
process petabytes in a matter of hours.
Map, Shuffle, and Reduce

https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Word count

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
Hadoop architecture
•
•
•
•

Job Tracker
Task Tracker
Name Node
Data Node
Figures
• Following: some figures from the book
Hadoop: The Definitive Guide, 3rd Edition
A client reading data from HDFS
A client writing data to HDFS
Network distance in Hadoop
MapReduce data flow with a single
reduce task
MapReduce data flow with multiple
reduce tasks
Hadoop architecture
Presentation
Layer

Web Browser (JS)

Data Mining
(Pegasus,
Mahout)

Index,
Searches
(Lucene)

DB drivers
(Hive driver)

Advanced Query Engine (Hive, Pig)
Computing Layer (MapReduce)
Storage Layer (HDFS)
Data Integration Layer
Flume

Sqoop

Log Data

RDBMS
Demo
• Google Compute Engine + Google Cloud Storage
• Using Ubuntu as a remote control host
• Following the tutorial of:
– https://github.com/GoogleCloudPlatform/solutions-google-computeengine-cluster-for-hadoop
– Hadoop on Google Compute Engine for Processing Big Data:
https://www.youtube.com/watch?v=se9vV8eIZME

• The example hadoop job is an advanced version of
word count in perl or python: the words are sorted by
length and abc
• Showing also Google Developer Tool web interface
References
• Google’s tutorial (see github and YouTube link of the
Demo)
• Tom White: Hadoop: The Definitive Guide, 3rd Edition,
Yahoo Press
• Lynn Langit’s various presentations and YouTube videos
• Dattatrey Sindol: Big Data Basics - Part 1 - Introduction
to Big Data
• Bruno Terkaly’s presentations (for example Hadoop on
Azure: Introduction)
• Daniel Jebaraj: Ignore HDInsight at Your Own Peril:
Everything You Need to Know
Thanks for your attention!

More Related Content

What's hot

Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Renato Bonomini
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation Shivanee garg
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processingroyans
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
 
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop: Distributed data processing
Hadoop: Distributed data processingHadoop: Distributed data processing
Hadoop: Distributed data processing
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
Hadoop
Hadoop Hadoop
Hadoop
 

Viewers also liked

Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database ChoicesLynn Langit
 
Shuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop TerasortShuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop Terasortpramodbiligiri
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseAsis Mohanty
 
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)Channy Yun
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 

Viewers also liked (8)

Not only SQL - Database Choices
Not only SQL - Database ChoicesNot only SQL - Database Choices
Not only SQL - Database Choices
 
Shuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop TerasortShuffle phase as the bottleneck in Hadoop Terasort
Shuffle phase as the bottleneck in Hadoop Terasort
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Hadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouseHadoop Architecture Options for Existing Enterprise DataWarehouse
Hadoop Architecture Options for Existing Enterprise DataWarehouse
 
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)
Daum 내부 빅데이터 및 클라우드 기술 활용 사례- 윤석찬 (2012)
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 

Similar to Introduction to Hadoop and MapReduce

4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce ParadigmTarjMehta1
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 

Similar to Introduction to Hadoop and MapReduce (20)

Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop MapReduce Paradigm
Hadoop MapReduce ParadigmHadoop MapReduce Paradigm
Hadoop MapReduce Paradigm
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Anju
AnjuAnju
Anju
 
Hadoop by sunitha
Hadoop by sunithaHadoop by sunitha
Hadoop by sunitha
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Hadoop
HadoopHadoop
Hadoop
 

More from Csaba Toth

Git, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesGit, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesCsaba Toth
 
Eclipse RCP Demo
Eclipse RCP DemoEclipse RCP Demo
Eclipse RCP DemoCsaba Toth
 
The Health of Networks
The Health of NetworksThe Health of Networks
The Health of NetworksCsaba Toth
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQueryCsaba Toth
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Windows 10 preview
Windows 10 previewWindows 10 preview
Windows 10 previewCsaba Toth
 
Developing Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkDeveloping Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkCsaba Toth
 
Trends and future of java
Trends and future of javaTrends and future of java
Trends and future of javaCsaba Toth
 
Google Compute Engine
Google Compute EngineGoogle Compute Engine
Google Compute EngineCsaba Toth
 
Google App Engine
Google App EngineGoogle App Engine
Google App EngineCsaba Toth
 
Setting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteSetting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteCsaba Toth
 
CCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCsaba Toth
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineCsaba Toth
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User GroupCsaba Toth
 
Introduction into windows 8 application development
Introduction into windows 8 application developmentIntroduction into windows 8 application development
Introduction into windows 8 application developmentCsaba Toth
 
Ups and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingUps and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingCsaba Toth
 
Adopt a JSR NJUG edition
Adopt a JSR NJUG editionAdopt a JSR NJUG edition
Adopt a JSR NJUG editionCsaba Toth
 

More from Csaba Toth (17)

Git, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websitesGit, GitHub gh-pages and static websites
Git, GitHub gh-pages and static websites
 
Eclipse RCP Demo
Eclipse RCP DemoEclipse RCP Demo
Eclipse RCP Demo
 
The Health of Networks
The Health of NetworksThe Health of Networks
The Health of Networks
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Windows 10 preview
Windows 10 previewWindows 10 preview
Windows 10 preview
 
Developing Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay FrameworkDeveloping Multi Platform Games using PlayN and TriplePlay Framework
Developing Multi Platform Games using PlayN and TriplePlay Framework
 
Trends and future of java
Trends and future of javaTrends and future of java
Trends and future of java
 
Google Compute Engine
Google Compute EngineGoogle Compute Engine
Google Compute Engine
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Setting up a free open source java e-commerce website
Setting up a free open source java e-commerce websiteSetting up a free open source java e-commerce website
Setting up a free open source java e-commerce website
 
CCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSRCCJUG inaugural meeting and Adopt a JSR
CCJUG inaugural meeting and Adopt a JSR
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App Engine
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Introduction into windows 8 application development
Introduction into windows 8 application developmentIntroduction into windows 8 application development
Introduction into windows 8 application development
 
Ups and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research settingUps and downs of enterprise Java app in a research setting
Ups and downs of enterprise Java app in a research setting
 
Adopt a JSR NJUG edition
Adopt a JSR NJUG editionAdopt a JSR NJUG edition
Adopt a JSR NJUG edition
 

Recently uploaded

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Introduction to Hadoop and MapReduce

  • 1. Introduction to Hadoop and MapReduce Csaba Toth GDG Fresno Meeting Date: February 6th, 2014 Location: The Hashtag, Fresno
  • 2. Agenda • • • • • Big Data A little history Hadoop Map Reduce Demo: Hadoop with Google Compute Engine and Google Cloud Storage
  • 3. Big Data • Wikipedia: “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” • Examples: (Wikibon - A Comprehensive List of Big Data Statistics) – 100 Terabytes of data is uploaded to Facebook every day – Facebook Stores, Processes, and Analyzes more than 30 Petabytes of user generated data – Twitter generates 12 Terabytes of data every day – LinkedIn processes and mines Petabytes of user data to power the "People You May Know" feature – YouTube users upload 48 hours of new video content every minute of the day – Decoding of the human genome used to take 10 years. Now it can be done in 7 days
  • 4. Big Data characteristics • Three Vs: Volume, Velocity, Variety • Sources: – – – – Science, Sensors, Social networks, Log files Public Data Stores, Data warehouse appliances Network and in-stream monitoring technologies Legacy documents • Main problems: – Storage Problem – Money Problem – Consuming and processing the data
  • 5. A Little History Two Seminar papers: • “The Google File System” - October 2003 http://labs.google.com/papers/gfs.html – describes a scalable, distributed, fault-tolerant file system tailored for data-intensive applications, running on inexpensive commodity hardware, delivers high aggregate performance • “MapReduce: Simplified Data Processing on Large Clusters” - April 2004 http://queue.acm.org/detail.cfm?id=988408 – Describes a programming model and an implementation for processing large data sets. 1. 2. map function that processes a key/value pair to generate a set of intermediate key/value pairs reduce function that merges all intermediate values associated with the same intermediate key
  • 6. Hadoop • Hadoop is an open-source software framework that supports data-intensive distributed applications. • It is written in Java, utilizes JVMs • Named after it’s creator’s (Doug Cutting, Yahoo) son’s toy elephant • Hadoop is managing a cluster of commodity hardware computers. The cluster is composed of a single master node and multiple worker nodes
  • 7. Hadoop vs RDBMS Hadoop / MapReduce RDBMS Size of data Petabytes Gigabytes Integrity of data Low High (referential, typed) Data schema Dynamic Static Access method Interactive and Batch Batch Scaling Linear Nonlinear (worse than linear) Data structure Unstructured Structured Normalization of data Not Required Required Query Response Time Has latency (due to batch processing) Can be near immediate
  • 8. MapReduce • Hadoop leverages the programming model of map/reduce. It is optimized for processing large data sets. • MapReduce is an essential technique to do distributed computing on clusters of computers/nodes. • The goal of map reduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel. • Hadoop leverages a distributed file system to store the data on various nodes.
  • 9. MapReduce • It is about two functions: map and reduce 1. Map Step: – it is about dividing the problem into smaller subproblems. A master node has the job of distributing the work to worker nodes. The worker node just does one thing and returns the work back to the master node. 2. Reduce Step: – Once the master gets the work from the worker nodes, the reduce step takes over and combines all the work. By combining the work you can form some answer and ultimately output.
  • 10. MapReduce – Map step • There is a master node and many slave nodes. • The master node takes the input, divides it into smaller sub-problems, and distributes the input to worker or slave nodes. worker node may do this again in turn, leading to a multi-level tree structure. • The worker/slave nodes processes the data into a smaller problem, and passes the answer back to its master node. • Each mapping operation is independent of the others, all maps can be performed in parallel.
  • 11. MapReduce – Reduce step • The master node then collects the answers from the worker or slave nodes. It then aggregates the answers and creates the needed output, which is the answer to the problem it was originally trying to solve. • Reducers can also preform the reduction phase in parallel. That is how the system can process petabytes in a matter of hours.
  • 12. Map, Shuffle, and Reduce https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 15. Figures • Following: some figures from the book Hadoop: The Definitive Guide, 3rd Edition
  • 16. A client reading data from HDFS
  • 17. A client writing data to HDFS
  • 19.
  • 20. MapReduce data flow with a single reduce task
  • 21. MapReduce data flow with multiple reduce tasks
  • 22. Hadoop architecture Presentation Layer Web Browser (JS) Data Mining (Pegasus, Mahout) Index, Searches (Lucene) DB drivers (Hive driver) Advanced Query Engine (Hive, Pig) Computing Layer (MapReduce) Storage Layer (HDFS) Data Integration Layer Flume Sqoop Log Data RDBMS
  • 23. Demo • Google Compute Engine + Google Cloud Storage • Using Ubuntu as a remote control host • Following the tutorial of: – https://github.com/GoogleCloudPlatform/solutions-google-computeengine-cluster-for-hadoop – Hadoop on Google Compute Engine for Processing Big Data: https://www.youtube.com/watch?v=se9vV8eIZME • The example hadoop job is an advanced version of word count in perl or python: the words are sorted by length and abc • Showing also Google Developer Tool web interface
  • 24. References • Google’s tutorial (see github and YouTube link of the Demo) • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press • Lynn Langit’s various presentations and YouTube videos • Dattatrey Sindol: Big Data Basics - Part 1 - Introduction to Big Data • Bruno Terkaly’s presentations (for example Hadoop on Azure: Introduction) • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know
  • 25. Thanks for your attention!

Editor's Notes

  1. consists of very large volumes of heterogeneous data that is being generated, often, at high speeds.  These data sets cannot be managed and processed using traditional data management tools and applications at hand.  Big Data requires the use of a new set of tools, applications and frameworks to process and manage the data.
  2. However, it is not just about the total size of data (volume)It is also about the velocity (how rapidly is the data arriving)What is the structure? Does it have variations?ScienceScientists are regularly challenged by large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. SensorsData sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks.Social networksI am thinking of Facebook, LinkedIn, Yahoo, GoogleSocial influencersBlog comments, YELP likes, Twitter, Facebook likes, Apple's app store, Amazon, ZDNet, etcLog filesComputer and mobile device log files, web site tracking information, application logs, and sensor data. But there are also sensors from vehicles, video games, cable boxes or, soon, household appliancesPublic Data StoresMicrosoft Azure MarketPlace/DataMarket, The World Bank, SEC/Edgar, Wikipedia, IMDbData warehouse appliancesTeradata, IBM Netezza, EMC Greenplum, which includes internal, transactional data that is already prepared for analysis Network and in-stream monitoring technologiesPackets in TCP/IP, email, etcLegacy documentsArchives of statements, insurance forms, medical record and customer correspondence VolumeVolume refers to the size of data that we are working with. With the advancement of technology and with the invention of social media, the amount of data is growing very rapidly.  This data is spread across different places, in different formats, in large volumes ranging from Gigabytes to Terabytes, Petabytes, and even more. Today, the data is not only generated by humans, but large amounts of data is being generated by machines and it surpasses human generated data. This size aspect of data is referred to as Volume in the Big Data world.VelocityVelocity refers to the speed at which the data is being generated. Different applications have different latency requirements and in today's competitive world, decision makers want the necessary data/information in the least amount of time as possible.  Generally, in near real time or real time in certain scenarios. In different fields and different areas of technology, we see data getting generated at different speeds. A few examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many others. This speed aspect of data generation is referred to as Velocity in the Big Data world.VarietyVariety refers to the different formats in which the data is being generated/stored. Different applications generate/store the data in different formats. In today's world, there are large volumes of unstructured data being generated apart from the structured data getting generated in enterprises. Until the advancements in Big Data technologies, the industry didn't have any powerful and reliable tools/technologies which can work with such voluminous unstructured data that we see today. In today's world, organizations not only need to rely on the structured data from enterprise databases/warehouses, they are also forced to consume lots of data that is being generated both inside and outside of the enterprise like clickstream data, social media, etc. to stay competitive. Apart from the traditional flat files, spreadsheets, relational databases etc., we have a lot of unstructured data stored in the form of images, audio files, video files, web logs, censor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.
  3. The Google File System It is about a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clientshttp://research.google.com/archive/gfs.htmlMapReduce: Simplified Data Processing on Large Clusters MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate keyhttp://research.google.com/archive/mapreduce.html