HADOOP (A BIG DATA INITIATIVE)
-Mansi Mehra
AGENDA
 Defining the problem – 3Vs
  Why traditional storage doesn’t work
 How does Hadoop work?
 HDFS (Hadoop 1.0 Vs 2.0)
 YARN (2.0)- Yet Another Resource Negotiator
 Map Reduce
 When we don’t know how to code
 Hive (Overview)
 PIG (Overview)
  HBase (Overview)
  ZooKeeper (Overview)
 Spark (Overview)
DEFINING THE PROBLEM – 3VS
  Volume - Lots and lots of data
  Datasets are too large and complex for a traditional relational database
  Challenges: capture, curation, storage, search, sharing, transfer, analysis and visualization.
DEFINING THE PROBLEM – 3VS (CONTD.)
 Velocity - Huge amounts of data generated at
incredible speed
 NYSE generates about 1 TB of new trade data per day
  AT&T anonymized Call Detail Records (CDRs) peak at around 1 GB per hour.
 Variety - Differently formatted data sets from
different sources
  Twitter keeps track of tweets, Facebook produces posts and likes data, YouTube streams videos
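To put the velocity figures in perspective, the short sketch below converts them into average sustained rates. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and a perfectly even flow over the period, neither of which is stated on the slide; the function and variable names are illustrative.

```python
# Convert the slide's headline figures into average bytes-per-second rates.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def rate_mb_per_s(total_bytes: int, seconds: int) -> float:
    """Average throughput in megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 10**6


nyse_rate = rate_mb_per_s(10**12, SECONDS_PER_DAY)   # 1 TB per day
cdr_rate = rate_mb_per_s(10**9, SECONDS_PER_HOUR)    # 1 GB per hour

print(f"NYSE trades: {nyse_rate:.2f} MB/s")
print(f"AT&T CDRs:   {cdr_rate:.2f} MB/s")
```

Real feeds are bursty, so peak rates would be well above these averages.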
WHY TRADITIONAL STORAGE DOESN’T WORK
  Unstructured data is exploding; little of the data produced is relational in nature.
 No redundancy
 High computational cost
 Capacity limit for structured data (costly hardware)
  Expensive licenses
Data type                                        Nature
XML                                              Semi-structured
Word docs, PDF files, etc.                       Unstructured
Email body                                       Unstructured
Data from enterprise systems (ERP, CRM, etc.)    Structured
HOW DOES HADOOP WORK?
HDFS (HADOOP 1.0 VS 2.0)
HDFS (HADOOP 2.0)
YARN (2.0) - YET ANOTHER RESOURCE NEGOTIATOR
  Cluster resource management framework for Hadoop 2.0.
  YARN has a Resource Manager, which:
  Manages and allocates cluster resources
  Improves performance and Quality of Service
MAP REDUCE
  Programming model; Hadoop's implementation is written in Java
  Works on large amounts of data
  Provides redundancy & fault tolerance
  Runs the code on the data nodes, close to the data
MAP REDUCE (CONTD.)
 Steps for Map Reduce:
 Read in lots of data
 Map: extract something you care about from each
record/line.
 Shuffle and sort
 Reduce: aggregate, summarize, filter or transform
 Write results.
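The steps above can be sketched on a single machine with a word count. This is a minimal illustration in plain Python, not Hadoop's Java API; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`) are made up for the example, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_and_sort(mapped)
    return [reduce_phase(key, values) for key, values in grouped]


lines = ["big data big cluster", "data node"]   # "read in lots of data"
result = word_count(lines)
print(result)                                    # "write results"
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread across many nodes without coordination beyond the shuffle.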
WHEN WE DON’T KNOW HOW TO CODE
HIVE (OVERVIEW)
 Data warehouse infrastructure built on top of Hadoop
  Compiles SQL-like (HiveQL) queries into MapReduce jobs and runs them on the cluster.
 Brings structure to unstructured data
 Key Building Principles:
 Structured data with rich data types (structs, lists and maps)
 Directly query data from different formats (text/binary) and file
formats (Flat/sequence).
 SQL as a familiar programming tool and for standard analytics
 Types of applications:
 Summarization: Daily/weekly aggregations
 Ad hoc analysis
 Data Mining
 Spam detection
 Many more ….
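As a rough illustration of "bringing structure" to raw text and then running a daily aggregation, the sketch below applies a schema to tab-separated lines at read time and counts rows per day. It is plain Python, not Hive; the `PageView` record, its field names, and the sample rows are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PageView:
    day: str                 # e.g. "2024-01-01"
    user: str
    tags: List[str]          # list type
    props: Dict[str, str]    # map type


def parse_row(line: str) -> PageView:
    """Apply the schema to one raw tab-separated text line at read time."""
    day, user, tags, props = line.rstrip("\n").split("\t")
    return PageView(
        day=day,
        user=user,
        tags=tags.split(",") if tags else [],
        props=dict(item.split("=", 1) for item in props.split(",")) if props else {},
    )


raw_lines = [
    "2024-01-01\talice\tnews,sports\tdevice=mobile",
    "2024-01-01\tbob\tnews\tdevice=desktop",
    "2024-01-02\talice\t\tdevice=mobile",
]

# Roughly equivalent in spirit to:
#   SELECT day, COUNT(*) FROM page_views WHERE props['device'] = 'mobile' GROUP BY day
views = [parse_row(line) for line in raw_lines]
daily_mobile = Counter(v.day for v in views if v.props.get("device") == "mobile")
print(sorted(daily_mobile.items()))
```

In Hive the same grouping would be compiled into map and reduce stages rather than run in a single process.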
PIG (OVERVIEW)
 High level dataflow language
 Has its own syntax (Preferable for people with
programming background)
 Compiler that produces sequences of MapReduce
programs.
  Structure is amenable to substantial parallelization.
  Key properties of PIG:
  Ease of programming: Trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks
  Optimization opportunities: the system can optimize execution automatically, so the user can focus on semantics rather than efficiency.
 Extensibility: Users can create their own functions to do
special purpose processing.
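The dataflow style and the extensibility point can be sketched as a chain of small relational steps with a user-defined function plugged in at the end. This is plain Python that imitates the shape of a LOAD, FILTER, GROUP, FOREACH pipeline; it is not Pig Latin, and the helper names and sample log rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def load(rows):
    """LOAD: materialize the input relation as a list of tuples."""
    return list(rows)


def filter_rows(relation, predicate):
    """FILTER ... BY: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def group_by(relation, key_index):
    """GROUP ... BY: collect rows that share the same key field."""
    ordered = sorted(relation, key=itemgetter(key_index))
    return [(key, list(rows)) for key, rows in groupby(ordered, key=itemgetter(key_index))]


def foreach(grouped, udf):
    """FOREACH ... GENERATE: apply a function to every group."""
    return [(key, udf(rows)) for key, rows in grouped]


def total_bytes(rows):
    """Hypothetical user-defined function: sum the third field of each row."""
    return sum(row[2] for row in rows)


logs = load([
    ("alice", "GET", 120),
    ("bob", "POST", 300),
    ("alice", "GET", 80),
    ("bob", "GET", 50),
])
gets = filter_rows(logs, lambda row: row[1] == "GET")
per_user = foreach(group_by(gets, 0), total_bytes)
print(per_user)
```

Each step takes a relation and returns a new one, which is what lets a compiler map the steps onto a sequence of MapReduce jobs.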
HBASE (OVERVIEW)
 HBase is a distributed column-oriented data store built
on top of HDFS.
 Data is logically organized into tables, rows and
columns.
  HDFS is good for batch processing (scans over big files), but it is:
 Not good for record lookup.
 Not good for incremental addition of small batches.
 Not good for updates.
 HBase is designed to efficiently address the above
points
 Fast record lookup
 Support for record level insertion
 Support for updates (not in place).
 Updates are done by creating new versions of values.
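The idea that updates create new versions rather than overwriting in place can be sketched with a small in-memory table. This is a toy model under stated assumptions, not the HBase client API: the `VersionedTable` class, its logical clock, and the `max_versions` limit are all illustrative.

```python
from collections import defaultdict
from itertools import count


class VersionedTable:
    """Toy table where every put appends a new timestamped version of a cell."""

    def __init__(self, max_versions=3):
        self._cells = defaultdict(list)   # (row, column) -> [(timestamp, value), ...]
        self._clock = count(1)            # logical clock standing in for wall-clock time
        self._max_versions = max_versions

    def put(self, row, column, value):
        """Insert or 'update' a cell by adding a newer version; nothing is overwritten."""
        versions = self._cells[(row, column)]
        versions.append((next(self._clock), value))
        del versions[:-self._max_versions]   # keep only the newest N versions

    def get(self, row, column):
        """Record lookup by key: return the newest value, or None if absent."""
        versions = self._cells.get((row, column))
        return versions[-1][1] if versions else None

    def history(self, row, column):
        """All retained versions, newest first."""
        return list(reversed(self._cells.get((row, column), [])))


table = VersionedTable(max_versions=2)
table.put("user1", "info:email", "a@example.com")
table.put("user1", "info:email", "b@example.com")
table.put("user1", "info:email", "c@example.com")

print(table.get("user1", "info:email"))
print(table.history("user1", "info:email"))
```

Reads return the newest version by default, while older versions stay available until they age out of the retention limit.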
ZOOKEEPER (OVERVIEW)
  ZooKeeper is a distributed, open source coordination service for distributed applications.
  Exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.
  Coordination services are notoriously hard to get right. They are prone to errors like race conditions and deadlock.
  The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
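One higher-level service that is commonly built from such primitives is leader election: each participant registers under an increasing sequence number, and the lowest surviving number is the leader. The sketch below models that idea in memory only. It is not the ZooKeeper client API, and the `ToyCoordinator` class and its methods are hypothetical.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}            # sequence number -> client name

    def join(self, client):
        """Register a client under the next sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = client
        return seq

    def leave(self, seq):
        """Remove a client's entry, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The client holding the lowest sequence number is the leader."""
        return self._members[min(self._members)] if self._members else None


coordinator = ToyCoordinator()
a = coordinator.join("worker-a")
b = coordinator.join("worker-b")
first_leader = coordinator.leader()

coordinator.leave(a)                  # worker-a fails or disconnects
second_leader = coordinator.leader()
print(first_leader, second_leader)
```

A real coordination service also has to make these operations atomic and consistent across several servers, which is the part that is hard to get right from scratch.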
SPARK (OVERVIEW)
  Motivation: The MapReduce programming model transforms data flowing from stable storage to stable storage (disk to disk).
 Acyclic data flow is a powerful abstraction, but not efficient for
applications that repeatedly reuse a working set of data.
 Iterative algorithms
 Interactive data mining
 Spark makes working sets a first-class concept to efficiently
support these applications.
 Goal:
 To provide distributed memory abstractions for clusters to support
apps with working sets.
 Retain the attractive properties of map reduce.
 Fault tolerance
 Data locality
 Scalability
 Augment data flow model with “resilient distributed datasets”
(RDDs)
SPARK (OVERVIEW CONTD.)
 Resilient distributed datasets (RDDs)
 Immutable collections partitioned across cluster that can
be rebuilt if a partition is lost.
  Created by transforming data in stable storage using data flow operators (map, filter, group-by, ...)
  Can be cached across parallel operations.
  Parallel operations on RDDs.
  Reduce, collect, count, save, ...
 Restricted shared variables
 Accumulators, broadcast variables.
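The RDD properties listed above (immutable, partitioned, rebuilt from lineage, optionally cached) can be sketched with a toy class. This is a single-process illustration, not Spark's API: `ToyRDD`, `from_partitions`, and `lose_partition` are hypothetical names, and the nested lists stand in for data in stable storage.

```python
class ToyRDD:
    """Immutable, partitioned dataset that remembers how to rebuild itself."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition   # lineage: partition index -> list of records
        self.num_partitions = num_partitions
        self._cache = {}                    # filled only after cache() is called
        self._cached = False

    @classmethod
    def from_partitions(cls, partitions):
        """Source data standing in for stable storage."""
        frozen = [tuple(p) for p in partitions]
        return cls(lambda i: list(frozen[i]), len(frozen))

    def map(self, fn):
        return ToyRDD(lambda i: [fn(x) for x in self._partition(i)], self.num_partitions)

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self._partition(i) if pred(x)], self.num_partitions)

    def cache(self):
        """Ask for computed partitions to be kept in memory for reuse."""
        self._cached = True
        return self

    def _partition(self, i):
        if self._cached:
            if i not in self._cache:
                self._cache[i] = self._compute(i)   # rebuild from lineage on a miss
            return self._cache[i]
        return self._compute(i)

    def lose_partition(self, i):
        """Simulate losing a cached partition, e.g. a failed worker."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self._partition(i)]

    def count(self):
        return sum(len(self._partition(i)) for i in range(self.num_partitions))


numbers = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

before = even_squares.collect()
even_squares.lose_partition(1)        # the cached copy of partition 1 is gone
after = even_squares.collect()        # recomputed from the recorded transformations
print(before, after, even_squares.count())
```

Transformations only record how to derive each partition, so a lost partition is recomputed from its parents instead of being restored from a replica.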
SPARK (OVERVIEW CONTD.)
  Fast MapReduce-like engine
  Uses in-memory cluster computing
  Compatible with Hadoop storage API.
  Has APIs in Scala, Java and Python.
  Useful for large datasets and iterative algorithms.
  Up to 40x faster than MapReduce.
  Support for:
  Spark SQL: SQL queries on Spark (Hive-compatible)
  MLlib: Machine learning library
  GraphX: Graph processing.
THANK YOU!


Editor's Notes

  1. Founder: Doug Cutting. He named it after his son’s toy elephant
  2. Active NameNode: To provide HDFS high availability, the architecture now has an active and a standby NameNode. The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data itself, only the metadata that references it. Client applications talk to the NameNode whenever they wish to locate a file, or to add, copy, move or delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
     It is essential to look after the NameNode. Here are some recommendations from production use:
      Use a good server with lots of RAM. The more RAM you have, the bigger the file system, or the smaller the block size.
      Use ECC RAM.
      On Java6u15 or later, run the server VM with compressed pointers (-XX:+UseCompressedOops) to cut the JVM heap size down.
      List more than one NameNode directory in the configuration, so that multiple copies of the file system metadata are stored. As long as the directories are on separate disks, a single disk failure will not corrupt the metadata.
      Configure the NameNode to store one set of transaction logs on a separate disk from the image.
      Configure the NameNode to store another set of transaction logs on a network-mounted disk.
      Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
      Do not host DataNode, JobTracker or TaskTracker services on the same system.
     DataNodes: A DataNode stores data in HDFS. A functional file system has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for file system operations. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, and indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data. DataNode instances can talk to each other, which is what they do when they are replicating data. There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers rather than across multiple disks on the same server. An ideal configuration is for a server to have a DataNode, a TaskTracker, and then one physical disk and one TaskTracker slot per CPU. This gives every TaskTracker 100% of a CPU and separate disks to read and write data. Avoid using NFS for data storage in production systems.
     Node Manager: The NM is YARN’s per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource Manager (RM), overseeing containers’ life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, log management, and auxiliary services that may be exploited by different YARN applications.
     Resource Manager: The RM is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NMs and the per-application ApplicationMasters (AMs).
     Application Masters: Responsible for negotiating resources with the Resource Manager and for working with the Node Managers to start the containers.
  4. Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS).
     HBase is an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
     Hive is used for fully structured data, whereas Pig is used for semi-structured data. Hive is mainly used for creating reports, whereas Pig is mainly used for programming.
     Ambari: A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.
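The division of labour between the NameNode and the DataNodes described in the notes (metadata in one place, data blocks replicated elsewhere, clients reading directly from DataNodes) can be sketched as follows. This is a toy in-memory model, not HDFS code: the class names, the 4-byte block size, the replication factor of 2, and the round-robin placement are all assumptions made for the example.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}                      # block id -> bytes


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.files = {}                       # path -> [block id, ...]
        self.locations = {}                   # block id -> [datanode, ...]

    def allocate(self, path, num_blocks):
        block_ids = [f"{path}#blk{i}" for i in range(num_blocks)]
        for i, block_id in enumerate(block_ids):
            # Round-robin placement is an assumption made for this sketch.
            self.locations[block_id] = [
                self.datanodes[(i + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
        self.files[path] = block_ids
        return block_ids

    def locate(self, path):
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, path, data, block_size=4):
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        # Simplification: the client writes every replica itself.
        for node in namenode.locations[block_id]:
            node.blocks[block_id] = chunk     # data goes to DataNodes, not the NameNode


def read_file(namenode, path, failed=()):
    out = b""
    for block_id, nodes in namenode.locate(path):
        live = next(n for n in nodes if n.name not in failed)
        out += live.blocks[block_id]          # client reads directly from a DataNode
    return out


nodes = [ToyDataNode(f"dn{i}") for i in range(3)]
namenode = ToyNameNode(nodes)
write_file(namenode, "/logs/a.txt", b"hello hadoop")

normal_read = read_file(namenode, "/logs/a.txt")
degraded_read = read_file(namenode, "/logs/a.txt", failed={"dn0"})
print(normal_read, degraded_read)
```

Because every block lives on more than one DataNode, the read still succeeds when one node is unavailable, which is why the notes say RAID is usually unnecessary for DataNode storage.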