Hadoop
in 15 slides or less
Alex Pongpech 2016
Adapted from Apache and Wikipedia
Apache Hadoop
• Apache Hadoop
is a framework for running applications on large clusters built of
commodity hardware.
implements a computational paradigm named Map/Reduce, in which the
application is divided into many small fragments of work, each of
which may be executed or re-executed on any node in the cluster.
provides a distributed file system (HDFS) that stores data on the
compute nodes, providing very high aggregate bandwidth across the
cluster.
2
Core technologies
• HDFS: Hadoop Distributed File System
is a distributed file system that stores data across Hadoop clusters,
is highly fault-tolerant,
is designed to support very large files,
is designed to be deployed on low-cost hardware,
is designed for batch processing rather than interactive use by
users,
has a master/slave architecture.
• MapReduce
is a software framework for easily writing applications that process
vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
A MapReduce program reads input data from disk, maps a function
across the data, reduces the results of the map, and stores the
reduction results on disk.
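The read, map, reduce, store sequence can be sketched in plain Python. This is only an in-memory illustration of the paradigm, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, the job done between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster each call to the map function could run on a different node, which is why a failed fragment can simply be re-executed elsewhere.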
3
Core technologies: HDFS, MapReduce, YARN, Spark
• YARN
is the resource manager.
is a cluster management technology.
is a large-scale, distributed operating system for big data applications.
is the architectural center of Hadoop that allows multiple data
processing engines such as interactive SQL, real-time streaming, data
science and batch processing to handle data stored in a single
platform.
• Spark
is an open source cluster computing framework, originally developed
at the University of California, Berkeley.
was developed in response to limitations in the MapReduce cluster
computing paradigm, which forces a particular linear dataflow
structure on distributed programs.
4
Apache YARN
5
HDFS Architecture
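As a rough illustration of the master/slave layout, the sketch below models a master (the NameNode) that only records which worker (DataNode) holds each block of a file. The 128 MB block size and three replicas are assumed values, `place_blocks` is a hypothetical helper, and the round-robin placement is a simplification; real HDFS placement also considers racks and node load.

```python
import math
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed number of copies kept per block


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE,
                 replication=REPLICATION):
    """Toy NameNode: return {block_index: [datanode, ...]} for one file.

    The master stores only this metadata; the block contents would live
    on the listed DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    ring = cycle(datanodes)  # simple round-robin over the workers
    return {block: [next(ring) for _ in range(replication)]
            for block in range(n_blocks)}


if __name__ == "__main__":
    layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
    for block, nodes in layout.items():
        print(block, nodes)
```

Because every block is stored on several workers, losing one machine does not lose data, which is the basis of the fault tolerance claimed for HDFS.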
6
Database and data management
• Cassandra
is a distributed database management system designed to handle large
amounts of data across many commodity servers,
providing high availability with no single point of failure.
• HBase
is an open source, non-relational, distributed database modeled after
Google's Bigtable and written in Java.
is developed as part of Hadoop and runs on top of HDFS, providing
Bigtable-like capabilities for Hadoop.
• MongoDB
is a free and open-source cross-platform document-oriented database
program.
uses JSON-like documents with optional schemas.
• Hive
is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis.
7
Serialization
• Avro
is a remote procedure call and data serialization framework developed
within Apache's Hadoop project.
uses JSON for defining data types and protocols, and serializes data in a
compact binary format.
provides both a serialization format for persistent data and a wire format
for communication between Hadoop nodes and from client programs to
the Hadoop services.
• JSON
is an open-standard format that uses human-readable text to transmit data
objects consisting of attribute–value pairs.
is a widely used data format for asynchronous browser/server
communication (Ajax), where it has largely replaced XML.
• Parquet
is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data
model, or programming language.
is similar to the other columnar storage file formats available in Hadoop,
namely RCFile and Optimized RCFile (ORC).
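To make the Avro points concrete, the sketch below declares a record schema in JSON and writes one record in a compact binary form. It is a hand-rolled illustration of the idea, not the Avro library: the `User` schema, `zigzag_varint` and `encode` are assumptions made for the example, and only the `string`, `int` and `long` types are handled.

```python
import json

# Schema declared in JSON; the record and field names are made up for the example.
SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
""")


def zigzag_varint(n):
    """Encode a signed integer as zig-zag, variable-length bytes."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode(schema, record):
    """Write field values in schema order; field names are not stored."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            data = value.encode("utf-8")
            out += zigzag_varint(len(data)) + data  # length prefix, then bytes
        elif field["type"] in ("int", "long"):
            out += zigzag_varint(value)
        else:
            raise TypeError("type not covered by this sketch: " + field["type"])
    return bytes(out)


if __name__ == "__main__":
    user = {"name": "Ann", "age": 30}
    print(len(json.dumps(user)), "bytes as JSON text")
    print(len(encode(SCHEMA, user)), "bytes in the binary form")
```

The binary form is smaller because the schema, not the data, carries the field names and types, so a reader needs the same schema to decode the bytes.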
8
Management and monitoring
 Puppet
is an open-source configuration management tool. It runs on many
Unix-like systems as well as on Microsoft Windows, and includes its
own declarative language to describe system configuration.
 Chef
is a configuration management tool written in Ruby and Erlang.
uses a pure-Ruby, domain-specific language (DSL) for writing system
configuration "recipes".
is used to streamline the task of configuring and maintaining a
company's servers, and can integrate with cloud-based platforms such
as Internap, Amazon EC2, Google Cloud Platform, OpenStack,
SoftLayer, Microsoft Azure and Rackspace to automatically provision
and configure new machines.
9
Management and monitoring
• ZooKeeper
is essentially a distributed hierarchical key-value store, used to
provide a distributed configuration service, synchronization
service, and naming registry for large distributed systems.
supports high availability through redundant services.
stores its data in a hierarchical namespace, much like a file system
or a tree data structure.
is used by companies including Rackspace, Yahoo!, Odnoklassniki,
Reddit, and eBay, as well as by open source enterprise search systems
like Solr.
• Oozie
is a workflow scheduler system to manage Apache Hadoop jobs.
is integrated with the rest of the Hadoop stack, supporting several
types of Hadoop jobs out of the box (such as Java MapReduce,
Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as
system-specific jobs (such as Java programs and shell scripts).
is a scalable, reliable, and extensible system.
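The hierarchical key-value store described for ZooKeeper can be pictured with a small in-memory tree. This is a single-process sketch with hypothetical names (`ZNodeTree`, `create`, `get`, `children`); it is not the ZooKeeper client API and leaves out replication, watches and sessions.

```python
class ZNodeTree:
    """Toy hierarchical key-value store addressed by slash-separated paths."""

    def __init__(self):
        self._nodes = {"/": b""}  # the root node always exists

    def create(self, path, data=b""):
        """Add a node; like a file system, the parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self._nodes:
            raise KeyError("node already exists: " + path)
        if parent not in self._nodes:
            raise KeyError("parent does not exist: " + parent)
        self._nodes[path] = data

    def get(self, path):
        """Return the bytes stored at a node."""
        return self._nodes[path]

    def children(self, path):
        """List the names of the direct children of a node."""
        prefix = path.rstrip("/") + "/"
        names = []
        for candidate in self._nodes:
            if candidate.startswith(prefix):
                rest = candidate[len(prefix):]
                if rest and "/" not in rest:  # skip the node itself and grandchildren
                    names.append(rest)
        return sorted(names)


if __name__ == "__main__":
    tree = ZNodeTree()
    tree.create("/config")
    tree.create("/config/db", b"host=db1")
    print(tree.children("/config"), tree.get("/config/db"))
```

A configuration service built this way lets every machine read the same path, for example `/config/db`, instead of keeping its own copy of the setting.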
10
Analytics
• Pig
is a high-level platform for creating programs that run on Apache
Hadoop.
can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache
Spark.
abstracts the programming from the Java MapReduce idiom into a
notation which makes MapReduce programming high level, similar to
that of SQL for RDBMSs.
• Mahout
produces free implementations of distributed or otherwise scalable
machine learning algorithms focused primarily on collaborative
filtering, clustering, and classification.
has shifted its focus to building a backend-independent programming
environment, code-named "Samsara".
supports Apache Spark, H2O, and Apache Flink as algebraic platforms.
11
Analytics
• MLlib
is Apache Spark's scalable machine learning library.
is a distributed machine learning framework on top of Spark Core.
Many common machine learning and statistical algorithms are
implemented and shipped with MLlib, which simplifies large-scale
machine learning pipelines, including:
summary statistics, correlations, stratified sampling, hypothesis testing,
random data generation
classification and regression: support vector machines, logistic
regression, linear regression, decision trees, naive Bayes classification
collaborative filtering techniques including alternating least squares
(ALS)
cluster analysis methods including k-means and latent Dirichlet
allocation (LDA)
dimensionality reduction techniques such as singular value
decomposition (SVD) and principal component analysis (PCA)
feature extraction and transformation functions
optimization algorithms such as stochastic gradient descent and
limited-memory BFGS (L-BFGS)
12
Data transfer
• Sqoop
is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases.
• Flume
is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data.
has a simple and flexible architecture based on streaming data flows.
is robust and fault tolerant, with tunable reliability mechanisms and many failover
and recovery mechanisms.
uses a simple extensible data model that allows for online analytic applications.
• DistCp
is a tool used for large inter/intra-cluster copying.
uses MapReduce to effect its distribution, error handling and recovery, and
reporting.
• Storm
is a distributed stream processing computation framework.
uses custom-created "spouts" and "bolts" to define information sources and
manipulations to allow batch, distributed processing of streaming data.
A Storm application is designed as a "topology" in the shape of a directed acyclic
graph (DAG) with spouts and bolts acting as the graph vertices.
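The spout and bolt vocabulary maps naturally onto chained Python generators. The sketch below is a single-process analogy with made-up names (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`), not Storm's API: one spout feeds two bolts in a straight line, which is the simplest possible DAG.

```python
def sentence_spout():
    """Spout: the source of the stream (a fixed list stands in for a live feed)."""
    for sentence in ["storm processes streams", "storm uses spouts and bolts"]:
        yield sentence


def split_bolt(stream):
    """Bolt: turn each incoming sentence into a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after every word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def run_topology():
    """Wire spout -> split -> count and return the last count seen per word."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        final[word] = count
    return final


if __name__ == "__main__":
    print(run_topology())
```

Each stage handles one item at a time as it arrives, so results are updated continuously rather than after the whole input has been read.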
13
Security, access control and auditing
• Sentry
is a system for enforcing fine-grained, role-based authorization to data
and metadata stored on a Hadoop cluster.
• Kerberos
is a computer network authentication protocol that works on the basis
of "tickets" to allow nodes communicating over a non-secure network
to prove their identity to one another in a secure manner.
is aimed primarily at a client–server model and provides mutual
authentication: both the user and the server verify each other's
identity.
14
Cloud computing and virtualization
• Serengeti
is an open-source project that aims to let the Hadoop data-processing
platform run on VMware's vSphere hypervisor.
• Docker
is an open-source project that automates the deployment of Linux
applications inside software containers.
provides an additional layer of abstraction and automation of
operating-system-level virtualization on Linux.
uses the resource isolation features of the Linux kernel, such as cgroups
and kernel namespaces, and a union-capable file system such as aufs
to allow independent "containers" to run within a single Linux
instance, avoiding the overhead of starting and maintaining virtual
machines.
• Whirr
is a set of libraries for running cloud services.
provides:
A cloud-neutral way to run services. You don't have to worry about the
idiosyncrasies of each provider.
A common service API. The details of provisioning are particular to the
service.
Smart defaults for services. You can get a properly configured system running
quickly, while still being able to override settings as needed.
15

Call Girls In Dwarka 9654467111 Escorts Service
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 

In15orlesss hadoop

  • 1. Hadoop in 15 slides or less. Alex Pongpech, 2016. Adapted from Apache and wiki.
  • 2. Apache Hadoop
      • is a framework for running applications on large clusters built of commodity hardware.
      • implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
      • provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
  • 3. Core technologies
      • HDFS (Hadoop Distributed File System)
          • is a distributed file system that provides access to data across Hadoop clusters.
          • is highly fault-tolerant.
          • is designed to support very large files.
          • is designed to be deployed on low-cost hardware.
          • is designed for batch processing rather than interactive use.
          • has a master/slave architecture.
      • MapReduce
          • is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
          • reads input data from disk, maps a function across the data, reduces the results of the map, and stores the reduction results on disk.
  • 4. Core technologies: HDFS, MapReduce, YARN, Spark
      • YARN
          • is the resource manager.
          • is a cluster management technology.
          • is a large-scale, distributed operating system for big data applications.
          • is the architectural center of Hadoop: it allows multiple data processing engines (interactive SQL, real-time streaming, data science, and batch processing) to handle data stored in a single platform.
      • Spark
          • is an open-source cluster computing framework, originally developed at the University of California, Berkeley.
          • was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs.
  • 7. Database and data management
      • Cassandra
          • is a distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
      • HBase
          • is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java.
          • is developed as part of Hadoop and runs on top of HDFS, providing BigTable-like capabilities for Hadoop.
      • MongoDB
          • is a free and open-source, cross-platform, document-oriented database program.
          • uses JSON-like documents with schemas.
      • Hive
          • is a data warehouse infrastructure built on top of Hadoop for data summarization, query, and analysis.
  • 8. Serialization
      • Avro
          • is a remote procedure call and data serialization framework developed within Apache's Hadoop project.
          • uses JSON for defining data types and protocols, and serializes data in a compact binary format.
          • provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
      • JSON
          • is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.
          • is the most common data format used for asynchronous browser/server communication, largely replacing XML, which is used by Ajax.
      • Parquet
          • is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
          • is similar to the other columnar storage file formats available in Hadoop, namely RCFile and Optimized RCFile.
  • 9. Management and monitoring
      • Puppet
          • is an open-source configuration management tool.
          • runs on many Unix-like systems as well as on Microsoft Windows.
          • includes its own declarative language to describe system configuration.
      • Chef
          • is a configuration management tool written in Ruby and Erlang.
          • uses a pure-Ruby, domain-specific language (DSL) for writing system configuration "recipes".
          • is used to streamline the task of configuring and maintaining a company's servers.
          • can integrate with cloud-based platforms such as Internap, Amazon EC2, Google Cloud Platform, OpenStack, SoftLayer, Microsoft Azure, and Rackspace to automatically provision and configure new machines.
  • 10. Management and monitoring
      • ZooKeeper
          • is essentially a distributed hierarchical key-value store, used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
          • supports high availability through redundant services.
          • stores its data in a hierarchical name space, much like a file system or a tree data structure.
          • is used by companies including Rackspace, Yahoo!, Odnoklassniki, Reddit, and eBay, as well as by open-source enterprise search systems like Solr.
      • Oozie
          • is a workflow scheduler system to manage Apache Hadoop jobs.
          • is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
          • is a scalable, reliable, and extensible system.
  • 11. Analytics
      • Pig
          • is a high-level platform for creating programs that run on Apache Hadoop.
          • can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
          • abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for RDBMSs.
      • Mahout
          • produces free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on collaborative filtering, clustering, and classification.
          • has shifted its focus to building a backend-independent programming environment, code-named "Samsara".
          • supports Apache Spark, H2O, and Apache Flink as algebraic platforms.
  • 12. Analytics
      • MLlib
          • is Apache Spark's scalable machine learning library.
          • is a distributed machine learning framework on top of Spark Core.
          • ships with many common machine learning and statistical algorithms, which simplifies large-scale machine learning pipelines, including:
              • summary statistics, correlations, stratified sampling, hypothesis testing, random data generation [16]
              • classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
              • collaborative filtering techniques, including alternating least squares (ALS)
              • cluster analysis methods, including k-means and Latent Dirichlet Allocation (LDA)
              • dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
              • feature extraction and transformation functions
              • optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
  • 13. Data transfer
      • Sqoop
          • is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
      • Flume
          • is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
          • has a simple and flexible architecture based on streaming data flows.
          • is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
          • uses a simple, extensible data model that allows for online analytic applications.
      • Distcp
          • is a tool used for large inter/intra-cluster copying.
          • uses MapReduce to effect its distribution, error handling and recovery, and reporting.
      • Storm
          • is a distributed stream processing computation framework.
          • uses custom-created "spouts" and "bolts" to define information sources and manipulations, allowing batch, distributed processing of streaming data.
          • structures an application as a "topology" in the shape of a directed acyclic graph (DAG), with spouts and bolts acting as the graph vertices.
  • 14. Security, access control and auditing
      • Sentry
          • is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.
      • Kerberos
          • is a computer network authentication protocol that works on the basis of "tickets" to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
          • is aimed primarily at a client–server model and provides mutual authentication: both the user and the server verify each other's identity.
  • 15. Cloud computing and virtualization
      • Serengeti
          • is an open-source project that aims to let the Hadoop data-processing platform run on VMware's vSphere hypervisor.
      • Docker
          • is an open-source project that automates the deployment of Linux applications inside software containers.
          • provides an additional layer of abstraction and automation of operating-system-level virtualization on Linux.
          • uses the resource isolation features of the Linux kernel, such as cgroups and kernel namespaces, and a union-capable file system such as aufs and others [7], to allow independent "containers" to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines.
      • Whirr
          • is a set of libraries for running cloud services. It provides:
              • A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider.
              • A common service API. The details of provisioning are particular to the service.
              • Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.
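
The MapReduce flow described on slide 3 (map a function across the input, then reduce the mapped results) can be illustrated with a small single-process word count. This is only a sketch in plain Python, not Hadoop's Java API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical, the input is an in-memory list rather than files on HDFS, and nothing here is distributed or fault-tolerant.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key so each key can be reduced on its own."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["hadoop stores data", "hadoop processes data"]
    print(run_job(sample))
```

Because each line is mapped independently and each key is reduced independently, both steps could in principle be split across nodes and re-executed on failure, which is the property slide 2 points to.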

Editor's Notes

  1. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. The file system is designed to be highly fault-tolerant: it facilitates the rapid transfer of data between compute nodes and enables Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail. A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Spark was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. [3]
  2. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. The file system is designed to be highly fault-tolerant: it facilitates the rapid transfer of data between compute nodes and enables Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail. A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Spark was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. [3]
  3. Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
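
Notes 1 and 2 contrast MapReduce's disk-to-disk dataflow with Spark's in-memory working set. The toy below tries to make that difference concrete by counting storage accesses for a multi-step job. It is a sketch under stated assumptions, not Spark or Hadoop code: `CountingDisk`, `disk_per_step`, and `in_memory_working_set` are hypothetical names, the "disk" is a Python dictionary, and the per-step work (doubling each value) is arbitrary.

```python
from typing import Dict, List


class CountingDisk:
    """Hypothetical stand-in for disk storage that counts reads and writes."""

    def __init__(self, data: List[int]) -> None:
        self._files: Dict[str, List[int]] = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, name: str) -> List[int]:
        self.reads += 1
        return list(self._files[name])

    def write(self, name: str, values: List[int]) -> None:
        self.writes += 1
        self._files[name] = list(values)


def disk_per_step(disk: CountingDisk, steps: int) -> List[int]:
    """MapReduce-style chain: every step reads its input from disk and
    writes its output back to disk before the next step starts."""
    name = "input"
    for i in range(steps):
        values = disk.read(name)
        doubled = [v * 2 for v in values]
        name = f"step-{i}"
        disk.write(name, doubled)
    return disk.read(name)


def in_memory_working_set(disk: CountingDisk, steps: int) -> List[int]:
    """Working-set style: load the input once, then keep the intermediate
    results in memory across all steps."""
    working = disk.read("input")
    for _ in range(steps):
        working = [v * 2 for v in working]
    return working


if __name__ == "__main__":
    a = CountingDisk([1, 2, 3])
    print(disk_per_step(a, 3), a.reads, a.writes)

    b = CountingDisk([1, 2, 3])
    print(in_memory_working_set(b, 3), b.reads, b.writes)
```

Both functions compute the same result; only the number of storage round trips differs, and that gap grows with the number of steps.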