2. Apache Hadoop
Apache Hadoop
is a framework for running applications on large clusters built of
commodity hardware.
implements a computational paradigm named Map/Reduce, in which the
application is divided into many small fragments of work, each of
which may be executed or re-executed on any node in the cluster.
provides a distributed file system (HDFS) that stores data on the
compute nodes, providing very high aggregate bandwidth across the
cluster.
3. Core technologies
HDFS: Hadoop Distributed File System
is a distributed file system that stores data across Hadoop clusters,
is highly fault-tolerant,
is designed to support very large files,
is designed to be deployed on low-cost hardware,
is designed more for batch processing than for interactive use by users,
uses a master/slave architecture.
MapReduce
MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-
parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner
MapReduce jobs read input data from disk, map a function across the data,
reduce the results of the map, and store the reduction results on disk.
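To make the dataflow concrete, here is a minimal word-count sketch for Hadoop Streaming, which lets the map and reduce functions be ordinary Python scripts reading stdin and writing stdout; the file names mapper.py and reducer.py and the input data are illustrative. The two scripts would be passed to the hadoop-streaming JAR as the -mapper and -reducer programs.

    # mapper.py -- emits one (word, 1) pair per input word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- Hadoop sorts mapper output by key, so all counts for a
    # given word arrive consecutively and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))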
4. Core technologies: HDFS, MapReduce, YARN, Spark
YARN
is Hadoop's resource manager and cluster management technology,
functions as a large-scale, distributed operating system for big data applications,
is the architectural center of Hadoop that allows multiple data processing
engines such as interactive SQL, real-time streaming, data science and batch
processing to handle data stored in a single platform, unlocking an entirely
new approach to analytics.
Spark
is an open-source cluster computing framework, originally developed at the
University of California, Berkeley's AMPLab.
was developed in response to limitations in the MapReduce cluster computing
paradigm, which forces a particular linear dataflow structure on distributed
programs: read input from disk, map a function across the data, reduce the
results of the map, and store the reduction results back on disk.
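As an illustration of the contrast, here is a minimal PySpark sketch (a local Spark installation is assumed, and the input path and application name are illustrative); the cached RDD is reused in memory by the second action instead of being re-read from disk.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount-sketch")
    counts = (sc.textFile("hdfs:///tmp/input.txt")   # illustrative input location
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b)
                .cache())                            # keep results in memory

    print(counts.count())      # first action triggers the computation
    print(counts.take(5))      # second action reuses the cached RDD
    sc.stop()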
7. Database and data management
Cassandra
is a distributed database management system designed to handle large
amounts of data across many commodity servers,
providing high availability with no single point of failure.
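As a sketch of how a client interacts with such a cluster, the snippet below uses the DataStax cassandra-driver package; the contact point, keyspace and table names are illustrative.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # any reachable node can serve as a contact point
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
    session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
    print(session.execute("SELECT name FROM demo.users WHERE id = 1").one())
    cluster.shutdown()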
HBase
is an open source, non-relational, distributed database modeled after
Google's BigTable and is written in Java.
is developed as part of Hadoop and runs on top of HDFS providing
BigTable-like capabilities for Hadoop.
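A sketch of basic reads and writes using the happybase Python client, which talks to HBase through its Thrift gateway; the host, table and column-family names are illustrative.

    import happybase

    connection = happybase.Connection("localhost")           # HBase Thrift server
    connection.create_table("pages", {"content": dict()})    # one column family
    table = connection.table("pages")
    table.put(b"row-1", {b"content:html": b"<html>...</html>"})
    print(table.row(b"row-1"))                                # -> {b'content:html': b'<html>...</html>'}
    connection.close()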
MongoDB
is a free and open-source cross-platform document-oriented database
program.
uses JSON-like documents with schemas.
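A minimal sketch using the pymongo driver; the connection string, database and collection names are illustrative.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.demo.events          # database "demo", collection "events"
    events.insert_one({"user": "alice", "action": "login", "count": 1})   # JSON-like document
    print(events.find_one({"user": "alice"}))
    client.close()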
Hive
is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis
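A sketch of issuing a SQL-like query against HiveServer2 with the PyHive client; the host, port, table and query are illustrative and assume the table already exists.

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)
    cursor = conn.cursor()
    cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
    for page, hits in cursor.fetchall():     # Hive compiles the query into batch jobs
        print(page, hits)
    conn.close()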
8. Serialization
Avro
is a remote procedure call and data serialization framework developed
within Apache's Hadoop project.
uses JSON for defining data types and protocols, and serializes data in a
compact binary format.
provides both a serialization format for persistent data and a wire format
for communication between Hadoop nodes, and from client programs to
the Hadoop services.
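A sketch of writing and reading an Avro file with the fastavro package; the schema and records are illustrative. The JSON schema is stored alongside the binary data, so readers do not need it out of band.

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age",  "type": "int"},
        ],
    })

    with open("users.avro", "wb") as out:
        writer(out, schema, [{"name": "alice", "age": 30}])   # compact binary encoding

    with open("users.avro", "rb") as inp:
        for record in reader(inp):
            print(record)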
JSON
is an open-standard format that uses human-readable text to transmit data
objects consisting of attribute–value pairs.
is the most common data format used for asynchronous browser/server
communication, largely replacing XML (which is used by AJAX).
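For example, an object of attribute-value pairs round-trips through JSON text with Python's standard json module:

    import json

    payload = {"user": "alice", "roles": ["admin", "analyst"], "active": True}
    text = json.dumps(payload)          # human-readable text representation
    print(json.loads(text)["roles"])    # -> ['admin', 'analyst']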
Parquet
is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data
model or programming language.
is similar to the other columnar storage file formats available in Hadoop,
namely RCFile and ORC (Optimized Row Columnar).
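A sketch of writing and reading a Parquet file with pyarrow; the column names are illustrative. Because the format is columnar, a single column can be read back without scanning the others.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"page": ["/home", "/about"], "hits": [120, 45]})
    pq.write_table(table, "hits.parquet")

    print(pq.read_table("hits.parquet", columns=["hits"]).to_pydict())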
9. Management and monitoring
Puppet
is an open-source configuration management tool. It runs on many
Unix-like systems as well as on Microsoft Windows, and includes its
own declarative language to describe system configuration.
Chef
is a configuration management tool written in Ruby and Erlang.
uses a pure-Ruby, domain-specific language (DSL) for writing system
configuration "recipes".
is used to streamline the task of configuring and maintaining a
company's servers, and can integrate with cloud-based platforms such
as Internap, Amazon EC2, Google Cloud Platform, OpenStack,
SoftLayer, Microsoft Azure and Rackspace to automatically provision
and configure new machines.
10. Management and monitoring
Zookeeper
is essentially a distributed hierarchical key-value store, which is used
to provide a distributed configuration service, synchronization
service, and naming registry for large distributed systems.
supports high availability through redundant services.
stores its data in a hierarchical name space, much like a file system
or a tree data structure.
is used by companies including Rackspace, Yahoo!, Odnoklassniki,
Reddit and eBay, as well as by open-source enterprise search systems like
Solr.
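A sketch of the hierarchical key-value model using the kazoo Python client; the connection string, znode path and payload are illustrative.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/app/config")              # paths form a file-system-like tree
    zk.set("/app/config", b"max_workers=8")    # values are small byte strings
    data, stat = zk.get("/app/config")
    print(data, stat.version)
    zk.stop()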
Oozie
is a workflow scheduler system to manage Apache Hadoop jobs.
is integrated with the rest of the Hadoop stack supporting several
types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as
system specific jobs (such as Java programs and shell scripts).
is a scalable, reliable and extensible system.
11. Analytics
Pig
is a high-level platform for creating programs that run on Apache
Hadoop.
can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache
Spark.
abstracts the programming from the Java MapReduce idiom into a
notation which makes MapReduce programming high level, similar to
that of SQL for RDBMSs.
Mahout
produces free implementations of distributed or otherwise scalable
machine learning algorithms focused primarily on the areas of
collaborative filtering, clustering and classification.
has shifted its focus to building a backend-independent programming
environment, code-named "Samsara".
supported algebraic platforms are Apache Spark, H2O and Apache Flink.
12. Analytics
MLlib
is Apache Spark's scalable machine learning library.
is a distributed machine learning framework on top of Spark Core
Many common machine learning and statistical algorithms have been
implemented and are shipped with MLlib, which simplifies large-scale
machine learning pipelines, including:
summary statistics, correlations, stratified sampling, hypothesis testing,
random data generation[16]
classification and regression: support vector machines, logistic
regression, linear regression, decision trees, naive Bayes classification
collaborative filtering techniques including alternating least squares
(ALS)
cluster analysis methods including k-means, and Latent Dirichlet
Allocation (LDA)
dimensionality reduction techniques such as singular value
decomposition (SVD), and principal component analysis (PCA)
feature extraction and transformation functions
optimization algorithms such as stochastic gradient descent, limited-
memory BFGS (L-BFGS)
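As a small illustration of the RDD-based API, the sketch below clusters a handful of 2-D points with MLlib's k-means; the sample data and parameters are illustrative.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "mllib-kmeans-sketch")
    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.2],      # cluster near the origin
        [9.0, 9.0], [9.2, 8.8],      # cluster far away
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    print(model.predict([0.05, 0.1]))   # index of the nearest center
    sc.stop()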
13. Data transfer
Sqoop
is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases.
Flume
is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data.
has a simple and flexible architecture based on streaming data flows.
is robust and fault tolerant with tunable reliability mechanisms and many failover
and recovery mechanisms.
uses a simple extensible data model that allows for online analytic application.
DistCp
is a tool used for large inter/intra-cluster copying.
uses MapReduce to effect its distribution, error handling and recovery, and
reporting.
Storm
is a distributed stream processing computation framework
uses custom created "spouts" and "bolts" to define information sources and
manipulations to allow batch, distributed processing of streaming data.
A Storm application is designed as a "topology" in the shape of a directed acyclic
graph (DAG) with spouts and bolts acting as the graph vertices.
14. Security,
access control
and auditing
Sentry
is a system for enforcing fine-grained, role-based authorization to data
and metadata stored on a Hadoop cluster.
Kerberos
is a computer network authentication protocol that works on the basis
of 'tickets' to allow nodes communicating over a non-secure network
to prove their identity to one another in a secure manner.
Its designers aimed it primarily at a client–server model, and it provides
mutual authentication: both the user and the server verify each other's
identity.
15. Cloud computing and virtualization
Serengeti
is an open-source project from VMware that aims to let the Hadoop data-
processing platform run on the company's vSphere hypervisor.
Docker
is an open-source project that automates the deployment of Linux
applications inside software containers.
provides an additional layer of abstraction and automation of operating-
system-level virtualization on Linux.
uses the resource isolation features of the Linux kernel such as cgroups
and kernel namespaces, and a union-capable file system such as aufs and
others[7] to allow independent "containers" to run within a single Linux
instance, avoiding the overhead of starting and maintaining virtual
machines.
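A sketch using the Docker SDK for Python (docker-py); it assumes a local Docker daemon and uses the public alpine image for illustration.

    import docker

    client = docker.from_env()
    # run() starts an isolated container that shares the host kernel,
    # rather than booting a full virtual machine.
    output = client.containers.run("alpine", ["echo", "hello from a container"], remove=True)
    print(output.decode())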
Whirr
is a set of libraries for running cloud services.
provides:
A cloud-neutral way to run services. You don't have to worry about the
idiosyncrasies of each provider.
A common service API. The details of provisioning are particular to the
service.
Smart defaults for services. You can get a properly configured system running
quickly, while still being able to override settings as needed.
Editor's Notes
HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. The file system is designed to be highly fault-tolerant, facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail.
A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Spark was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.[3]
Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.