2. Apache Hadoop
Apache Hadoop
is a framework for running applications on large clusters built of
commodity hardware.
implements a computational paradigm named Map/Reduce, in which the
application is divided into many small fragments of work, each of
which may be executed or re-executed on any node in the cluster.
provides a distributed file system (HDFS) that stores data on the
compute nodes, providing very high aggregate bandwidth across the
cluster.
3. Core technologies
HDFS: Hadoop Distributed File System
is a distributed file system that stores data across Hadoop clusters,
is highly fault-tolerant,
is designed to support very large files,
is designed to be deployed on low-cost hardware,
is designed more for batch processing than for interactive use by users,
uses a master/slave architecture.
MapReduce
MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-
parallel on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner
MapReduce jobs read input data from disk, map a function across the data,
reduce the results of the map, and store the reduction results on disk.
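To make the dataflow concrete, here is a minimal word-count sketch for Hadoop Streaming, which lets the map and reduce functions be ordinary Python scripts reading stdin and writing stdout; the file names mapper.py and reducer.py and the input data are illustrative. The two scripts would be passed to the hadoop-streaming JAR as the -mapper and -reducer programs.

    # mapper.py -- emits one (word, 1) pair per input word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- Hadoop sorts mapper output by key, so all counts for a
    # given word arrive consecutively and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))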
4. Core technologies: HDFS, MapReduce, YARN, Spark
YARN
is Hadoop's resource manager and cluster management technology,
functions as a large-scale, distributed operating system for big data applications,
is the architectural center of Hadoop that allows multiple data processing
engines such as interactive SQL, real-time streaming, data science and batch
processing to handle data stored in a single platform, unlocking an entirely
new approach to analytics.
Spark
is an open-source cluster computing framework, originally developed at the
University of California, Berkeley's AMPLab.
was developed in response to limitations in the MapReduce cluster computing
paradigm, which forces a particular linear dataflow structure on distributed
programs: read input from disk, map a function across the data, reduce the
results of the map, and store the reduction results back on disk.
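As an illustration of the contrast, here is a minimal PySpark sketch (a local Spark installation is assumed, and the input path and application name are illustrative); the cached RDD is reused in memory by the second action instead of being re-read from disk.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount-sketch")
    counts = (sc.textFile("hdfs:///tmp/input.txt")   # illustrative input location
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b)
                .cache())                            # keep results in memory

    print(counts.count())      # first action triggers the computation
    print(counts.take(5))      # second action reuses the cached RDD
    sc.stop()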
7. Database and data management
Cassandra
is a distributed database management system designed to handle large
amounts of data across many commodity servers,
providing high availability with no single point of failure.
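As a sketch of how a client interacts with such a cluster, the snippet below uses the DataStax cassandra-driver package; the contact point, keyspace and table names are illustrative.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # any reachable node can serve as a contact point
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
    session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
    print(session.execute("SELECT name FROM demo.users WHERE id = 1").one())
    cluster.shutdown()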
HBase
is an open source, non-relational, distributed database modeled after
Google's BigTable and is written in Java.
is developed as part of Hadoop and runs on top of HDFS providing
BigTable-like capabilities for Hadoop.
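A sketch of basic reads and writes using the happybase Python client, which talks to HBase through its Thrift gateway; the host, table and column-family names are illustrative.

    import happybase

    connection = happybase.Connection("localhost")           # HBase Thrift server
    connection.create_table("pages", {"content": dict()})    # one column family
    table = connection.table("pages")
    table.put(b"row-1", {b"content:html": b"<html>...</html>"})
    print(table.row(b"row-1"))                                # -> {b'content:html': b'<html>...</html>'}
    connection.close()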
MongoDB
is a free and open-source cross-platform document-oriented database
program.
uses JSON-like documents with schemas.
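A minimal sketch using the pymongo driver; the connection string, database and collection names are illustrative.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.demo.events          # database "demo", collection "events"
    events.insert_one({"user": "alice", "action": "login", "count": 1})   # JSON-like document
    print(events.find_one({"user": "alice"}))
    client.close()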
Hive
is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, query, and analysis
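A sketch of issuing a SQL-like query against HiveServer2 with the PyHive client; the host, port, table and query are illustrative and assume the table already exists.

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)
    cursor = conn.cursor()
    cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
    for page, hits in cursor.fetchall():     # Hive compiles the query into batch jobs
        print(page, hits)
    conn.close()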
8. Serialization
Avro
is a remote procedure call and data serialization framework developed
within Apache's Hadoop project.
uses JSON for defining data types and protocols, and serializes data in a
compact binary format.
provides both a serialization format for persistent data and a wire format
for communication between Hadoop nodes, and from client programs to
the Hadoop services.
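A sketch of writing and reading an Avro file with the fastavro package; the schema and records are illustrative. The JSON schema is stored alongside the binary data, so readers do not need it out of band.

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age",  "type": "int"},
        ],
    })

    with open("users.avro", "wb") as out:
        writer(out, schema, [{"name": "alice", "age": 30}])   # compact binary encoding

    with open("users.avro", "rb") as inp:
        for record in reader(inp):
            print(record)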
JSON
is an open-standard format that uses human-readable text to transmit data
objects consisting of attribute–value pairs.
is the most common data format used for asynchronous browser/server
communication, largely replacing XML (which is used by AJAX).
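For example, an object of attribute-value pairs round-trips through JSON text with Python's standard json module:

    import json

    payload = {"user": "alice", "roles": ["admin", "analyst"], "active": True}
    text = json.dumps(payload)          # human-readable text representation
    print(json.loads(text)["roles"])    # -> ['admin', 'analyst']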
Parquet
is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data
model or programming language.
is similar to the other columnar storage file formats available in Hadoop,
namely RCFile and ORC (Optimized Row Columnar).
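A sketch of writing and reading a Parquet file with pyarrow; the column names are illustrative. Because the format is columnar, a single column can be read back without scanning the others.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"page": ["/home", "/about"], "hits": [120, 45]})
    pq.write_table(table, "hits.parquet")

    print(pq.read_table("hits.parquet", columns=["hits"]).to_pydict())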
9. Management and monitoring
Puppet
is an open-source configuration management tool. It runs on many
Unix-like systems as well as on Microsoft Windows, and includes its
own declarative language to describe system configuration.
Chef
is a configuration management tool written in Ruby and Erlang.
uses a pure-Ruby, domain-specific language (DSL) for writing system
configuration "recipes".
is used to streamline the task of configuring and maintaining a
company's servers, and can integrate with cloud-based platforms such
as Internap, Amazon EC2, Google Cloud Platform, OpenStack,
SoftLayer, Microsoft Azure and Rackspace to automatically provision
and configure new machines.
10. Management and monitoring
Zookeeper
is essentially a distributed hierarchical key-value store, which is used
to provide a distributed configuration service, synchronization
service, and naming registry for large distributed systems.
supports high availability through redundant services.
stores its data in a hierarchical name space, much like a file system
or a tree data structure.
is used by companies including Rackspace, Yahoo!, Odnoklassniki,
Reddit and eBay, as well as by open-source enterprise search systems like
Solr.
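A sketch of the hierarchical key-value model using the kazoo Python client; the connection string, znode path and payload are illustrative.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()
    zk.ensure_path("/app/config")              # paths form a file-system-like tree
    zk.set("/app/config", b"max_workers=8")    # values are small byte strings
    data, stat = zk.get("/app/config")
    print(data, stat.version)
    zk.stop()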
Oozie
is a workflow scheduler system to manage Apache Hadoop jobs.
is integrated with the rest of the Hadoop stack supporting several
types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as
system specific jobs (such as Java programs and shell scripts).
is a scalable, reliable and extensible system.
11. Analytics
Pig
is a high-level platform for creating programs that run on Apache
Hadoop.
can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache
Spark.
abstracts the programming from the Java MapReduce idiom into a
notation which makes MapReduce programming high level, similar to
that of SQL for RDBMSs.
Mahout
produces free implementations of distributed or otherwise scalable
machine learning algorithms focused primarily on the areas of
collaborative filtering, clustering and classification.
has shifted its focus to building a backend-independent programming
environment, code-named "Samsara".
supported algebraic platforms are Apache Spark, H2O and Apache Flink.
12. Analytics
MLlib
is Apache Spark's scalable machine learning library.
is a distributed machine learning framework on top of Spark Core
Many common machine learning and statistical algorithms have been
implemented and are shipped with MLlib, which simplifies large-scale
machine learning pipelines, including:
summary statistics, correlations, stratified sampling, hypothesis testing,
random data generation[16]
classification and regression: support vector machines, logistic
regression, linear regression, decision trees, naive Bayes classification
collaborative filtering techniques including alternating least squares
(ALS)
cluster analysis methods including k-means, and Latent Dirichlet
Allocation (LDA)
dimensionality reduction techniques such as singular value
decomposition (SVD), and principal component analysis (PCA)
feature extraction and transformation functions
optimization algorithms such as stochastic gradient descent, limited-
memory BFGS (L-BFGS)
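As a small illustration of the RDD-based API, the sketch below clusters a handful of 2-D points with MLlib's k-means; the sample data and parameters are illustrative.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "mllib-kmeans-sketch")
    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.2],      # cluster near the origin
        [9.0, 9.0], [9.2, 8.8],      # cluster far away
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    print(model.predict([0.05, 0.1]))   # index of the nearest center
    sc.stop()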
13. Data transfer
Sqoop
is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases.
Flume
is a distributed, reliable, and available service for efficiently collecting, aggregating,
and moving large amounts of log data.
has a simple and flexible architecture based on streaming data flows.
is robust and fault tolerant with tunable reliability mechanisms and many failover
and recovery mechanisms.
uses a simple extensible data model that allows for online analytic application.
DistCp
is a tool used for large inter/intra-cluster copying.
uses MapReduce to effect its distribution, error handling and recovery, and
reporting.
Storm
is a distributed stream processing computation framework
uses custom created "spouts" and "bolts" to define information sources and
manipulations to allow batch, distributed processing of streaming data.
A Storm application is designed as a "topology" in the shape of a directed acyclic
graph (DAG) with spouts and bolts acting as the graph vertices.
14. Security,
access control
and auditing
Sentry
is a system for enforcing fine-grained, role-based authorization to data
and metadata stored on a Hadoop cluster.
Kerberos
is a computer network authentication protocol that works on the basis
of 'tickets' to allow nodes communicating over a non-secure network
to prove their identity to one another in a secure manner.
Its designers aimed it primarily at a client–server model, and it provides
mutual authentication: both the user and the server verify each other's
identity.
15. Cloud computing and virtualization
Serengeti
is an open-source project from VMware that aims to let the Hadoop data-
processing platform run on the company's vSphere hypervisor.
Docker
is an open-source project that automates the deployment of Linux
applications inside software containers.
provides an additional layer of abstraction and automation of operating-
system-level virtualization on Linux.
uses the resource isolation features of the Linux kernel such as cgroups
and kernel namespaces, and a union-capable file system such as aufs and
others[7] to allow independent "containers" to run within a single Linux
instance, avoiding the overhead of starting and maintaining virtual
machines.
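A sketch using the Docker SDK for Python (docker-py); it assumes a local Docker daemon and uses the public alpine image for illustration.

    import docker

    client = docker.from_env()
    # run() starts an isolated container that shares the host kernel,
    # rather than booting a full virtual machine.
    output = client.containers.run("alpine", ["echo", "hello from a container"], remove=True)
    print(output.decode())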
Whirr
is a set of libraries for running cloud services.
provides:
A cloud-neutral way to run services. You don't have to worry about the
idiosyncrasies of each provider.
A common service API. The details of provisioning are particular to the
service.
Smart defaults for services. You can get a properly configured system running
quickly, while still being able to override settings as needed.
Editor's Notes
HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. The file system is designed to be highly fault-tolerant, facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail.
A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Spark was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.[3]
Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.