CDH (Cloudera Hadoop Distribution)
What is CDH?
• Popular distribution of Apache Hadoop and related projects
• Delivers scalable storage and distributed computing
• Apache-licensed open source
Big Data
•Collection of large data sets that cannot be processed using traditional computing techniques
•Big Data challenges
•Storage
•Capturing data
•Analyzing data
Hadoop
•What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
•Hadoop components
•HDFS – Storage layer
•MapReduce – Processing layer
•YARN – Resource management layer
Hadoop ecosystem
HDFS
•Stores different types of large data sets (structured, semi-structured, and unstructured)
•HDFS creates a level of abstraction over the underlying storage resources, so the whole file system can be viewed as a single unit
•Stores data across various nodes and maintains metadata (a log file) about the stored data
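Day-to-day interaction usually goes through the `hdfs dfs` command line, which presents the whole cluster as a single file system. A minimal sketch driving the CLI from Python; all paths are placeholders:

```python
# Minimal sketch: basic HDFS operations via the hdfs CLI.
# The /user/demo paths and local file name are placeholders.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```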
HDFS (cont.)
MapReduce
•Core processing component of the Hadoop ecosystem
•Two functions
•Map
•Reduce
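The two functions can be written in any language via Hadoop Streaming, which pipes records through executables on stdin/stdout. A minimal word-count sketch in Python (input/output paths and the streaming-jar location vary by installation):

```python
#!/usr/bin/env python3
# mapper.py -- Map: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce: sum the counts per word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The pair would be submitted with the hadoop-streaming JAR shipped with the distribution, e.g. `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out`.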
Kafka
• Distributed publish-subscribe messaging system and robust queue that can handle high volumes of data and pass messages from one endpoint to another
• Built on top of the ZooKeeper synchronization service
• Integrates well with Apache Storm and Spark for real-time streaming data analysis
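As a rough illustration of the publish-subscribe model, here is a minimal sketch using the third-party kafka-python client; the broker address and the topic name `events` are assumptions:

```python
# Minimal sketch: publish one message, then consume it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello from the producer")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest available message
)
for message in consumer:
    print(message.value)
    break  # read a single message for the demo
```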
Sqoop
• Sqoop − SQL to Hadoop and Hadoop to SQL
• Tool designed to transfer data between Hadoop and relational database servers
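A hedged sketch of a typical import, driving the sqoop CLI from Python; the JDBC URL, credentials, table, and HDFS paths are all placeholders:

```python
# Sketch: import a relational table into HDFS with Sqoop.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",   # avoids a password on the command line
    "--table", "orders",
    "--target-dir", "/user/etl/orders",         # HDFS destination
    "--num-mappers", "4",                       # parallel import tasks
], check=True)
```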
YARN
• Allocates cluster resources and schedules all processing activities
• YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing
• Consists of
• ResourceManager
• NodeManager
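To see YARN in action, one can submit any job and watch the ResourceManager schedule it. A sketch assuming the stock MapReduce examples JAR (its exact path and file name vary by installation):

```python
# Sketch: run the bundled wordcount example on YARN, then list applications.
import subprocess

subprocess.run([
    "hadoop", "jar",
    "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar",  # assumed path
    "wordcount", "/user/demo/input", "/user/demo/output",
], check=True)

# The YARN CLI shows the application the ResourceManager scheduled:
subprocess.run(["yarn", "application", "-list"], check=True)
```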
Hive
•Data warehouse software
•Provides data summarization, query, and analysis
•Query language – Hive Query Language (HQL)
•Uses a metastore to store metadata about the data
•Familiar built-in and user-defined functions
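A minimal sketch of querying Hive from Python with the third-party PyHive client; the HiveServer2 host, port, and the `sales` table are assumptions:

```python
# Sketch: run an HQL query against HiveServer2.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000)
cursor = conn.cursor()
# HQL looks like SQL; Hive compiles it into jobs on the cluster.
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for row in cursor.fetchall():
    print(row)
```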
Pig
• Write complex MapReduce transformations using a simple scripting language called Pig Latin
• Pig translates Pig Latin scripts into MapReduce jobs
• Makes Hadoop data accessible for a variety of batch-processing workloads
• Data preparation
• ETL
• Data mining
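The classic word count shows how compact Pig Latin is (the speaker notes estimate one line of Pig Latin per roughly 100 lines of MapReduce). A sketch that writes the script and runs it in local mode; file names are placeholders:

```python
# Sketch: word count in Pig Latin, executed with the pig CLI in local mode.
import subprocess

script = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""
with open("wordcount.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```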
Impala
• Interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS)
• Parallel-processing SQL query engine for huge volumes of data
• Provides a unified platform for real-time queries
• Typically faster than Apache Hive for interactive queries
• Impala is memory intensive and does not run effectively for heavy data operations like joins
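A minimal sketch using the third-party impyla client; the daemon host and the `web_logs` table are assumptions (21050 is the port Impala daemons conventionally expose to HiveServer2-protocol clients):

```python
# Sketch: run an interactive SQL query against an Impala daemon.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM web_logs WHERE status = 404")
print(cursor.fetchone())
```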
HBase
• Wide-column store database (NoSQL)
• Database built on top of HDFS
• HBase does not support a structured query language like SQL
• Provides random, real-time access to data in Hadoop
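A hedged sketch using the third-party HappyBase client, which talks to HBase through its Thrift gateway; the host, table, and column-family names are assumptions:

```python
# Sketch: random, row-keyed reads and writes against HBase.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")

# Writes and reads are keyed by row; columns live inside column families.
table.put(b"user42", {b"info:name": b"Ada", b"info:city": b"Colombo"})
row = table.row(b"user42")
print(row[b"info:name"])
```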
Spark
• Open-source, distributed, general-purpose cluster-computing framework with an in-memory data-processing engine
• Runs on Hadoop clusters through YARN and can process data in HDFS
• Fast and general engine for large-scale data processing
• High-level APIs for several programming languages: Java, Python, Scala, R
• Supports SQL queries, streaming data, machine learning (ML), and graph algorithms
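A minimal PySpark word-count sketch; the HDFS path is a placeholder, and the same script runs on a Hadoop cluster when submitted with `spark-submit --master yarn`:

```python
# Sketch: in-memory word count over a file stored in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///user/demo/input.txt")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```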
Mahout
• Used for creating scalable machine-learning algorithms
• Implemented on top of Apache Hadoop using the MapReduce paradigm
• Lets applications analyze large data sets effectively and quickly
• Mahout provides data science tools to automatically find meaningful patterns in big data sets in HDFS
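As one hedged example, classic (MapReduce-era) Mahout ships CLI jobs such as the item-based recommender; the sketch below drives it from Python, with all paths as placeholders and flag names taken from the 0.x documentation:

```python
# Sketch: Mahout's item-based collaborative-filtering job on HDFS data.
import subprocess

subprocess.run([
    "mahout", "recommenditembased",
    "--input", "/user/demo/ratings",          # userID,itemID,rating triples
    "--output", "/user/demo/recommendations",
    "--similarityClassname", "SIMILARITY_COSINE",
], check=True)
```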
Solr
• Open-source platform for searching data stored in HDFS
• Advanced full-text search
• Near-real-time indexing
• Standards-based interfaces such as JSON, XML, and HTTP
• Comprehensive HTML administration interface
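A minimal sketch using the third-party pysolr client; the core URL (`logs`) and the document fields are assumptions:

```python
# Sketch: index a document into Solr, then run a full-text query.
import pysolr

solr = pysolr.Solr("http://solr.example.com:8983/solr/logs", timeout=10)

solr.add([{"id": "doc1", "title": "Hadoop ecosystem overview"}], commit=True)
results = solr.search("title:hadoop")
for doc in results:
    print(doc["id"], doc["title"])
```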
Kudu
• Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data
• Runs on commodity hardware, is horizontally scalable, and supports highly available operation
• Integrates with MapReduce, Spark, and other Hadoop ecosystem components
• Strong performance for running sequential and random workloads simultaneously
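A hedged sketch of the Spark integration, reading a Kudu table through the kudu-spark connector (which must be on the Spark classpath); the master address and table name are assumptions:

```python
# Sketch: load a Kudu table into a Spark DataFrame for fast analytics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-demo").getOrCreate()

df = (spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master.example.com:7051")
      .option("kudu.table", "impala::default.metrics")  # Impala-created table
      .load())
df.show(5)
```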
Sentry
• Granular, role-based authorization module for Hadoop
• Provides the ability to control and enforce precise levels of privilege on data for authenticated users and applications on a Hadoop cluster
• Designed to be a pluggable authorization engine for Hadoop components
• Allows administrators to define authorization rules to validate a user's or application's access requests
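Sentry policy is typically administered with SQL-style role and grant statements issued through Hive or Impala. A hedged sketch via the third-party impyla client; the role, group, and table names are placeholders:

```python
# Sketch: define a Sentry role and grant it table-level read access.
from impala.dbapi import connect

cursor = connect(host="impalad.example.com", port=21050).cursor()
cursor.execute("CREATE ROLE analyst")
cursor.execute("GRANT ROLE analyst TO GROUP analysts")           # OS/LDAP group
cursor.execute("GRANT SELECT ON TABLE sales.orders TO ROLE analyst")
```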
Thank You

Editor's Notes

  • #3 CDH is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access controls.
  • #7 A fault-tolerant, self-healing distributed file system that turns a cluster of industry-standard servers into a massively scalable pool of storage. Features: scalability, flexibility, reliability.
  • #9 Accessibility, reliability, flexibility; Hadoop is scalable.
  • #10 There are two main challenges: the first is how to collect large volumes of data, and the second is how to analyze the collected data.
  • #14 A front end for parsing SQL statements, generating logical plans, optimizing them, and translating them into physical plans that are executed as MapReduce jobs.
  • #15 One line of Pig Latin is roughly equivalent to 100 lines of MapReduce code.
  • #18 NoSQL database characteristics: fault tolerant (replication), fast (near-real-time lookups), usable (the data model accommodates a wide range of use cases).
  • #19 Apache Spark's Streaming and SQL programming models, together with MLlib and GraphX, make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.
  • #20 Supports collaborative filtering, clustering, classification, and frequent itemset mining.
  • #21 Solr is highly reliable, scalable, and fault tolerant. Hadoop operators put documents into Apache Solr by indexing via XML, JSON, CSV, or binary over HTTP; users can then query those petabytes of data via HTTP GET and receive XML, JSON, CSV, or binary results. Apache Solr is optimized for high-volume web traffic.
  • #23 Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS.