CDH (Cloudera Hadoop Distribution)
What is CDH?
• Popular distribution of Apache Hadoop and related projects
• Delivers scalable storage and distributed computing
• Apache-licensed open source
Big Data
•Collection of large data sets that cannot be processed using traditional computing techniques
•Big Data challenges
•Storage
•Capturing data
•Analyzing data
Hadoop
•What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
•Hadoop components
•HDFS – Storage layer
•MapReduce – Processing layer
•YARN – Resource management layer
Hadoop ecosystem
HDFS
•Stores different types of large data sets (structured, semi-structured, and unstructured)
•HDFS creates a level of abstraction over the underlying storage resources, so the whole file system can be viewed as a single unit
•Stores data across various nodes and maintains metadata (a log file) about the stored data
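Day-to-day interaction usually goes through the `hdfs dfs` command line, which presents the whole cluster as a single file system. A minimal sketch driving the CLI from Python; all paths are placeholders:

```python
# Minimal sketch: basic HDFS operations via the hdfs CLI.
# The /user/demo paths and local file name are placeholders.
import subprocess

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)
```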
HDFS (cont.)
MapReduce
•Core processing component of the Hadoop ecosystem
•Two functions
•Map
•Reduce
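The two functions can be written in any language via Hadoop Streaming, which pipes records through executables on stdin/stdout. A minimal word-count sketch in Python (input/output paths and the streaming-jar location vary by installation):

```python
#!/usr/bin/env python3
# mapper.py -- Map: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reduce: sum the counts per word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.strip().rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The pair would be submitted with the hadoop-streaming JAR shipped with the distribution, e.g. `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out`.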
Kafka
• Distributed publish-subscribe messaging system and robust queue that can handle high volumes of data and pass messages from one endpoint to another
• Built on top of the ZooKeeper synchronization service
• Integrates well with Apache Storm and Spark for real-time streaming data analysis
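As a rough illustration of the publish-subscribe model, here is a minimal sketch using the third-party kafka-python client; the broker address and the topic name `events` are assumptions:

```python
# Minimal sketch: publish one message, then consume it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"hello from the producer")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest available message
)
for message in consumer:
    print(message.value)
    break  # read a single message for the demo
```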
Sqoop
• Sqoop − SQL to Hadoop and Hadoop to SQL
• Tool designed to transfer data between Hadoop and relational database servers
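A hedged sketch of a typical import, driving the sqoop CLI from Python; the JDBC URL, credentials, table, and HDFS paths are all placeholders:

```python
# Sketch: import a relational table into HDFS with Sqoop.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",   # avoids a password on the command line
    "--table", "orders",
    "--target-dir", "/user/etl/orders",         # HDFS destination
    "--num-mappers", "4",                       # parallel import tasks
], check=True)
```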
YARN
• Allocates cluster resources and schedules all processing activities
• YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing
• Consists of
• ResourceManager
• NodeManager
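To see YARN in action, one can submit any job and watch the ResourceManager schedule it. A sketch assuming the stock MapReduce examples JAR (its exact path and file name vary by installation):

```python
# Sketch: run the bundled wordcount example on YARN, then list applications.
import subprocess

subprocess.run([
    "hadoop", "jar",
    "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar",  # assumed path
    "wordcount", "/user/demo/input", "/user/demo/output",
], check=True)

# The YARN CLI shows the application the ResourceManager scheduled:
subprocess.run(["yarn", "application", "-list"], check=True)
```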
Hive
•Data warehouse software
•Provides data summarization, query, and analysis
•Query language – Hive Query Language (HQL)
•Uses a metastore to store metadata about the data
•Familiar built-in and user-defined functions
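A minimal sketch of querying Hive from Python with the third-party PyHive client; the HiveServer2 host, port, and the `sales` table are assumptions:

```python
# Sketch: run an HQL query against HiveServer2.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000)
cursor = conn.cursor()
# HQL looks like SQL; Hive compiles it into jobs on the cluster.
cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
for row in cursor.fetchall():
    print(row)
```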
Pig
• Write complex MapReduce transformations using a simple scripting language called Pig Latin
• Pig translates Pig Latin scripts into MapReduce jobs
• Makes Hadoop data accessible for a variety of batch-processing workloads
• Data preparation
• ETL
• Data mining
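The classic word count shows how compact Pig Latin is (the speaker notes estimate one line of Pig Latin per roughly 100 lines of MapReduce). A sketch that writes the script and runs it in local mode; file names are placeholders:

```python
# Sketch: word count in Pig Latin, executed with the pig CLI in local mode.
import subprocess

script = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""
with open("wordcount.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```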
Impala
• Interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS)
• Parallel-processing SQL query engine for huge volumes of data
• Provides a unified platform for real-time queries
• Typically faster than Apache Hive for interactive queries
• Impala is memory intensive and does not run effectively for heavy data operations like joins
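A minimal sketch using the third-party impyla client; the daemon host and the `web_logs` table are assumptions (21050 is the port Impala daemons conventionally expose to HiveServer2-protocol clients):

```python
# Sketch: run an interactive SQL query against an Impala daemon.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM web_logs WHERE status = 404")
print(cursor.fetchone())
```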
HBase
• Wide-column store database (NoSQL)
• Database built on top of HDFS
• HBase does not support a structured query language like SQL
• Provides random, real-time access to data in Hadoop
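A hedged sketch using the third-party HappyBase client, which talks to HBase through its Thrift gateway; the host, table, and column-family names are assumptions:

```python
# Sketch: random, row-keyed reads and writes against HBase.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")

# Writes and reads are keyed by row; columns live inside column families.
table.put(b"user42", {b"info:name": b"Ada", b"info:city": b"Colombo"})
row = table.row(b"user42")
print(row[b"info:name"])
```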
Spark
• Open-source, distributed, general-purpose cluster-computing framework with an in-memory data-processing engine
• Runs on Hadoop clusters through YARN and can process data in HDFS
• Fast and general engine for large-scale data processing
• High-level APIs for several programming languages: Java, Python, Scala, R
• Supports SQL queries, streaming data, machine learning (ML), and graph algorithms
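A minimal PySpark word-count sketch; the HDFS path is a placeholder, and the same script runs on a Hadoop cluster when submitted with `spark-submit --master yarn`:

```python
# Sketch: in-memory word count over a file stored in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///user/demo/input.txt")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))
for word, n in counts.take(10):
    print(word, n)

spark.stop()
```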
Mahout
• Used for creating scalable machine-learning algorithms
• Implemented on top of Apache Hadoop using the MapReduce paradigm
• Lets applications analyze large data sets effectively and quickly
• Mahout provides data science tools to automatically find meaningful patterns in big data sets in HDFS
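As one hedged example, classic (MapReduce-era) Mahout ships CLI jobs such as the item-based recommender; the sketch below drives it from Python, with all paths as placeholders and flag names taken from the 0.x documentation:

```python
# Sketch: Mahout's item-based collaborative-filtering job on HDFS data.
import subprocess

subprocess.run([
    "mahout", "recommenditembased",
    "--input", "/user/demo/ratings",          # userID,itemID,rating triples
    "--output", "/user/demo/recommendations",
    "--similarityClassname", "SIMILARITY_COSINE",
], check=True)
```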
Solr
• Open-source platform for searching data stored in HDFS
• Advanced full-text search
• Near-real-time indexing
• Standards-based interfaces such as JSON, XML, and HTTP
• Comprehensive HTML administration interface
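A minimal sketch using the third-party pysolr client; the core URL (`logs`) and the document fields are assumptions:

```python
# Sketch: index a document into Solr, then run a full-text query.
import pysolr

solr = pysolr.Solr("http://solr.example.com:8983/solr/logs", timeout=10)

solr.add([{"id": "doc1", "title": "Hadoop ecosystem overview"}], commit=True)
results = solr.search("title:hadoop")
for doc in results:
    print(doc["id"], doc["title"])
```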
Kudu
• Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data
• Runs on commodity hardware, is horizontally scalable, and supports highly available operation
• Integrates with MapReduce, Spark, and other Hadoop ecosystem components
• Strong performance for running sequential and random workloads simultaneously
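A hedged sketch of the Spark integration, reading a Kudu table through the kudu-spark connector (which must be on the Spark classpath); the master address and table name are assumptions:

```python
# Sketch: load a Kudu table into a Spark DataFrame for fast analytics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-demo").getOrCreate()

df = (spark.read
      .format("org.apache.kudu.spark.kudu")
      .option("kudu.master", "kudu-master.example.com:7051")
      .option("kudu.table", "impala::default.metrics")  # Impala-created table
      .load())
df.show(5)
```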
Sentry
• Granular, role-based authorization module for Hadoop
• Provides the ability to control and enforce precise levels of privilege on data for authenticated users and applications on a Hadoop cluster
• Designed to be a pluggable authorization engine for Hadoop components
• Allows administrators to define authorization rules to validate a user's or application's access requests
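Sentry policy is typically administered with SQL-style role and grant statements issued through Hive or Impala. A hedged sketch via the third-party impyla client; the role, group, and table names are placeholders:

```python
# Sketch: define a Sentry role and grant it table-level read access.
from impala.dbapi import connect

cursor = connect(host="impalad.example.com", port=21050).cursor()
cursor.execute("CREATE ROLE analyst")
cursor.execute("GRANT ROLE analyst TO GROUP analysts")           # OS/LDAP group
cursor.execute("GRANT SELECT ON TABLE sales.orders TO ROLE analyst")
```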
Thank You

Editor's Notes

  • #3 CDH is the only Hadoop solution to offer unified batch processing, interactive SQL, interactive search, and role-based access controls.
  • #7 A fault-tolerant, self-healing distributed file system that turns a cluster of industry-standard servers into a massively scalable pool of storage. Features: scalability, flexibility, reliability.
  • #9 Accessibility, reliability, flexibility; Hadoop is scalable.
  • #10 There are two main challenges: the first is how to collect large volumes of data, and the second is how to analyze the collected data.
  • #14 A front end for parsing SQL statements, generating logical plans, optimizing them, and translating them into physical plans that are executed as MapReduce jobs.
  • #15 One line of Pig Latin is roughly equivalent to 100 lines of MapReduce code.
  • #18 NoSQL database characteristics: fault tolerant (replication), fast (near-real-time lookups), usable (the data model accommodates a wide range of use cases).
  • #19 Apache Spark's Streaming and SQL programming models, together with MLlib and GraphX, make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.
  • #20 Supports collaborative filtering, clustering, classification, and frequent itemset mining.
  • #21 Solr is highly reliable, scalable, and fault tolerant. Hadoop operators put documents into Apache Solr by indexing via XML, JSON, CSV, or binary over HTTP; users can then query those petabytes of data via HTTP GET and receive XML, JSON, CSV, or binary results. Apache Solr is optimized for high-volume web traffic.
  • #23 Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS.