Cloudera Certified Developer
for Apache Hadoop (CCDH)
Who We Are
How We Do It
We deliver relevant products and services.
• A distribution of Apache Hadoop that is tested, certified and supported
• Comprehensive support and professional service offerings
• A suite of management software for Hadoop operations
• Training and certification programs for developers, administrators, managers and data scientists
Technical Team
Unmatched knowledge and experience.
• Founders, committers and contributors to Hadoop
• A wealth of experience in the design and delivery of production software
Credentials
The Apache Hadoop experts.
• Number 1 distribution of Apache Hadoop in the world
• Largest contributor to the open source Hadoop ecosystem
• More committers on staff than any other company
• More than 100 customers across a wide variety of industries
• Strong growth in revenue and new accounts
Mission: To help organizations profit from their data
Leadership
Strong executive team with proven abilities.
Mike Olson (CEO)
Kirk Dunn (COO)
Charles Zedlewski (VP, Product)
Mary Rorabaugh (CFO)
Jeff Hammerbacher (Chief Scientist)
Amr Awadallah (VP Engineering)
Doug Cutting (Chief Architect)
Omer Trajman (VP, Customer Solutions)
Users of Cloudera
Financial, Web, Retail & Consumer, Media, Telecom
What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is…
• Scalable
• Fault tolerant
• Open source
CORE HADOOP COMPONENTS
Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
MapReduce: Distributed Computing Across Physical Servers
Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema
Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
What Makes Hadoop Different?
• Ability to scale out to petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data, rather than shipping the data to where it will be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically
Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in 2011…
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data]
Source: IDC 2011
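The doubling claim above lends itself to a quick back-of-the-envelope projection. The sketch below assumes smooth exponential growth from the 2011 figure. The function name `projected_gigabytes` is illustrative, and its outputs are extrapolations, not IDC numbers.

```python
BASE_YEAR = 2011
BASE_GIGABYTES = 1.8e12     # 1.8 trillion gigabytes created in 2011 (from the slide)
DOUBLING_PERIOD_YEARS = 2   # "quantity doubles every 2 years"


def projected_gigabytes(year: int) -> float:
    """Extrapolate yearly data creation, assuming steady doubling every 2 years."""
    doublings = (year - BASE_YEAR) / DOUBLING_PERIOD_YEARS
    return BASE_GIGABYTES * 2 ** doublings
```

Under that assumption, two doublings separate 2011 from 2015, so the projected figure for 2015 is four times the 2011 figure.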
Hadoop Use Cases
Industry | Use Case (ADVANCED ANALYTICS) | Use Case (DATA PROCESSING)
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Clickstream Sessionization
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT
Bioinformatics | Sequencing Analysis | Genome Mapping
Hadoop in the Enterprise
[Diagram: Hadoop in the enterprise]
Data sources: Logs, Files, Web Data, Relational Databases
Connected systems: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Web Application, Management Tools
Users: Operators, Engineers, Analysts, Business Users, Customers
What is CDH?
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
• 100% Apache open source
• Contains all components needed for deployment
• Fully documented and supported
• Released on a reliable schedule
Fastest Path to Success
• No need to write your own scripts or do integration testing on different components
• Works with a wide range of operating systems, hardware, databases and data warehouses
Stable and Reliable
• Extensive Cloudera QA systems, software & processes
• Tested & run in production at scale
• Proven at scale in dozens of enterprise environments
Community Driven
• Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
• FREE
Cloudera’s Commitment to the Open Source Community
Component | Cloudera Committers | Cloudera Founder | 2011 Commits
Common | 6 | Yes | #1
HDFS | 6 | Yes | #2
MapReduce | 5 | Yes | #1
HBase | 2 | No | #2
Zookeeper | 1 | Yes | #2
Oozie | 1 | Yes | #1
Pig | 0 | No | #3
Hive | 1 | No | #2
Sqoop | 2 | Yes | #1
Flume | 3 | Yes | #1
Hue | 3 | Yes | #1
Snappy | 2 | No | #1
Bigtop | 8 | Yes | #1
Avro | 4 | Yes | #1
Whirr | 2 | Yes | #1
Components of CDH
Coordination: APACHE ZOOKEEPER
Data Integration: APACHE FLUME, APACHE SQOOP
Fast Read/Write Access: APACHE HBASE
Languages / Compilers: APACHE PIG, APACHE HIVE
Workflow, Scheduling: APACHE OOZIE
File System Mount: FUSE-DFS
User Interface: HUE
Cloudera Enterprise
Hadoop Distributed File System (HDFS)
Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB
[Diagram: a file split into five numbered blocks, with copies of each block spread across the DataNodes]
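To make the block size and replication factor concrete, here is a small sketch of the arithmetic for a single file. The two constants come from the slide; the function name `hdfs_footprint` and the dictionary keys are illustrative assumptions.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb: float) -> dict:
    """Return block count, replica count and raw storage for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION_FACTOR,
        # The last block is not padded, so raw usage is size x replication.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }
```

For example, a 1024MB file splits into 16 blocks, is stored as 48 block replicas, and occupies 3072MB of raw disk across the cluster.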
Components of Hadoop
• NameNode – Holds all metadata for HDFS
– Needs to be a highly reliable machine
• RAID drives – typically RAID 10
• Dual power supplies
• Dual network cards – Bonded
– The more memory the better – typically 36GB to 64GB
• Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
Components of Hadoop
• DataNodes – Hardware will depend on the specific needs
of the cluster
– No RAID needed, JBOD (just a bunch of disks) is used
– Typical ratio is:
• 1 hard drive
• 2 cores
• 4GB of RAM
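The ratio above can be scaled to a whole node. The sketch below is only a sizing aid under the slide's 1 drive : 2 cores : 4GB rule of thumb. `DataNodeSpec` and `spec_for_drives` are hypothetical names, and real sizing depends on the workload.

```python
from dataclasses import dataclass


@dataclass
class DataNodeSpec:
    drives: int
    cores: int
    ram_gb: int


def spec_for_drives(drives: int) -> DataNodeSpec:
    """Scale the 1 drive : 2 cores : 4GB RAM ratio to a node with `drives` disks."""
    if drives < 1:
        raise ValueError("a DataNode needs at least one drive")
    return DataNodeSpec(drives=drives, cores=2 * drives, ram_gb=4 * drives)
```

A 12-drive JBOD node would, by this ratio, pair with 24 cores and 48GB of RAM.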
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful about oversubscribing the backplane of the switch!
Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: the Map Task emits (key 1, values), (key 2, values) and (key 3, values); the Shuffle Phase gathers the intermediate values for key 1; the Reduce Task outputs the final (key, values)]
Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
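The map, shuffle and reduce steps described on these two slides can be imitated in a few lines of plain Python. This is a single-process word-count sketch for illustration only, not the Hadoop API. The names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(filename: str, line: str) -> Iterator[Tuple[str, int]]:
    """map(): emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: gather all intermediate values for each output key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """reduce(): combine the list of values for one key into a final value."""
    return key, sum(values)


def run_job(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Feed (filename, line) records through map, shuffle and reduce."""
    intermediate = [pair for name, line in records for pair in map_fn(name, line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
```

In a real cluster the map and reduce calls run on many machines, and the framework performs the shuffle between them. The data flow is the same.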
MapReduce Execution
18
Sqoop
SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
• Developed at Cloudera
Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
• Suited for gathering logs from multiple systems
• Inserting them into HDFS as they are generated
Design goals
• Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera
Flume: high-level architecture
[Diagram: Agents feed Processors, which feed Collector(s); a MASTER sends configuration to all Agents; decorators shown: compress, encrypt, batch]
• Configurable levels of reliability: guarantee delivery in event of failure
• Deployable, centrally administered
• Flexibly deploy decorators at any step to improve performance, reliability or security
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as
• Master sends configuration to all Agents
HBase
Column-family store. Based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  • (key, value) lookup
  • Limited transactions (only one row)
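The constrained access model above can be pictured with a toy in-memory table. This sketch is not the HBase client API. `ToyColumnFamilyTable` and its methods are hypothetical, and a single lock stands in for per-row atomicity.

```python
import threading
from typing import Dict, Optional


class ToyColumnFamilyTable:
    """In-memory stand-in for a column-family table (not the HBase API)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, bytes]] = {}
        self._lock = threading.Lock()

    def put_row(self, row_key: str, cells: Dict[str, bytes]) -> None:
        """Apply all cell updates for ONE row atomically; no multi-row variant exists."""
        with self._lock:
            self._rows.setdefault(row_key, {}).update(cells)

    def get(self, row_key: str, column: str) -> Optional[bytes]:
        """(key, value) lookup: fetch a single 'family:qualifier' cell by row key."""
        with self._lock:
            return self._rows.get(row_key, {}).get(column)
```

Several columns of one row change together in `put_row`, but nothing in the interface lets a caller update two rows as one unit, which mirrors the "only one row" limit.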
HBase
23
Hive
SQL-based data warehousing application
• Language is SQL-like
  • Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
  • Partition columns, Sampling, Buckets
• Example:
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
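For readers new to SQL joins, the example query above pairs up the rows of the two tables aliased `s` and `k` that share a word, then keeps the pairs whose `s` frequency is at least 5. The sketch below shows that logic on in-memory data. It assumes each word appears once per table, and `join_word_freqs` is an illustrative name rather than anything Hive provides.

```python
from typing import Dict, List, Tuple


def join_word_freqs(
    s: Dict[str, int], k: Dict[str, int], min_s_freq: int = 5
) -> List[Tuple[str, int, int]]:
    """Inner join on word, keeping rows where the s-side frequency >= min_s_freq."""
    return [
        (word, s_freq, k[word])
        for word, s_freq in s.items()
        if word in k and s_freq >= min_s_freq
    ]
```

Hive would compile the equivalent query into MapReduce jobs so that the join runs across the cluster rather than in one process.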
Pig
Data-flow oriented language – “Pig Latin”
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data, allows easy integration of Java for complex tasks
• Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
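The same load, filter, order and store flow can be written as ordinary Python for comparison. This is a single-machine sketch, assuming tab-separated input (the default delimiter for Pig's text storage). The function names are illustrative.

```python
import csv
from typing import List, Tuple

Row = Tuple[str, str, int]


def load(path: str) -> List[Row]:
    """LOAD: read tab-separated (id, name, salary) records."""
    with open(path, newline="") as handle:
        return [(r[0], r[1], int(r[2])) for r in csv.reader(handle, delimiter="\t")]


def rich_people(emps: List[Row], threshold: int = 100000) -> List[Row]:
    """FILTER by salary > threshold, then ORDER by salary descending."""
    rich = [row for row in emps if row[2] > threshold]
    return sorted(rich, key=lambda row: row[2], reverse=True)


def store(rows: List[Row], path: str) -> None:
    """STORE: write the rows back out as tab-separated text."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

Each Pig statement names an intermediate relation, much as each function call here returns an intermediate list. Pig then plans the whole flow as MapReduce jobs.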
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
Zookeeper
Zookeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
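As an illustration of the first item, a common election recipe has every candidate register a node with an increasing sequence number, and the lowest live number leads. The toy class below imitates that idea in memory only. It is not the ZooKeeper client API, and `ToyElection` is a hypothetical name.

```python
from typing import Dict, Optional


class ToyElection:
    """In-memory imitation of election via sequential nodes (not the ZooKeeper API)."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._candidates: Dict[int, str] = {}  # sequence number -> candidate name

    def join(self, name: str) -> int:
        """Register a candidate and return its monotonically increasing sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = name
        return seq

    def leave(self, seq: int) -> None:
        """Simulate a candidate's session ending, so its node disappears."""
        self._candidates.pop(seq, None)

    def leader(self) -> Optional[str]:
        """The candidate holding the lowest live sequence number leads."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]
```

When the current leader leaves, the next-lowest candidate takes over without any further voting, which is what makes the recipe attractive.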
Pipes and Streaming
Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in any scripting language, including
  • Perl
  • Python
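A Streaming pass in Python typically reads lines and writes tab-separated key/value lines, with the framework sorting by key between the two stages. The sketch below shows a word-count mapper and reducer as plain generator functions. The names are illustrative, and wiring them to standard input and output is left out.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; input must already be sorted by key, as after the shuffle."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally, `sorted(mapper(lines))` stands in for the shuffle, so `reducer(sorted(mapper(lines)))` reproduces the whole pass on a small sample before it is submitted to a cluster.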
FUSE - DFS
Allows mounting of HDFS volumes via the Linux FUSE file system
• Does allow easy integration with other systems for data import/export
• Does not imply HDFS can be used as a general-purpose file system
Hadoop Security
• Authentication is secured by Kerberos v5 and integrated with LDAP
• Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means Jobs can specify who can view logs, counters, configurations and who can modify a job
• Tasks now run as the user who launched the job
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency and Control of Hadoop
• Leverage the Experience of Our Experts
EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
CLOUDERA ENTERPRISE COMPONENTS
Cloudera Manager: End-to-End Management Application for Apache Hadoop
Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
Cloudera Manager
32
The industry’s first
for Apache Hadoop
the
Apache Hadoop stack
Automates the
of Apache Hadoop
DISCOVER DIAGNOSE OPTIMIZE ACT
HDFS MAPREDUCE HBASE
ZOOKEEPER OOZIE HUE
Cloudera Enterprise
Including Cloudera Support
Feature | Benefit
Flexible Support Windows | Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks | Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes | Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase | Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors | Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events | Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success
Class | Description
Developer Training & Certification (4 Days) | Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days) | Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days) | Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days) | Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day) | Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”
Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Service | Description
Use Case Discovery | Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment | Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept | Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot | Deploy your first production-level project using Hadoop
Process and Team Development | Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification | Perform periodic health checks to certify and tune up existing Hadoop clusters
Journey of the Cloudera Customer
Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
Cloudera in Production
[Diagram: Cloudera in production]
Data sources: Logs, Files, Web Data, Relational Databases
Connected systems: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Operational Rules Engines, Management Tools, Web Application
Users: Operators, Engineers, Analysts, Business Users, Customers
Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
Cloudera Enterprise
• Cloudera Management Suite
• Cloudera Support
Cloudera Services
• Consulting Services
• Cloudera University
Cloudera helps you profit from all your data.
cloudera.com
+1 (888) 789-1488
sales@cloudera.com
twitter.com/cloudera
facebook.com/cloudera
Get Hadoop
Cloudera Manager
40
The Hadoop management
application that:
Manages the
Manages and monitors the
Incorporates comprehensive
Has built-in
Cloudera Manager
41
Key and
Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
Set server roles, configure services and manage security across the cluster
Gracefully start, stop and restart services as needed
Maintains a complete record of configuration changes for SOX compliance
Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Gather, view and search Hadoop logs collected from across the cluster
Scans Hadoop logs for irregularities and warns you before they impact the cluster
Key and
Cloudera Manager
Establishes the time context globally for almost all views
Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Generates email alerts when certain events occur
Visualize current and historical disk usage by user, group and directory
Track MapReduce activity on the cluster by job or user
View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
Two Editions: FREE EDITION and ENTERPRISE EDITION**
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
Automated Deployment
Host-Level Monitoring
Secure Communication Between Server & Agents
Configuration Management
Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
Audit Trails
Start/Stop/Restart Services
Add/Restart/Decommission Role Instances
Configuration Versioning & History
Support for Kerberos
Service Monitoring
Proactive Health Checks
Status & Health Summary
Intelligent Log Management
Events Management & Alerts
Activity Monitoring
Operational Reporting
Global Time Control
Support Integration
** Part of the Cloudera Enterprise subscription
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
Proactive Health Checks | Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
Intelligent Log Management | Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
Global Time Control | Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
Support Integration | Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Event Management | Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Alerts | Generates email alerts when certain events occur
Audit Trails | Maintains a complete record of configuration changes for SOX compliance
Operational Reporting | Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
Cloudera Support
Our team of experts on call to help you meet your SLAs
Feature | Benefit
Flexible Support Windows | Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks | Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes | Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase | Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors | Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events | Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.
Why Cloudera Enterprise?
• Apache Hadoop is a distributed system that presents unique operational challenges
• The fixed cost of managing an internal patch and release infrastructure is prohibitive
• Apache Hadoop skills and expertise are scarce
• It’s challenging to track consistently to community development efforts
Only Cloudera Enterprise
• Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
• Has production support backed by the Apache committers
• Has the depth of experience supporting hundreds of production Apache Hadoop clusters
MapReduce: Distributed Processing
54
Thank you.

More Related Content

What's hot

Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSHortonworks
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
 
Discover hdp 2.2 hdfs - final
Discover hdp 2.2   hdfs - finalDiscover hdp 2.2   hdfs - final
Discover hdp 2.2 hdfs - finalHortonworks
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNHortonworks
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenDataWorks Summit
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
 
Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightDataWorks Summit/Hadoop Summit
 
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Cloudera, Inc.
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...DataWorks Summit
 

What's hot (20)

What's new in Ambari
What's new in AmbariWhat's new in Ambari
What's new in Ambari
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Discover hdp 2.2 hdfs - final
Discover hdp 2.2   hdfs - finalDiscover hdp 2.2   hdfs - final
Discover hdp 2.2 hdfs - final
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Combine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARNCombine SAS High-Performance Capabilities with Hadoop YARN
Combine SAS High-Performance Capabilities with Hadoop YARN
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course Workshop
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Build Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsightBuild Big Data Enterprise Solutions Faster on Azure HDInsight
Build Big Data Enterprise Solutions Faster on Azure HDInsight
 
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
 

Similar to CCDH Certification Guide

Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceNeev Technologies
 
Vmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanVmware Serengeti - Based on Infochimps Ironfan
Vmware Serengeti - Based on Infochimps IronfanJim Kaskade
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.OW2
 
Talend Big Data Capabilities Overview
Talend Big Data Capabilities OverviewTalend Big Data Capabilities Overview
Talend Big Data Capabilities OverviewRajan Kanitkar
 
Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Transform Your Business with Big Data and Hortonworks
Transform Your Business with Big Data and Hortonworks Pactera_US
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Dell Lustre Storage Architecture Presentation - MBUG 2016
Dell Lustre Storage Architecture Presentation - MBUG 2016Dell Lustre Storage Architecture Presentation - MBUG 2016
Dell Lustre Storage Architecture Presentation - MBUG 2016Andrew Underwood
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 

CCDH Certification Guide

  • 1. Cloudera Certified Developer for Apache Hadoop (CCDH)
  • 2. Who We Are. How We Do It: We deliver relevant products and services. • A distribution of Apache Hadoop that is tested, certified and supported • Comprehensive support and professional service offerings • A suite of management software for Hadoop operations • Training and certification programs for developers, administrators, managers and data scientists. Technical Team: Unmatched knowledge and experience. • Founders, committers and contributors to Hadoop • A wealth of experience in the design and delivery of production software. Credentials: The Apache Hadoop experts. • Number 1 distribution of Apache Hadoop in the world • Largest contributor to the open source Hadoop ecosystem • More committers on staff than any other company • More than 100 customers across a wide variety of industries • Strong growth in revenue and new accounts. Mission: To help organizations profit from their data. Leadership: Strong executive team with proven abilities. Mike Olson (CEO), Kirk Dunn (COO), Charles Zedlewski (VP, Product), Mary Rorabaugh (CFO), Jeff Hammerbacher (Chief Scientist), Amr Awadalla (VP Engineering), Doug Cutting (Chief Architect), Omer Trajman (VP, Customer Solutions)
  • 3. Users of Cloudera: Financial, Web, Retail & Consumer, Media, Telecom
  • 4. What is Apache Hadoop? CORE HADOOP COMPONENTS: Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers. MapReduce: Distributed Computing Across Physical Servers. Flexibility • A single repository for storing, processing & analyzing any type of data • Not bound by a single schema. Scalability • Scale-out architecture divides workloads across multiple nodes • Flexible file system eliminates ETL bottlenecks. Low Cost • Can be deployed on commodity hardware • Open source platform guards against vendor lock-in. Hadoop is a platform for data storage and processing that is… • Scalable • Fault tolerant • Open source
  • 5. What Makes Hadoop Different? • Ability to scale out to petabytes in size using commodity hardware • Processing (MapReduce) jobs are sent to the data rather than shipping the data to be processed • Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data • Manages fault tolerance and data replication automatically
  • 6. Why the Need for Hadoop? 1.8 trillion gigabytes of data was created in 2011… • More than 90% is unstructured data • Approx. 500 quadrillion files • Quantity doubles every 2 years [Chart: gigabytes of data created (in billions), structured vs. unstructured data, 2005 to 2015, y-axis 0 to 10,000. Source: IDC 2011]
  • 7. Hadoop Use Cases (industry: advanced analytics use case; data processing use case) • Web: Social Network Analysis; Clickstream Sessionization • Media: Content Optimization; Clickstream Sessionization • Telco: Network Analytics; Mediation • Retail: Loyalty & Promotions Analysis; Data Factory • Financial: Fraud Analysis; Trade Reconciliation • Federal: Entity Analysis; SIGINT • Bioinformatics: Sequencing Analysis; Genome Mapping
  • 8. Hadoop in the Enterprise [Diagram labels: Logs, Files, Web Data, Relational Databases; IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Web Application, Management Tools; Operators, Engineers, Analysts, Business Users, Customers]
  • 9. What is CDH? Fastest Path to Success • No need to write your own scripts or do integration testing on different components • Works with a wide range of operating systems, hardware, databases and data warehouses. Stable and Reliable • Extensive Cloudera QA systems, software & processes • Tested & run in production at scale • Proven at scale in dozens of enterprise environments. Community Driven • Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings • FREE. Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is… • 100% Apache open source • Contains all components needed for deployment • Fully documented and supported • Released on a reliable schedule
  • 10. Cloudera’s Commitment to the Open Source Community (component: Cloudera committers; Cloudera founder; rank by 2011 commits) • Common: 6; Yes; #1 • HDFS: 6; Yes; #2 • MapReduce: 5; Yes; #1 • HBase: 2; No; #2 • Zookeeper: 1; Yes; #2 • Oozie: 1; Yes; #1 • Pig: 0; No; #3 • Hive: 1; No; #2 • Sqoop: 2; Yes; #1 • Flume: 3; Yes; #1 • Hue: 3; Yes; #1 • Snappy: 2; No; #1 • Bigtop: 8; Yes; #1 • Avro: 4; Yes; #1 • Whirr: 2; Yes; #1
  • 11. Components of CDH • Coordination: Apache ZooKeeper • Data Integration: Apache Flume, Apache Sqoop • Fast Read/Write Access: Apache HBase • Languages / Compilers: Apache Pig, Apache Hive • Workflow: Apache Oozie • Scheduling: Apache Oozie • File System Mount: FUSE-DFS • User Interface: Hue • Cloudera Enterprise
  • 12. Hadoop Distributed File System (HDFS) • Block Size = 64MB • Replication Factor = 3 • Cost is $400-$500/TB [Diagram: a file split into blocks 1 to 5, with copies of each block spread across several DataNodes]
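
The block size and replication factor on this slide translate directly into block counts and raw disk usage. The following is a minimal Python sketch of that arithmetic, not anything taken from the deck: the function name `hdfs_footprint` and the example file sizes are illustrative assumptions, and it assumes a final partial block occupies only its actual size on disk.

```python
import math

BLOCK_SIZE_MB = 64      # block size quoted on the slide
REPLICATION_FACTOR = 3  # replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    hdfs_footprint is a hypothetical helper for illustration only.
    """
    # A file is cut into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the cluster.
    block_replicas = blocks * replication
    # Assumption: a partial block only consumes its real size on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    print(hdfs_footprint(1024))  # a 1 GB file
```

With these settings a 1 GB file becomes 16 blocks, 48 stored block replicas, and 3 GB of raw disk, which is why the per-terabyte cost figure should be read against usable rather than raw capacity.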
  • 13. Components of Hadoop • NameNode – Holds all metadata for HDFS – Needs to be a highly reliable machine • RAID drives – typically RAID 10 • Dual power supplies • Dual network cards – bonded – The more memory the better – typically 36GB to 64GB • Secondary NameNode – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
  • 14. Components of Hadoop • DataNodes – Hardware will depend on the specific needs of the cluster – No RAID needed; JBOD (just a bunch of disks) is used – Typical ratio is: • 1 hard drive • 2 cores • 4GB of RAM
  • 15. Networking • One of the most important things to consider when setting up a Hadoop cluster • Typically a top-of-rack switch is used with Hadoop, together with a core switch • Be careful not to oversubscribe the backplane of the switch!
  • 16. Map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line). • map() produces one or more intermediate values along with an output key from the input. [Diagram: the Map Task emits (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase gathers the intermediate values for key 1; the Reduce Task produces the final (key, values)]
  • 17. Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key [Diagram: same Map Task, Shuffle Phase and Reduce Task flow as on the Map slide]
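
The map, shuffle and reduce steps described on these two slides can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API: the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are hypothetical, and word counting is chosen here simply as a familiar example.

```python
from collections import defaultdict


def map_fn(filename, line):
    """map(): take one (filename, line) record, emit (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle phase: collect all intermediate values per output key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """reduce(): combine the list of intermediate values for one key."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, shuffle, then reduce each key."""
    intermediate = [pair
                    for filename, line in records
                    for pair in map_fn(filename, line)]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values)
                for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job([("a.txt", "to be or not to be"), ("b.txt", "To do")]))
```

In a real cluster the map calls run in parallel next to the data and the shuffle moves data across the network; here all three phases run in one process so that the data flow is easy to follow.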
  • 19. Sqoop: SQL to Hadoop • Tool to import/export any JDBC-supported database into Hadoop • Transfer data between Hadoop and external databases or EDW • High performance connectors for some RDBMS • Developed at Cloudera
  • 20. Flume: Distributed, reliable, available service for efficiently moving large amounts of data as it is produced • Suited for gathering logs from multiple systems • Inserting them into HDFS as they are generated. Design goals • Reliability, Scalability, Manageability, Extensibility. Developed at Cloudera
  • 21. Flume: high-level architecture [Diagram: Agents, Processors, Collector(s), Master] • Master sends configuration to all Agents • Configurable levels of reliability: guarantee delivery in event of failure • Deployable, centrally administered • Flexibly deploy decorators (compress, encrypt, batch) at any step to improve performance, reliability or security • Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment • Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others) • Parallelized writes across many collectors – as much write throughput as
  • 22. HBase: Column-family store, based on the design of Google BigTable • Provides interactive access to information • Holds extremely large datasets (multi-TB) • Constrained access model • (key, value) lookup • Limited transactions (only one row)
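
To illustrate the constrained access model on this slide, here is a toy in-memory column-family store in Python. It is a sketch of the data model only and is not the HBase client API: the class `ToyColumnFamilyStore` and its method names are hypothetical, and nothing here is distributed or persistent.

```python
class ToyColumnFamilyStore:
    """In-memory toy model of a column-family store (not the HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> {family -> {qualifier -> value}}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        """Write one cell; each write touches exactly one row."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        """(key, value) lookup; returns None when the cell is absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_key, stop_key):
        """Yield rows in sorted key order, start inclusive, stop exclusive."""
        for key in sorted(self.rows):
            if start_key <= key < stop_key:
                yield key, self.rows[key]


if __name__ == "__main__":
    store = ToyColumnFamilyStore(["info"])
    store.put("user1", "info", "name", "Ada")
    print(store.get("user1", "info", "name"))
```

There are no joins or secondary indexes in this model: every read is either a lookup by row key or a scan over a key range, which is the constraint the slide refers to.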
  • 24. Hive: SQL-based data warehousing application • Language is SQL-like • Supports SELECT, JOIN, GROUP BY, etc. • Features for analyzing very large data sets • Partition columns, Sampling, Buckets • Example: SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN [second table] k ON (s.word = k.word) WHERE s.freq >= 5;
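
The example query on this slide joins two word-frequency tables on the word column. As a rough, runnable illustration, the sketch below runs an equivalent query in Python against an in-memory SQLite database rather than Hive. The name of the second table does not survive in the transcript, so `kjv` is a hypothetical stand-in, and the rows and counts are made-up sample data.

```python
import sqlite3


def common_words(min_freq=5):
    """Join two word-frequency tables, keeping words frequent in the first.

    Uses SQLite as a stand-in for Hive; the table name `kjv` and all
    sample rows are assumptions made for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE shakespeare (word TEXT, freq INTEGER);
        CREATE TABLE kjv (word TEXT, freq INTEGER);
        INSERT INTO shakespeare VALUES
            ('love', 40), ('thou', 12), ('sword', 3), ('king', 9);
        INSERT INTO kjv VALUES
            ('love', 30), ('thou', 80), ('sword', 20), ('ark', 7);
    """)
    rows = conn.execute(
        "SELECT s.word, s.freq, k.freq "
        "FROM shakespeare s JOIN kjv k ON (s.word = k.word) "
        "WHERE s.freq >= ? "
        "ORDER BY s.word",
        (min_freq,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in common_words():
        print(row)
```

Only words present in both tables survive the join, and the WHERE clause then drops words that are rare in the first table. In Hive the same statement would be compiled into MapReduce jobs instead of being executed by a local engine.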
  • 25. Pig: Data-flow oriented language – “Pig Latin” • Datatypes include sets, associative arrays, tuples • High-level language for routing data, allows easy integration of Java for complex tasks • Example: emps = LOAD 'people.txt' AS (id, name, salary); rich = FILTER emps BY salary > 100000; srtd = ORDER rich BY salary DESC; STORE srtd INTO 'rich_people.txt';
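
For readers who do not know Pig Latin, the data flow in the slide's example (load, filter, order, store) can be mirrored step by step in plain Python. This is only an illustrative sketch: the function name `rich_people` is hypothetical, the input is assumed to be tab-separated text (Pig's default load format), and the function returns the sorted rows instead of writing an output file.

```python
def rich_people(lines, threshold=100000):
    """Mirror the Pig script: LOAD, FILTER by salary, ORDER descending."""
    # emps = LOAD 'people.txt' AS (id, name, salary);
    emps = []
    for line in lines:
        emp_id, name, salary = line.rstrip("\n").split("\t")
        emps.append((emp_id, name, int(salary)))

    # rich = FILTER emps BY salary > 100000;
    rich = [emp for emp in emps if emp[2] > threshold]

    # srtd = ORDER rich BY salary DESC;
    srtd = sorted(rich, key=lambda emp: emp[2], reverse=True)

    # STORE srtd INTO 'rich_people.txt';  (returned here instead of stored)
    return srtd


if __name__ == "__main__":
    sample = ["1\tAnn\t150000\n", "2\tBob\t90000\n", "3\tCy\t200000\n"]
    print(rich_people(sample))
```

Each Pig statement names an intermediate relation, just as each Python variable does here; the difference is that Pig plans the whole flow and runs it as MapReduce jobs over data in HDFS.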
  • 26. Oozie: Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
  • 27. ZooKeeper: ZooKeeper is a distributed consensus engine • Provides well-defined concurrent access semantics: • Leader election • Service discovery • Distributed locking / mutual exclusion • Message board / mailboxes
  • 28. Pipes and Streaming: Multi-language connector libraries for MapReduce • Write native-code MapReduce in C++ • Write MapReduce passes in any scripting language, including • Perl • Python
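
As a sketch of what a scripting-language MapReduce pass looks like, the Python below follows the Hadoop Streaming convention of exchanging tab-separated `key<TAB>value` text lines, with the reducer receiving its input sorted by key. The functions `mapper` and `reducer` are hypothetical names; they take iterables of lines so they can be tried locally, whereas real Streaming scripts would read `sys.stdin` and print to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input lines must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # groupby relies on equal keys being adjacent, which sorting ensures.
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local stand-in for the framework: map, sort by key, then reduce.
    mapped = mapper(["a b a", "b c"])
    for out_line in reducer(sorted(mapped)):
        print(out_line)
```

The local `sorted(...)` call plays the role of the framework's shuffle and sort. Because the scripts only read and write text lines, the same approach works in Perl or any other language that can process standard input.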
  • 29. FUSE-DFS: Allows mounting of HDFS volumes via the Linux FUSE file system • Does allow easy integration with other systems for data import/export • Does not imply HDFS can be used as a general-purpose file system
  • 30. Hadoop Security • Authentication is secured by Kerberos v5 and integrated with LDAP • Hadoop servers can ensure that users and groups are who they say they are • Job control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job • Tasks now run as the user who launched the job
  • 31. Cloudera Enterprise: Cloudera Enterprise makes open source Hadoop enterprise-easy • Simplify and Accelerate Hadoop Deployment • Reduce Adoption Costs and Risks • Lower the Cost of Administration • Increase the Transparency and Control of Hadoop • Leverage the Experience of Our Experts. EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment. EFFICIENCY: Enabling You to Affordably Run Hadoop in Production. CLOUDERA ENTERPRISE COMPONENTS • Cloudera Manager: End-to-End Management Application for Apache Hadoop • Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
  • 32. Cloudera Manager: The industry’s first for Apache Hadoop the Apache Hadoop stack Automates the of Apache Hadoop DISCOVER DIAGNOSE OPTIMIZE ACT HDFS MAPREDUCE HBASE ZOOKEEPER OOZIE HUE
  • 33. Cloudera Enterprise, Including Cloudera Support (feature: benefit) • Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements • Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment • Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency • Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop • Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics • Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
  • 34. Cloudera University: Public and Private Training to Enable Your Success (class: description) • Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop • System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster • HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices • Analyzing Data with Hive and Pig (2 Days): Designed for people who have a basic understanding of how Apache Hadoop works and want to use these languages to analyze their data • Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”
  • 35. Cloudera Consulting Services: Put Our Expertise To Work For You. Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges. (service: description) • Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization • New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters • Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster • Production Pilot: Deploy your first production-level project using Hadoop • Process and Team Development: Define the requirements and processes for creating a new Hadoop team • Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
  • 36. Journey of the Cloudera Customer • Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data • Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop • Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
  • 37. Cloudera in Production [Diagram labels: Logs, Files, Web Data, Relational Databases; IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Operational Rules Engines, Management Tools, Web Application; Operators, Engineers, Analysts, Business Users, Customers] • Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express • Cloudera Enterprise: Cloudera Management Suite, Cloudera Support • Cloudera Services: Consulting Services, Cloudera University
  • 38. Cloudera helps you profit from all your data. Get Hadoop. cloudera.com, +1 (888) 789-1488, sales@cloudera.com, twitter.com/cloudera, facebook.com/cloudera
  • 39. Cloudera Manager 40 The Hadoop management application that: Manages the Manages and monitors the Incorporates comprehensive Has built-in
  • 40. Cloudera Manager: Key and • Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps. • Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface • Set server roles, configure services and manage security across the cluster • Gracefully start, stop and restart services as needed • Maintains a complete record of configuration changes for SOX compliance • Monitors dozens of service performance metrics and alerts you when you approach critical thresholds • Gather, view and search Hadoop logs collected from across the cluster • Scans Hadoop logs for irregularities and warns you before they impact the cluster [Five of these features carry an ONLY CLOUDERA badge]
  • 41. Cloudera Manager: Key and • Establishes the time context globally for almost all views • Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis • Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution • Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching • Generates email alerts when certain events occur • Visualize current and historical disk usage by user, group and directory • Track MapReduce activity on the cluster by job or user • View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles [Four of these features carry an ONLY CLOUDERA badge]
  • 42. Two Editions: Free Edition and Enterprise Edition** (** Part of the Cloudera Enterprise subscription) • Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition) • Feature rows [Table: per-edition checkmarks not recoverable]: Automated Deployment; Host-Level Monitoring; Secure Communication Between Server & Agents; Configuration Management; Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper; Audit Trails; Start/Stop/Restart Services; Add/Restart/Decommission Role Instances; Configuration Versioning & History; Support for Kerberos; Service Monitoring; Proactive Health Checks; Status & Health Summary; Intelligent Log Management; Events Management & Alerts; Activity Monitoring; Operational Reporting; Global Time Control; Support Integration
  • 43. View Service Health and Performance
  • 45. Monitor and Diagnose Cluster Workloads
  • 46. Gather, View and Search Hadoop Logs
  • 47. Track Events From Across the Cluster
  • 48. Run Reports on System Performance & Usage
  • 49. New in Cloudera Manager 3.7 • Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds • Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster • Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis • Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution • Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching • Alerts: Generates email alerts when certain events occur • Audit Trails: Maintains a complete record of configuration changes for SOX compliance • Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user [Seven of these features carry an ONLY CLOUDERA badge]
  • 50. Cloudera Support: Our on call to help you meet your SLAs (feature: benefit) • Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements • Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment • Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency • Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop • Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy • Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
  • 51. Cloudera Enterprise: Why Cloudera Enterprise? • Apache Hadoop is a distributed system that presents unique operational challenges • The fixed cost of managing an internal patch and release infrastructure is prohibitive • Apache Hadoop skills and expertise are scarce • It’s challenging to track consistently to community development efforts. Only Cloudera Enterprise • Has a management application that supports the full lifecycle of operationalizing Apache Hadoop • Has production support backed by the Apache committers • Has the depth of experience supporting hundreds of production Apache Hadoop clusters. The Fastest Path to Success Running Apache Hadoop in Production.
  • 52. Hadoop Distributed File System • Block Size = 64MB • Replication Factor = 3 • Cost is $400-$500/TB

Editor's Notes

  1. Mission: need talking points here. How we do it: We offer a complete set of products and services to enable our customers: training, services, support and management software. We ARE the experts. We have the #1 Hadoop distribution, the most project founders and committers, and the most customers. We train thousands of people and certify many. Our services team can take you from best practices to cluster certification to fine-tuning your HBase implementation. Our tech team is made up of project founders and committers, and our executive team is broad and deep across open source, web and enterprise companies.
  2. The largest known cluster under management is at FB, with 21PB and 2k nodes.
  3. Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Apache Hadoop brings data that you’re already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together web and application logs, unstructured files, web data and relational data. Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.
  4. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node has eight cores, 16GB of RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
  5. Apache Hadoop: • Gain the flexibility to store and mine all types of data • Leverage the scale-out architecture for complex data analysis • Easily scale to meet growing data requirements • Avoid vendor lock-in with an open source technology. CDH: • The fastest, surest path to success with Apache Hadoop • Stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors • Integrates with your other technology platforms, ensuring investment protection. Cloudera Enterprise: • Simplify and accelerate Apache Hadoop deployment • Reduce adoption costs and risks • More effectively manage cluster resources • Leverage the experience of our experts
  6. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node has eight cores, 16GB of RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
  7. Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is an RDBMS, which executes the queries, and SQL, which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems. Intelligent scheduling algorithms for locality, sharing, and resource optimization.