Business Intelligence and Data Analytics Revolutionized with Apache Hadoop

How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics

Strata Conference, Sept 22nd 2011, New York, NY

Dr. Amr Awadallah, Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah

Business Intelligence Before Adopting Apache Hadoop

Data stack, top to bottom:
• BI Reports + Interactive Apps
• RDBMS (processed data)
• ETL Compute Grid
• Storage Only Grid (original raw data)
• Collection (mostly append)
• Instrumentation

Problems:
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death

Copyright © 2011, Cloudera, Inc. All Rights Reserved. 2

Business Intelligence After Adopting Apache Hadoop
Data stack, top to bottom:
• BI Reports + Interactive Apps
• RDBMS
• Hadoop: Storage + Compute Grid
• Collection (mostly append)
• Instrumentation

What Hadoop adds:
• Data Exploration & Advanced Analytics
• ETL and Aggregations, Complex Data Processing
• Keep Data Alive For Ever


So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and
processing (open source under the Apache license)

• Core Hadoop has two main components:
• Hadoop Distributed File System: self-healing high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing

• Key business values:
• Flexible – Store any data, Run any analysis (Mine First, Govern Later)
• Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
• Affordable – Cost per TB at a fraction of traditional options
• Open Source – No Lock-In, Rich Ecosystem, Large developer community
• Broadly adopted – A large and active ecosystem, Proven to run at scale
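
The MapReduce component named above can be illustrated with a small in-memory sketch of its three stages (map, shuffle, reduce). This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the word-count job are assumptions chosen for illustration, and a real cluster distributes each stage across many nodes.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three stages into one job."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical word-count job.
def count_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(key, values):
    return sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
word_counts = run_job(lines, count_words, sum_counts)
```

Because the mapper and reducer only see one record or one key at a time, the framework is free to run them on whichever node holds the data.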


The Main Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• Explicit load operation has to take place which transforms data
  to database internal structure
• New columns must be added explicitly before data for such
  columns can be loaded into the database
• Read is Fast
• Benefits: Standards/Governance

Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no special transformation is
  needed
• A SerDe (Serializer/Deserializer) is applied during read time to extract
  the required columns
• New data can start flowing anytime and will appear retroactively
  once the SerDe is updated to parse them
• Load is Fast
• Benefits: Flexibility/Agility
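
The schema-on-read idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the JSON-lines format, the field names, and the read_with_schema helper are hypothetical and stand in for a real SerDe. Raw lines are stored untouched, and the column list is applied only at read time.

```python
import json

# Raw events are kept exactly as they arrived (one JSON object per line).
# The second event carries a "referrer" field that the first one lacks.
RAW_LINES = [
    '{"user": "u1", "url": "/home", "ms": 120}',
    '{"user": "u2", "url": "/cart", "ms": 340, "referrer": "ad"}',
]


def read_with_schema(raw_lines, columns):
    """Deserialize each raw line at read time and project the requested columns.

    Fields missing from a record come back as None instead of failing the read.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# Original read schema: two columns.
rows_v1 = list(read_with_schema(RAW_LINES, ["user", "url"]))

# Updated read schema: the new column appears retroactively, with no reload.
rows_v2 = list(read_with_schema(RAW_LINES, ["user", "url", "referrer"]))
```

Nothing was rewritten on disk between the two reads; only the read-time column list changed, which is the trade of fast loads for slower reads.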


What is Complex Data Processing?
1. Java MapReduce: Gives the most flexibility and performance,
but potentially long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility.
3. Pig: A high-level language out of Yahoo, suitable for batch data
flow workloads.
4. Hive: A SQL interpreter out of Facebook that also includes a
metastore mapping files to their schemas and associated SerDes.
5. Oozie: A PDL XML workflow server engine that enables creating
a workflow of jobs composed of any of the above.
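
As a rough sketch of the Streaming style in item 2, a mapper and a reducer are plain programs that read lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. The functions below follow that contract but run locally; the simulate helper is a hypothetical stand-in for the sort that the framework performs between the two stages, and a real job would read standard input and print to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def simulate(lines):
    """Local stand-in for the pipeline: mapper, then sort, then reducer."""
    return list(reducer(sorted(mapper(lines))))


counts = simulate(["Pig Hive", "hive oozie"])
```

Because the contract is only lines in and lines out, the same two functions could be written in any language, which is the flexibility the list item refers to.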


What This Means For You: Agility

Up-Front Design vs. Just in Time


What This Means For You: Innovation

Data Committee vs. Data Scientist


What This Means For You: Consolidation

Silos vs. Sharing


What This Means For You: Extract Value from Latent Data

Archive to Tape vs. Keep Data Alive


What This Means For You: Ability to Grow Fluidly
Benefit #2: Scalability


What This Means For You: Data Beats Algorithm

Smarter Algos vs. More Data


Where Does Hadoop Fit in the Enterprise Data Stack?
[Figure: Hadoop in the enterprise data stack]
• Users: Data Scientists, Analysts, Business Users, System Operators,
  Data Architects, Customers
• Development Tools: IDEs
• Business Intelligence Tools: BI, Analytics; Enterprise Reporting
• Cloudera Mgmt Suite
• ETL Tools
• Enterprise Data Warehouse
• Low-Latency Serving Systems
• Web Application
• Data sources: Logs, Files, Web Data, Relational Databases


Use The Right Tool For The Right Job

Relational Databases, use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance

Hadoop, use when:
• Structured or Not (Agility)
• Scalability of Storage/Compute
• Complex Data Processing

Two Core Use Cases Common Across Many Industries

ADVANCED ANALYTICS         Industry          DATA PROCESSING
Social Network Analysis    Web               Clickstream Sessionization
Content Optimization       Media             Clickstream Sessionization
Network Analytics          Telco             Mediation
Loyalty & Promotions       Retail            Data Factory
Fraud Analysis             Financial         Trade Reconciliation
Entity Analysis            Federal           SIGINT
Sequencing Analysis        Bioinformatics    Genome Mapping
Product Quality            Manufacturing     Mfg Process Tracking
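
Clickstream sessionization, which appears above as a data processing use case, can be sketched as grouping each user's clicks and cutting a new session after a period of inactivity. The 30-minute gap, the (user, timestamp) event shape, and the sessionize function are assumptions for illustration; the slides do not specify them.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) click events into per-user sessions.

    A new session starts whenever the time since the user's previous
    click exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])  # gap exceeded: open a new session
            else:
                user_sessions[-1].append(ts)  # still within the same session
        sessions[user] = user_sessions
    return sessions


# Timestamps are seconds since an arbitrary start; input order is unsorted.
clicks = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 5000), ("u2", 200)]
result = sessionize(clicks)
```

In a MapReduce setting the grouping by user would be the shuffle key and the gap logic would run in the reducer, so each user's history is processed independently.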


CDH: Cloudera’s Distribution Including Apache Hadoop
• UI Framework: HUE
• SDK: HUE SDK
• Workflow: OOZIE
• Scheduling: OOZIE
• Metadata: HIVE
• Languages / Compilers: PIG, HIVE
• Data Integration: FLUME, SQOOP, ODBC
• Fast Read/Write Access: HBASE
• Coordination: ZOOKEEPER

• Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
• Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
• Integrated – All required component versions & dependencies are managed for you
• Industry Standard – Existing RDBMS, ETL and BI systems work best with it
• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc.


SCM Express: Simplifies Installation and Configuration

Service & Configuration Manager (SCM) Express takes the complexity out of
deploying and configuring CDH.

• Provision a complete Hadoop stack in minutes
• Centrally manage system services through a user-friendly interface
• Manages services for up to 50 nodes
• FREE to download

KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online

What is Cloudera Enterprise?

Cloudera Enterprise makes open source Apache Hadoop enterprise-easy

• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs

3 of the top 5 telecommunications, mobile services, defense & intelligence,
banking, media and retail organizations depend on Cloudera Enterprise

• EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
• EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production


Hadoop World 2011

The largest gathering of Hadoop practitioners, developers,
business executives, industry luminaries and innovative
companies in the Hadoop ecosystem.

• 1400 attendees, 25+ sponsors
• 60 sessions across 5 tracks for:
– Business Decision Makers
– Enterprise Architects
– IT Operators
– Data Scientists
– Developers
• Cloudera Training and Certification (November 7, 10, 11)

November 8-9
Sheraton New York Hotel & Towers, NYC
Learn more and register at www.hadoopworld.com
$50 discount for Strata attendees


What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
• Agility/Flexibility (Enables Innovation/Exploration).
• Complex Data Processing (Any Language, Any Problem).
• Scalability of Storage/Compute (Freedom to Grow).
• Economical Active Archive (Keep All Your Data Alive).

• Cloudera Enterprise enables:
• Lower the Cost of Management and Administration.
• Simplify and Accelerate Hadoop Deployment.
• Increase the Transparency & Control of Hadoop.
• Firm SLAs on Issue Resolution.

Contact Information:

Amr Awadallah
aaa@cloudera.com
650-644-3921
http://twitter.com/awadallah


Appendix


Hadoop Timeline

Timeline axis: 2002 to 2009. Events in approximate chronological order:
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Doug Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4TB of image archives over 100 EC2 instances
• Fastest sort of a TB, 3.5 mins over 910 nodes
• Facebook launches Hive: SQL Support for Hadoop
• Cloudera Founded
• Doug Cutting joins Cloudera
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
• Hadoop Summit 2009, 750 attendees

Cloudera’s Track Record
• Customers: Multiple customers with >1,000 Hadoop nodes under management
• Supporting dozens of diverse production use cases, including revenue-critical ones
with tight SLAs

• Community: years of demonstrated leadership in the Apache Hadoop ecosystem.
Cloudera employees are:
• The largest contributor to the Hadoop ecosystem in patches
• Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache
Hadoop itself
• The first to build & integrate what is now the reference Hadoop stack

• Industry: Multiple years of experience providing Hadoop solutions across industries:
• 2 of the top 5 payments companies run Cloudera
• 3 of the top 5 commercial banks run Cloudera
• 2 of the top 4 online travel companies run Cloudera


Cloudera Enterprise Management Suite

Activity Monitor
• It Helps You: Consolidate all user activities into a real-time view;
  Diagnose user performance; Track activity metrics
• So You Can: Improve performance; Improve conformance to SLAs; Improve QOS
• It’s Like: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server

Service & Configuration Manager
• It Helps You: Manage system services; Automate changes; Validate settings;
  1-click security
• So You Can: Lower cost of administration; Improve uptime
• It’s Like: Red Hat Satellite Server; Microsoft System Center;
  Oracle Enterprise Manager

Resource Manager
• It Helps You: Report on the usage of scarce resources;
  Plan for capacity expansion
• So You Can: Improve quality of service; Extend the life of the cluster
• It’s Like: VMware vCenter

Authorization Manager
• It Helps You: Centralize management of all users, groups and privileges;
  Manage permissions via delegated administration
• So You Can: Lower the costs of administration; Improve compliance
• It’s Like: Teradata security administration

CDH Integrates with Existing IT Infrastructure

BI/Analytics ETL Databases Cloud/OS Hardware

