Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hortonworks – The Hadoop Ecosystem
Fall 2014
Powering the Modern Data Architecture
Shivaji Dutta – Sr. Partner Solutions Engineer
Agenda
Apache Hadoop and Hortonworks Data Platform (HDP)
HDP and Couchbase
What’s new in HDP?
What is Hadoop
Apache Hadoop is an open-source software framework for distributed
storage and distributed processing of very large data sets on computer
clusters built from commodity hardware.
Projects in Hadoop
Hadoop Core
– Hadoop Common
– Hadoop Distributed File System
– Hadoop YARN
– Hadoop MapReduce
Other Hadoop Key Projects
• Hive
• HBase
• Spark
• Pig
• Tez
• ZooKeeper
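To make the MapReduce entry in the list above concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases for a word count. It uses only the Python standard library and calls no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, not part of the original deck.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input; on a cluster each line would come from an HDFS block.
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the map and reduce functions would run in parallel on many nodes, and the shuffle would move data across the network; the sketch only shows the data flow.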
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Diagram: Hortonworks Data Platform 2.2 stack]
Foundation: YARN: Data Operating System (Cluster Resource Management) on HDFS (Hadoop Distributed File System)
Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines), with Tez and Slider layers
Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Deployment Choice: Linux, Windows, On-Premises, Cloud
• YARN is the architectural center of HDP
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
Delivered Completely in the OPEN
HDFS and YARN – The Core of Hadoop
The core components of HDP are YARN and
Hadoop Distributed Filesystem (HDFS).
YARN is the architectural center of Hadoop
that enables you to process data
simultaneously in multiple ways. YARN
provides the resource management and
pluggable architecture for enabling a wide
variety of data access methods.
HDFS provides the scalable, fault-tolerant,
cost-efficient storage for big data.
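As a rough illustration of how HDFS achieves fault tolerance, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults; the round-robin placement, the function names and the node names are assumptions made for this example. Real HDFS uses a rack-aware placement policy that is not modeled here.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default block size
REPLICATION = 3                 # default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica]).

    Toy placement: replicas go to consecutive nodes, round-robin.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, replicas))
    return plan


def surviving_blocks(plan, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [i for i, replicas in plan if any(n != failed_node for n in replicas)]


if __name__ == "__main__":
    datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names
    for block, replicas in plan_blocks(300 * 1024 * 1024, datanodes):
        print(block, replicas)
```

Because every block lives on three different nodes, losing any single node in this model leaves all blocks readable.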
YARN extends Hadoop into data center leaders
YARN
The Architectural
Center of Hadoop
• Common data platform, many applications
• Supports multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools
(e.g. SAS, Syncsort, Actian)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via
ecosystem of new and existing “YARN Ready” solutions
[Diagram: batch, interactive & real-time data access engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark) and other ISV engines, with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
Data Access
YARN provides the foundation for a versatile
range of processing engines that empower
you to interact with the same data in multiple
ways, at the same time.
This means applications can interact with the
data in the best way: from batch to interactive
SQL or low latency access with NoSQL.
Emerging use cases for data science, search
and streaming are also supported with
Apache Spark, Solr and Storm.
Additionally, ecosystem partners provide even
more specialized data access engines for
YARN.
[Diagram: batch, interactive & real-time data access engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark) and other ISV engines, with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
Data Governance and Integration
• HDP extends data access and
management with powerful tools for data
governance and integration.
• They provide a reliable, repeatable, and
simple framework for managing the flow of
data in and out of Hadoop. This control
structure, along with a set of tooling to ease
and automate the application of schema or
metadata on sources, is critical for
successful integration of Hadoop into your
modern data architecture.
• Apache Sqoop
• Apache Oozie
• Apache Falcon
• Apache Flume
Security
• Authentication, Authorization and
Encryption
• Kerberos
• SSL & SASL
• Apache Knox
• Apache Ranger
• HDFS File/Directory Encryption
Operations – Apache Ambari
• Provision, manage and monitor
Hadoop clusters
• A complete set of operational
capabilities that provide both
visibility into the health of your
cluster and tooling to
manage configuration and optimize
performance across all data access
methods.
• Apache Ambari provides APIs to
integrate with existing management
systems, for instance Microsoft
System Center and Teradata
Viewpoint
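The API integration mentioned above goes through Ambari's REST interface, which is served under `/api/v1` and uses HTTP basic authentication. The sketch below only builds a request for the cluster list and does not send it. The host name, port and credentials are placeholders chosen for illustration, and `build_cluster_request` is a hypothetical helper, not part of Ambari.

```python
import base64
import urllib.request


def build_cluster_request(host, user, password, cluster=None, port=8080):
    """Build (but do not send) a GET request against the Ambari REST API."""
    path = "/api/v1/clusters"
    if cluster:
        path += "/" + cluster
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"http://{host}:{port}{path}",
        headers={
            "Authorization": "Basic " + token,
            # Ambari expects this header on modifying calls; harmless on GET.
            "X-Requested-By": "ambari",
        },
        method="GET",
    )


if __name__ == "__main__":
    # Placeholder host and credentials; replace with values for a real server.
    req = build_cluster_request("ambari.example.com", "admin", "admin")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` against a reachable Ambari server would return a JSON document that an external management system could parse.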
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be
an Enterprise Data Platform
with centralized services for:
• Governance: load data and manage it according to policy
• Operations: deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
• Security: provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection
Everything that plugs into
Hadoop inherits these services
[Diagram: Governance, Security and Operations services alongside the batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, other ISV engines), with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2
Component versions by HDP release*

Component         HDP 2.0          HDP 2.1        HDP 2.2
                  (October 2013)   (April 2014)   (October 2014)
Hadoop & YARN     2.2.0            2.4.0          2.6.0
Pig               0.12.0           0.12.1         0.14.0
Hive & HCatalog   0.12.0           0.13.0         0.14.0
HBase             0.96.1           0.98.0         0.98.4
Sqoop             1.4.4            1.4.4          1.4.5
Oozie             3.3.2            4.0.0          4.1.0
Zookeeper         3.4.5            3.4.5          3.4.5
Ambari            1.4.4            1.5.1          1.7.0
Flume             1.3.1            1.4.0          1.5.0
Storm             -                0.9.1          0.9.3
Knox              -                0.4.0          0.5.0
Phoenix           -                4.0.0          4.2
Accumulo          -                1.5.1          1.6.1
Falcon            -                0.5.0          0.6.0
Tez               -                0.4.0          0.5.1
Solr              -                4.7.2          4.10.0
Ranger            -                -              0.4.0
Spark             -                -              1.2.0
Kafka             -                -              0.8.1
Slider            -                -              0.60

Components are grouped under Data Management, Data Access, Governance & Integration, Security and Operations.
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
The Partner Ecosystem
[Diagram: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN) in the data system next to existing systems (RDBMS, EDW, MPP, HANA); sources (Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured); applications (BusinessObjects BI); partner categories: Operational Tools, Dev & Data Tools, Infrastructure]
Deep Partnerships
Hortonworks engages
in deep engineered relationships
with the leaders in the data center,
such as Microsoft, Teradata, Red Hat,
HP, SAS & SAP
Broad Partnerships
Over 900 partners work with us to
certify their applications to work with
Hadoop so they can extend big data
to their users
Couchbase and HDP
• Couchbase is primarily an online operational NoSQL datastore: low latency and
scalable
• It can serve as both a source of data for Hadoop and a sink
• Example source: pulling user profiles into Hadoop for deep analytics
• Example sink: training machine learning models that are then cached in and
served from Couchbase
Couchbase and HDP
• HDP Certified Sqoop connector for batch
mode export / import
• Couchbase Kafka connector enables both
Producer and Consumer scenarios
• Community supported Storm spout to
persist data by writing to Couchbase
Server
• Developer Preview Spark Connector
New!
What’s New in HDP 2.2
New and Improved YARN
Ready Engines
• Enterprise SQL at Hadoop Scale with
Stinger.next
• Enterprise Ready Spark on YARN
• Deep YARN integration for real-time
engines: HBase, Accumulo, Storm
• Enabling ISVs with a general SDK and
API for direct YARN integration
• Only solution to provide real-time to micro
batch for analyzing the internet of things
• Other engines/tools: Solr, Cascading
Continued Innovation of
Central Enterprise Services
• Centralized security administration
and policy enforcement
• Ease of use and operations agility
features to speed cluster deployment
• 100% uptime target with cluster rolling
upgrades
Expanded Deployment Options
• Enhanced business continuity with
replication/archival across on-premises
and cloud storage tiers (Azure Blob, S3)
• Simultaneous ship of Windows and
Linux installs
• Expand Azure support beyond
HDInsight Azure to include HDP for
Windows or Linux in Azure VMs
HDP 2.2
Delivering Apache Hadoop for the Enterprise
Stinger.next: Enterprise SQL at Hadoop Scale
A continuation of momentum built in
Apache Hive Community to deliver
Enterprise SQL at Hadoop scale
HDP Stinger/Hive Goals:
• Speed
Deliver sub-second query response times
• Scale
The only SQL interface to Hadoop
designed for queries that scale from
Gigabytes, to Terabytes and Petabytes
• SQL
Enable transactions and SQL:2011 Analytics
Familiar three phase delivery
Stinger delivered 390,000 lines of code to Apache
Hive in 13 months from 44 companies, 145
developers
HDP 2.2 – Beyond Read Only
• Transactions with ACID, allowing insert, update & delete
• Temporary tables
• Cost Based Optimizer for star & bushy join queries
Phase 2 – Sub Second
• Sub-second queries with LLAP
• Hive-Spark Machine Learning integration
• Operational reporting w/ Hive streaming ingest & transactions
Phase 3 – Rich Analytics
• SQL:2011 Analytics
• Materialized views
• Cross-geo queries
• Workload management via YARN and
LLAP integration
Apache Spark
• Apache Spark is an open source project for fast, large-scale data processing.
– Simple and expressive programming model
– Machine learning, graph computation and streaming
– In-memory compute for iterative workloads
• It does most of its processing in memory
• It supports several programming languages
– Java, Scala and Python
• It provides high-level modules
– MLlib
– GraphX
– Spark Streaming
– Spark SQL
• Cluster managers
– YARN (recommended)
– Mesos
– Spark's own standalone manager
Spark Stack
Enterprise Ready Spark for HDP 2.2 & beyond
HDP 2.2 – Spark on YARN
• Integrated: Hive 0.13 support
• Integrated: Basic ORCfile support
Phase 2 – Spark for HDP 2.2
• Managed: Deployment best practices with YARN Node Labels
• Managed: Ambari Stack Definition:
Install/Start/Stop/Config/Quick links to Spark UI
• Security: Spark certification on Kerberized Cluster
• Security: Authentication in Spark UI against LDAP
Phase 3 - Beyond
• Managed: Enhanced workload management & improved debuggability
• Managed: Spark logs published to YARN Application Timeline
• Security: Wire Encryption and Authorization with XA/Argus
• Enhanced ORC support
Deliver a reliable and managed,
enterprise grade Apache Spark that
will run alongside other workloads in
Hadoop via YARN
HDP Spark Goals:
• Integrated
Enterprise-grade Workload Management
& Optimized multi-tenancy on YARN
• Secure
Extend comprehensive Hadoop security
policy to Spark
• Managed
Provision, manage and monitor Spark
along with other engines in Hadoop
HDP 2.2 Delivers more YARN Ready Engines
Bringing more applications and services to
YARN and making ISV adoption easier
• Complete work for Pig with Tez
• Cascading with Tez for Java and Scala apps
• Integration of Spark on YARN
• Kafka for inbound messaging to Storm & Spark: widest
range from real-time to micro batch for internet of things
Slider introduces native YARN
integration for applications
with long-running services
• HBase, Accumulo, Storm
• SDK for 3rd-party ISVs
[Diagram: HDP 2.2 engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Kafka and other ISV engines, with Tez and Slider layers, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). A marker in the figure indicates engines “new to HDP” in 2.2; all engines have been updated.]
Security in HDP 2.2
HDP 2.2 New Features
• Extend Authorization with Apache Ranger
• Breadth: Knox and Storm integrations
• Policy enforcement at depth:
Hive, HDFS and HBase integrations
• Documentation to support community
development and partner ecosystem
• Apache Hadoop Advances
• TP: Transparent Encryption in HDFS – HDFS-6134
• Key Management Server - HADOOP-10433
• Key Provider API - HADOOP-10141
Continued investment in a
central security policy for
authentication, authorization, audit,
and data protection
HDP Security Goals:
• Comprehensive Security
Meet all security requirements across
authentication, authorization, audit & data
protection for all HDP components.
• Central Administration
Provide central administration of security
policy, and of viewing and managing audit,
across the platform.
• Consistent Integration
Integrate with other security and identity
management systems, for compliance with IT
policies.
Streamlining Operations in HDP 2.2
Apache Ambari 1.7.0 Delivers
• Views
A common, secure, and extensible approach for the
user interface for Operators, System Administrators,
Application Developers, Data Workers and ISVs
• Blueprints
Create and manage cluster templates for easy
deployment
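An Ambari Blueprint is a JSON document that names a stack and lists host groups with the components each group runs. The sketch below assembles such a document as a Python dictionary. The helper `make_blueprint`, the group names and the choice of components are assumptions for illustration; a real blueprint usually also carries configuration sections that are omitted here.

```python
import json


def make_blueprint(stack_version, host_groups):
    """Assemble a blueprint document as a plain dict.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {"stack_name": "HDP", "stack_version": stack_version},
        "host_groups": [
            {
                "name": name,
                "cardinality": str(cardinality),
                "components": [{"name": c} for c in components],
            }
            for name, (cardinality, components) in host_groups.items()
        ],
    }


# Hypothetical two-group layout: one master host and three workers.
blueprint = make_blueprint("2.2", {
    "master": (1, ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"]),
    "worker": (3, ["DATANODE", "NODEMANAGER"]),
})

if __name__ == "__main__":
    print(json.dumps(blueprint, indent=2))
```

The resulting JSON would be registered with Ambari by a POST to `/api/v1/blueprints/<name>`, after which a cluster can be created from it by mapping real hosts to the host groups.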
Apache Ambari is advancing at light
speed to enable the IT operator to
more easily manage clusters
HDP Operations Goals:
• Open
Deliver a complete set of features for
Hadoop operations, in public and with the
community.
• Integrated
Ensure Hadoop operations integrate with
existing IT tools, behind a single pane of
glass.
• Intuitive
Make Hadoop’s most complex operational
challenges easy to manage.
Ambari 2.0.0 delivers
• Ambari on Windows
• native metrics and alerts
• rolling upgrade automation
Rolling Upgrades
Allow continuous operation and up-time for
applications and services on the cluster while
upgrading
• Single most critical feature for streamlining operations
• HDFS provides the ability to do this today…
remaining components need to follow
• Leverages native operating system tools and scripting
• Allow jobs in-flight to complete
• Provides support for rapid rollback
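The control flow behind the points above can be sketched as a loop that takes one node out of service at a time, lets in-flight work finish, upgrades the node, and rolls everything back if a health check fails. This is a generic illustration, not the mechanism HDP or Ambari actually uses; `rolling_upgrade` and its callback parameters are hypothetical names.

```python
def rolling_upgrade(nodes, drain, upgrade, health_check, rollback):
    """Upgrade nodes one at a time; roll back everything on a failed check.

    Returns (succeeded, upgraded_nodes). Only one node is out of service at
    any moment, so the remaining nodes keep serving requests.
    """
    upgraded = []
    for node in nodes:
        drain(node)      # let in-flight jobs on this node complete
        upgrade(node)
        upgraded.append(node)
        if not health_check(node):
            # Rapid rollback: undo in reverse order, newest first.
            for done in reversed(upgraded):
                rollback(done)
            return False, []
    return True, upgraded


if __name__ == "__main__":
    versions = {"n1": "2.1", "n2": "2.1", "n3": "2.1"}  # hypothetical cluster

    def set_new(node):
        versions[node] = "2.2"

    def set_old(node):
        versions[node] = "2.1"

    ok, done = rolling_upgrade(
        list(versions), lambda n: None, set_new, lambda n: True, set_old
    )
    print(ok, done, versions)
```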
Vision: Maximize Hadoop Deployment Choice
Deployment Choice
• Linux, Windows
• On-Premises, Cloud, Hybrid
“Tethered” Clusters
• Compatible services
• An explicit “connection”
Synchronized Datasets
• Efficient sharing & access
• Governance & lineage
[Diagram: tethered clusters: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster]
Cloudbreak with HDP
Cloudbreak
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
Periscope with HDP
[Diagram: an autoscaling policy applied to each cluster: BI / Analytics (Hive), IoT Apps (Storm, HBase, Hive), Data Science (Spark), Dev / Test (all HDP services)]
Periscope
• Policies based on any Ambari metrics
• Coordinates with YARN to achieve
elasticity based on the policies.
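A metric-based autoscaling policy of the kind described above can be modeled as a threshold rule that adds or removes nodes within fixed bounds. The sketch below is a generic model, not Periscope's API; `ScalingPolicy`, `evaluate`, the metric name `pending_containers` and the node limits are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    metric: str        # name of a cluster metric (hypothetical)
    threshold: float   # value the metric is compared against
    comparison: str    # "gt" fires above the threshold, "lt" below it
    adjustment: int    # nodes to add (positive) or remove (negative)


def evaluate(policies, metrics, current_nodes, min_nodes=1, max_nodes=10):
    """Return the node count after applying the first policy that fires."""
    for p in policies:
        value = metrics.get(p.metric)
        if value is None:
            continue  # metric not reported; skip this policy
        if p.comparison == "gt":
            fired = value > p.threshold
        else:
            fired = value < p.threshold
        if fired:
            return max(min_nodes, min(max_nodes, current_nodes + p.adjustment))
    return current_nodes


if __name__ == "__main__":
    rules = [
        ScalingPolicy("pending_containers", 50, "gt", 2),   # scale out
        ScalingPolicy("pending_containers", 5, "lt", -1),   # scale in
    ]
    print(evaluate(rules, {"pending_containers": 80}, current_nodes=4))
```

Clamping the result between `min_nodes` and `max_nodes` keeps a noisy metric from shrinking the cluster to nothing or growing it without limit.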
Thank You

Introduction to the Hadoop EcoSystem

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hortonworks – The Hadoop Ecosystem Fall 2014 Powering the Modern Data Architecture Shivaji Dutta – Sr. Partner Solutions Engineer
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Agenda Apache Hadoop and Hortonworks Data Platform (HDP) HDP and Couchbase What’s new in HDP?
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved What is Hadoop Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Projects in Hadoop Hadoop Core – Hadoop Common – Hadoop Distributed File System – Hadoop YARN – Hadoop Mapreduce Other Hadoop Key Projects • Hive • Hbase • Spark • Pig • Tez • Zookeper
  • 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDP delivers a comprehensive data management platform Hortonworks Data Platform 2.2 YARN: Data Operating System (Cluster Resource Management) 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° Script Pig SQL Hive TezTez Java Scala Cascading Tez ° ° ° ° ° ° ° ° ° ° ° ° ° ° Others ISV Engines HDFS (Hadoop Distributed File System) Stream Storm Search Solr NoSQL HBase Accumulo Slider Slider SECURITYGOVERNANCE OPERATIONSBATCH, INTERACTIVE & REAL-TIME DATA ACCESS In-Memory Spark Provision, Manage & Monitor Ambari Zookeeper Scheduling Oozie Data Workflow, Lifecycle & Governance Falcon Sqoop Flume Kafka NFS WebHDFS Authentication Authorization Accounting Data Protection Storage: HDFS Resources: YARN Access: Hive, … Pipeline: Falcon Cluster: Knox Cluster: Ranger Deployment ChoiceLinux Windows On-Premises Cloud YARN is the architectural center of HDP Enables batch, interactive and real-time workloads Provides comprehensive enterprise capabilities The widest range of deployment options Delivered Completely in the OPEN
  • 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HDFS and Yarn – The Core of Hadoop The core components of HDP are YARN and Hadoop Distributed Filesystem (HDFS). YARN is the architectural center of Hadoop that enables you to process data simultaneously in multiple ways. YARN provides the resource management and pluggable architecture for enabling a wide variety of data access methods. HDFS provides the scalable, fault-tolerant, cost-efficient storage for big data.
  • 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved YARN extends Hadoop into data center leaders YARN The Architectural Center of Hadoop • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases • Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.) YARN Ready Applications Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions YARN: Data Operating System (Cluster Resource Management) 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° Script Pig SQL Hive TezTez Java Scala Cascading Tez ° ° ° ° ° ° ° ° ° ° ° ° ° ° Others ISV Engines HDFS (Hadoop Distributed File System) Stream Storm Search Solr NoSQL HBase Accumulo Slider Slider BATCH, INTERACTIVE & REAL-TIME DATA ACCESS In-Memory Spark
  • 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Access YARN provides the foundation for a versatile range of processing engines that empower you to interact with the same data in multiple ways, at the same time. This means applications can interact with the data in the best way: from batch to interactive SQL or low latency access with NoSQL. Emerging use cases for data science, search and streaming are also supported with Apache Spark, Solr and Storm. Additionally, ecosystem partners provide even more specialized data access engines for YARN. YARN: Data Operating System (Cluster Resource Management) 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° Script Pig SQL Hive TezTez Java Scala Cascading Tez ° ° ° ° ° ° ° ° ° ° ° ° ° ° Others ISV Engines HDFS (Hadoop Distributed File System) Stream Storm Search Solr NoSQL HBase Accumulo Slider Slider BATCH, INTERACTIVE & REAL-TIME DATA ACCESS In-Memory Spark
  • 9. Data Governance and Integration
      • HDP extends data access and management with powerful tools for data governance and integration.
      • These tools provide a reliable, repeatable and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with tooling to ease and automate the application of schema or metadata on sources, is critical for successful integration of Hadoop into your modern data architecture.
      Tools: Apache Sqoop, Apache Oozie, Apache Falcon, Apache Flume
  • 10. Security
      • Authentication, authorization and encryption
      • Kerberos
      • SSL & SASL
      • Apache Knox
      • Apache Ranger
      • HDFS file/directory encryption
  • 11. Operations – Apache Ambari
      • Provision, manage and monitor Hadoop clusters
      • A complete set of operational capabilities that provides both visibility into the health of your cluster and tooling to manage configuration and optimize performance across all data access methods
      • Apache Ambari provides APIs to integrate with existing management systems, for instance Microsoft System Center and Teradata Viewpoint
  • 12. Enterprise Hadoop: Central Set of Services
      Enables Apache Hadoop to be an enterprise data platform with centralized services for:
      • Governance: load data and manage it according to policy
      • Operations: deploy and effectively manage the platform
      • Security: provide a layered approach to security through authentication, authorization, accounting and data protection
      Everything that plugs into Hadoop inherits these services.
      [Diagram: governance, operations (Ambari, Zookeeper, Oozie scheduling) and security services surrounding the YARN data access stack on HDFS: Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark and other ISV engines, with Tez and Slider.]
  • 13. HDP IS Apache Hadoop
      There is ONE Enterprise Hadoop: everything else is a vendor derivation.
      [Chart: component versions in Hortonworks Data Platform releases HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014), grouped into Data Management, Data Access, Governance & Integration, Security and Operations. Components shown: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr.]
      Version labels in the chart, in transcript order (the mapping to components and releases is not recoverable from this transcript): 2.2.0, 0.12.0, 0.12.0, 2.4.0, 0.12.1, 0.13.0, 0.96.1, 0.98.0, 0.9.1, 1.4.4, 1.3.1, 1.4.0, 1.4.4, 1.5.1, 3.3.2, 4.0.0, 3.4.5, 0.4.0, 4.0.0, 1.5.1, 0.5.0, 0.14.0, 0.14.0, 0.98.4, 1.6.1, 4.2, 0.9.3, 1.2.0, 0.6.0, 0.8.1, 1.4.5, 1.5.0, 1.7.0, 4.1.0, 0.5.0, 0.4.0, 2.6.0, 3.4.5, 0.4.0, 0.60, 4.7.2, 4.10.0, 0.5.1
      * Version numbers are targets and subject to change at time of general availability, in accordance with the ASF release process.
  • 14. The Partner Ecosystem
      • Deep partnerships: Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, Teradata, Red Hat, HP, SAS and SAP.
      • Broad partnerships: over 900 partners work with us to certify their applications to work with Hadoop, so they can extend big data to their users.
      [Diagram: HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN) surrounded by sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured), data systems (RDBMS, EDW, MPP, HANA), applications (BusinessObjects BI), operational tools, dev & data tools, and infrastructure.]
  • 15. Couchbase and HDP
      • Couchbase is primarily an online operational NoSQL datastore: low latency and scalable
      • It can act as both a source of data and a sink
      • Example source: pulling user profiles into Hadoop for deep analytics
      • Example sink: training machine learning models that are then cached in and served from Couchbase
  • 16. Couchbase and HDP
      • HDP-certified Sqoop connector for batch-mode export and import
      • Couchbase Kafka connector enables both producer and consumer scenarios
      • Community-supported Storm spout to persist data by writing to Couchbase Server
      • Developer preview Spark connector
      [Slide callout: “New!”]
  • 17. What’s New in HDP 2.2
      HDP 2.2: Delivering Apache Hadoop for the Enterprise
      New and improved YARN Ready engines:
      • Enterprise SQL at Hadoop scale with Stinger.next
      • Enterprise-ready Spark on YARN
      • Deep YARN integration for real-time engines: HBase, Accumulo, Storm
      • Enabling ISVs with a general SDK and API for direct YARN integration
      • The only solution to span real-time to micro-batch processing for analyzing the Internet of Things
      • Other engines and tools: Solr, Cascading
      Continued innovation of central enterprise services:
      • Centralized security administration and policy enforcement
      • Ease-of-use and operations agility features to speed cluster deployment
      • 100% uptime target with cluster rolling upgrades
      Expanded deployment options:
      • Enhanced business continuity with replication and archival across on-premises and cloud storage tiers (Azure Blob, S3)
      • Simultaneous shipping of Windows and Linux installs
      • Azure support expanded beyond HDInsight to include HDP for Windows or Linux in Azure VMs
  • 18. Stinger.next: Enterprise SQL at Hadoop Scale
      A continuation of the momentum built in the Apache Hive community to deliver enterprise SQL at Hadoop scale.
      HDP Stinger/Hive goals:
      • Speed: deliver sub-second query response times
      • Scale: the only SQL interface to Hadoop designed for queries that scale from gigabytes to terabytes and petabytes
      • SQL: enable transactions and SQL:2011 analytics
      Familiar three-phase delivery (Stinger delivered 390,000 lines of code to Apache Hive in 13 months, from 145 developers at 44 companies):
      • HDP 2.2 – Beyond Read Only: transactions with ACID, allowing insert, update and delete; temporary tables; cost-based optimizer for star and bushy join queries
      • Phase 2 – Sub Second: sub-second queries with LLAP; Hive-Spark machine learning integration; operational reporting with Hive streaming ingest and transactions
      • Phase 3 – Rich Analytics: SQL:2011 analytics; materialized views; cross-geo queries; workload management via YARN and LLAP integration
  • 19. Apache Spark
      • Apache Spark is an open source project for fast, large-scale data processing:
          – Simple and expressive programming model
          – Machine learning, graph computation and streaming
          – In-memory compute for iterative workloads
      • It does most of its processing in memory
      • It supports the programming languages Java, Scala and Python
      • It provides high-level modules:
          – MLlib
          – GraphX
          – Spark Streaming
          – Spark SQL
      • Cluster managers:
          – YARN (recommended)
          – Mesos
          – Spark's own (standalone)
  • 20. Spark Stack
  • 21. Enterprise Ready Spark for HDP 2.2 & beyond
      Deliver a reliable, managed, enterprise-grade Apache Spark that runs alongside other workloads in Hadoop via YARN.
      HDP Spark goals:
      • Integrated: enterprise-grade workload management and optimized multi-tenancy on YARN
      • Secure: extend comprehensive Hadoop security policy to Spark
      • Managed: provision, manage and monitor Spark along with other engines in Hadoop
      HDP 2.2 – Spark on YARN:
      • Integrated: Hive 0.13 support
      • Integrated: basic ORC file support
      Phase 2 – Spark for HDP 2.2:
      • Managed: deployment best practices with YARN Node Labels
      • Managed: Ambari stack definition (install, start, stop, config, quick links to the Spark UI)
      • Security: Spark certification on a Kerberized cluster
      • Security: authentication in the Spark UI against LDAP
      Phase 3 – Beyond:
      • Managed: enhanced workload management and improved debuggability
      • Managed: Spark logs published to the YARN Application Timeline
      • Security: wire encryption and authorization with XA/Argus
      • Enhanced ORC support
  • 22. HDP 2.2 Delivers more YARN Ready Engines
      Bringing more applications and services to YARN and making ISV adoption easier:
      • Complete work for Pig with Tez
      • Cascading with Tez for Java and Scala apps
      • Integration of Spark on YARN
      • Kafka for inbound messaging to Storm and Spark: the widest range from real-time to micro-batch for the Internet of Things
      Slider introduces native YARN integration for applications with long-running services:
      • HBase, Accumulo, Storm
      • SDK for third-party ISVs
      [Diagram: YARN data access stack on HDFS, showing Pig, Hive and Cascading on Tez; Storm, Solr, HBase, Accumulo and Kafka on Slider; Spark (in-memory); and other ISV engines. A marker indicates engines that are “new to HDP” in 2.2; all engines have been updated.]
  • 23. Security in HDP 2.2
      Continue investing in central security policy for authentication, authorization, audit and data protection.
      HDP security goals:
      • Comprehensive security: meet all security requirements across authentication, authorization, audit and data protection for all HDP components
      • Central administration: provide central administration of security policy, and of viewing and managing audit, across the platform
      • Consistent integration: integrate with other security and identity management systems, for compliance with IT policies
      HDP 2.2 new features:
      • Extend authorization with Apache Ranger
      • Breadth: Knox and Storm integrations
      • Policy enforcement at depth: Hive, HDFS and HBase integrations
      • Documentation to support community development and the partner ecosystem
      Apache Hadoop advances:
      • TP: HDFS Transparent Encryption (HDFS-6134)
      • Key Management Server (HADOOP-10433)
      • Key Provider API (HADOOP-10141)
  • 24. Streamlining Operations in HDP 2.2
      Apache Ambari is advancing at light speed to enable the IT operator to manage clusters more easily.
      HDP operations goals:
      • Open: deliver a complete set of features for Hadoop operations, in public and with the community
      • Integrated: ensure Hadoop operations integrate with existing IT tools, behind a single pane of glass
      • Intuitive: make Hadoop’s most complex operational challenges easy to manage
      Apache Ambari 1.7.0 delivers:
      • Views: a common, secure and extensible approach to the user interface for operators, system administrators, application developers, data workers and ISVs
      • Blueprints: create and manage cluster templates for easy deployment
      Ambari 2.0.0 delivers:
      • Ambari on Windows
      • Native metrics and alerts
      • Rolling upgrade automation
  • 25. Rolling Upgrades
      Allow continuous operation and uptime for applications and services on the cluster while upgrading.
      • The single most critical feature for streamlining operations
      • HDFS can do this today; the remaining components need to follow
      • Leverages native operating system tools and scripting
      • Allows in-flight jobs to complete
      • Provides support for rapid rollback
  • 26. Vision: Maximize Hadoop Deployment Choice
      • Deployment choice: Linux, Windows; on-premises, cloud, hybrid
      • “Tethered” clusters: compatible services; an explicit “connection”
      • Synchronized datasets: efficient sharing and access; governance and lineage
      [Diagram labels: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster, Learn]
  • 27. Cloudbreak with HDP
      1. Pick a Blueprint
      2. Choose a Cloud
      3. Launch HDP!
      Example Ambari Blueprints:
      • IoT Apps (Storm, HBase, Hive)
      • BI / Analytics (Hive)
      • Data Science (Spark)
      • Dev / Test (all HDP services)
  • 28. Periscope with HDP
      Autoscaling policy:
      • Policies can be based on any Ambari metric
      • Periscope coordinates with YARN to achieve elasticity based on those policies
      [Diagram: example clusters: BI / Analytics (Hive), IoT Apps (Storm, HBase, Hive), Dev / Test (all HDP services), Data Science (Spark)]
  • 29. Thank You
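
Slides 24 and 27 describe Ambari Blueprints as cluster templates. As a rough sketch of what such a template might look like, the Python below builds a small blueprint document as a dictionary and serializes it to JSON. The blueprint name, host group names and component selection are hypothetical, and the `build_blueprint` helper is not part of Ambari. The overall layout (a `Blueprints` section naming the stack, plus a list of `host_groups`) is assumed from the general shape of Ambari's blueprint format; check the Ambari documentation for the exact schema of your version.

```python
import json


def build_blueprint(name, stack_version, host_groups):
    """Return a blueprint-style dict (layout assumed from Ambari's blueprint format)."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group_name,
                "cardinality": str(spec["count"]),
                "components": [{"name": c} for c in spec["components"]],
            }
            for group_name, spec in host_groups.items()
        ],
    }


# Hypothetical "Dev / Test" template: one master host and three workers.
dev_test = build_blueprint(
    name="dev-test-sketch",
    stack_version="2.2",
    host_groups={
        "master": {
            "count": 1,
            "components": ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"],
        },
        "worker": {
            "count": 3,
            "components": ["DATANODE", "NODEMANAGER"],
        },
    },
)

# The JSON text is what would be submitted to a provisioning tool.
blueprint_json = json.dumps(dev_test, indent=2)
print(blueprint_json)
```

Keeping the template as plain data is the point of the sketch: the same document can be reused to launch an identical cluster on different infrastructure, which matches the "pick a blueprint, choose a cloud" flow on slide 27.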

Editor's Notes

  1. The Hortonworks Data Platform therefore addresses all of these capabilities, completely in open source. YARN is the architectural center of HDP and Hadoop: it not only enables multiple data access engines across batch, interactive and real-time workloads to work on a single set of data, but also extends Hadoop to integrate with the systems and tools you already have in your data center. HDP delivers on the key enterprise requirements of governance, security and operations. And of course it is supported on the widest possible range of deployment options: Linux, Windows (the only Hadoop offering on Windows), appliance (from Microsoft or Teradata) or cloud (Microsoft, Rackspace and more). HDP is a comprehensive data management platform with one goal in mind: to enable an enterprise architecture with Hadoop.
  2. Finally, there is only ONE Apache Hadoop. Every other package of Hadoop is a vendor derivation of the platform. At Hortonworks, everything we package in HDP comes from the very latest components at the Apache Software Foundation. This ensures that our customers have access to the very latest innovation from the community, to which we then apply enterprise software rigor in the build, test and release process to create HDP. HDP “IS” Apache Hadoop: it is not a vendor derivative that has been forked and modified. It IS Apache Hadoop, no additions, no hold-backs. When comparing vendors' Hadoop offerings, it is critical to understand this picture, as it makes clear where vendors are diverging from the community approach and ultimately locking customers out of community innovation.
  3. The modern data architecture simply does not work unless it integrates with the systems and tools you already deploy. HDP enables your existing data platforms to expand the data you have under management through integration. The goal of HDP is to augment, not replace, these existing systems, as we very clearly understand that you need to reuse skills. Further, through our work within the Hadoop community to deliver YARN, we have opened up Hadoop and unlocked innovation in the community: data center ISVs can extend their applications to run natively IN Hadoop as just another workload operating on a single data lake. They can now function as first-class citizens alongside any other workload in Hadoop.
  4. Cloudbreak is the infrastructure-agnostic and secure Hadoop-as-a-Service API for multi-tenant clusters. This technology first appeared as a beta in July 2014 and marked the first collaboration between Hortonworks and the SequenceIQ team. We leveraged the extensibility of Apache Ambari via Blueprints to deliver this easy-to-use deployment technology. Hadoop provisioning using Docker containers was also presented at Hadoop Summit 2014 in San Jose, and we observed an overwhelmingly positive reception from the open source community for these innovations.
  5. The SequenceIQ team developed Periscope to bring policy-based autoscaling to Hadoop. Periscope helps ensure that you can meet your service level agreements while running your applications. Just like Cloudbreak, Periscope is built atop Apache Ambari and Apache Hadoop YARN, and it leverages the latest cutting-edge features of these projects.
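
The policy-based autoscaling described for Periscope can be pictured with a small sketch. The Python below is not Periscope's API: the `ScalingPolicy` class, the `desired_nodes` function, the metric name and the thresholds are all hypothetical, and the metric values are passed in by hand rather than read from Ambari. It only illustrates the idea of comparing a metric against a policy and deriving a bounded node count.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical threshold policy; not part of Periscope or Ambari."""
    metric: str               # name of the metric the policy watches
    scale_up_above: float     # add nodes when the metric exceeds this value
    scale_down_below: float   # remove nodes when the metric falls below this value
    step: int                 # nodes added or removed per decision
    min_nodes: int            # lower bound on cluster size
    max_nodes: int            # upper bound on cluster size


def desired_nodes(policy, metrics, current_nodes):
    """Return the node count the policy asks for, clamped to its bounds."""
    value = metrics[policy.metric]
    if value > policy.scale_up_above:
        target = current_nodes + policy.step
    elif value < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy(
    metric="pending_containers",  # assumed metric name, for illustration only
    scale_up_above=50,
    scale_down_below=5,
    step=2,
    min_nodes=3,
    max_nodes=10,
)

# Heavy backlog: grow by one step.
print(desired_nodes(policy, {"pending_containers": 80}, current_nodes=4))
# Nearly idle: shrink, but never below min_nodes.
print(desired_nodes(policy, {"pending_containers": 1}, current_nodes=4))
# Within the thresholds: keep the current size.
print(desired_nodes(policy, {"pending_containers": 20}, current_nodes=4))
```

A real autoscaler would also need cooldown periods and coordination with YARN before removing nodes that still run containers; the sketch leaves both out.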