Introduction to the Hadoop EcoSystem
1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hortonworks – The Hadoop Ecosystem
Fall 2014
Powering the Modern Data Architecture
Shivaji Dutta – Sr. Partner
Solutions Engineer
2.
Agenda
Apache Hadoop and Hortonworks Data Platform (HDP)
HDP and Couchbase
What’s new in HDP?
3.
What is Hadoop
Apache Hadoop is an open-source software framework for distributed
storage and distributed processing of very large data sets on computer
clusters built from commodity hardware.
4.
Projects in Hadoop
Hadoop Core
– Hadoop Common
– Hadoop Distributed File System
– Hadoop YARN
– Hadoop MapReduce
Other Hadoop Key Projects
• Hive
• HBase
• Spark
• Pig
• Tez
• Zookeeper
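The MapReduce entry in the list above names a programming model that is easier to follow with a tiny example. The sketch below imitates the map, shuffle and reduce steps in a single Python process; it does not use Hadoop, and the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network.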
5.
HDP delivers a comprehensive data management platform
Hortonworks Data Platform 2.2
[Diagram: Hortonworks Data Platform 2.2 architecture]
Core: HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Cluster Resource Management)
Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez and Slider appear as layers beneath several engines
Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
Deployment Choice: Linux, Windows, On-Premises, Cloud
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
6.
HDFS and YARN – The Core of Hadoop
The core components of HDP are YARN and
Hadoop Distributed Filesystem (HDFS).
YARN is the architectural center of Hadoop
that enables you to process data
simultaneously in multiple ways. YARN
provides the resource management and
pluggable architecture for enabling a wide
variety of data access methods.
HDFS provides the scalable, fault-tolerant,
cost-efficient storage for big data.
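HDFS can also be reached over HTTP through WebHDFS, which appears in the platform diagram. As a minimal sketch, the helper below builds a WebHDFS URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OP>`. The host name and path are placeholders, and port 50070 is assumed as the NameNode HTTP port of a default Hadoop 2 setup; nothing is sent over the network.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(namenode_host, hdfs_path, op, port=50070, **params):
    """Build a WebHDFS REST URL for one operation on one HDFS path.

    namenode_host and port are assumptions about the target cluster;
    extra keyword arguments become additional query parameters.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{namenode_host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# List the contents of a (hypothetical) directory.
print(webhdfs_url("namenode.example.com", "/data/logs", "LISTSTATUS"))
```

The resulting URL can be passed to any HTTP client; a secured cluster would additionally require authentication.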
7.
YARN extends Hadoop into data center leaders
YARN
The Architectural
Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools
(e.g. SAS, Syncsort, Actian)
YARN Ready Applications
Facilitates ongoing innovation and enterprise adoption via
ecosystem of new and existing “YARN Ready” solutions
[Diagram: batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
8.
Data Access
YARN provides the foundation for a versatile
range of processing engines that empower
you to interact with the same data in multiple
ways, at the same time.
This means applications can interact with the
data in the best way: from batch to interactive
SQL or low latency access with NoSQL.
Emerging use cases for data science, search
and streaming are also supported with
Apache Spark, Solr and Storm.
Additionally, ecosystem partners provide even
more specialized data access engines for
YARN.
[Diagram: batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
9.
Data Governance and Integration
• HDP extends data access and management with powerful tools for data governance and integration.
• They provide a reliable, repeatable, and simple framework for managing the flow of data in and out of Hadoop. This control structure, along with a set of tooling to ease and automate the application of schema or metadata on sources, is critical for successful integration of Hadoop into your modern data architecture.
• Apache Sqoop
• Apache Oozie
• Apache Falcon
• Apache Flume
10.
Security
• Authentication, Authorization and Encryption
• Kerberos
• SSL & SASL
• Apache Knox
• Apache Ranger
• HDFS File/Directory Encryption
11.
Operations – Apache Ambari
• Provision, manage and monitor Hadoop clusters
• A complete set of operational capabilities that provide both visibility into the health of your cluster and tooling to manage configuration and optimize performance across all data access methods.
• Apache Ambari provides APIs to integrate with existing management systems, for instance Microsoft System Center and Teradata Viewpoint
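To make the API point concrete, here is a minimal sketch of how an external management tool might prepare a call to Ambari's REST API. The host, cluster name and credentials are placeholders, and port 8080 is assumed as the default Ambari server port; the request is only constructed, not sent.

```python
import base64
import urllib.request


def ambari_request(host, path, user, password, port=8080):
    """Prepare (but do not send) a GET request against the Ambari REST API.

    All arguments are assumptions about the target installation.
    """
    url = f"http://{host}:{port}/api/v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying calls; it is harmless on GET.
    request.add_header("X-Requested-By", "ambari")
    return request


# List the services of a hypothetical cluster called "demo".
req = ambari_request("ambari.example.com", "clusters/demo/services", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the call and return JSON that a monitoring system could parse.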
12.
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
Governance: Load data and manage according to policy
Operations: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)
Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
[Diagram: batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines), with Tez and Slider, on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
13.
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2
Hadoop & YARN
Pig
Hive & HCatalog
HBase
Sqoop
Oozie
Zookeeper
Ambari
Storm
Flume
Knox
Phoenix
Accumulo
2.2.0
0.12.0
0.12.0
2.4.0
0.12.1
Data
Management
0.13.0
0.96.1
0.98.0
0.9.1
1.4.4
1.3.1
1.4.0
1.4.4
1.5.1
3.3.2
4.0.0
3.4.5
0.4.0
4.0.0
1.5.1
Falcon
0.5.0
Ranger
Spark
Kafka
0.14.0
0.14.0
0.98.4
1.6.1
4.2
0.9.3
1.2.0
0.6.0
0.8.1
1.4.5
1.5.0
1.7.0
4.1.0
0.5.0
0.4.0
2.6.0
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
3.4.5
Tez
0.4.0
Slider
0.60
HDP 2.0: October 2013
HDP 2.1: April 2014
HDP 2.2: October 2014
Solr
4.7.2
4.10.0
0.5.1
Data Access
Governance
& Integration
Security
Operations
14.
The Partner EcoSystem
[Diagram: modern data architecture]
Sources: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
Data System: RDBMS, EDW, MPP, HANA
Applications: BusinessObjects BI
Surrounding layers: Operational Tools, Dev & Data Tools, Infrastructure
Deep Partnerships
Hortonworks engages
in deep engineered relationships
with the leaders in the data center,
such as Microsoft, Teradata, Redhat,
HP, SAS & SAP
Broad Partnerships
Over 900 partners work with us to
certify their applications to work with
Hadoop so they can extend big data
to their users
HDP 2.1: Governance & Integration, Security, Operations, Data Access, Data Management, YARN
15.
Couchbase and HDP
• Couchbase is primarily an online operational NoSQL datastore: low latency, scalable
• Source of data and also a sink
• Example source: Pulling user profiles into Hadoop for deep analytics
• Example sink: training machine learning models that are then cached / served from Couchbase
16.
Couchbase and HDP
• HDP Certified Sqoop connector for batch mode export / import
• Couchbase Kafka connector enables both Producer and Consumer scenarios
• Community supported Storm spout to persist data by writing to Couchbase Server
• Developer Preview Spark Connector (New!)
17.
What’s New in HDP 2.2
New and Improved YARN
Ready Engines
• Enterprise SQL at Hadoop Scale with
Stinger.next
• Enterprise Ready Spark on YARN
• Deep YARN integration for real-time
engines: HBase, Accumulo, Storm
• Enabling ISVs with a general SDK and
API for direct YARN integration
• Only solution to provide real-time to micro
batch for analyzing the internet of things
• Other engines/tools: Solr, Cascading
Continued Innovation of
Central Enterprise Services
• Centralized security administration
and policy enforcement
• Ease of use and operations agility
features to speed cluster deployment
• 100% uptime target with cluster rolling
upgrades
Expanded Deployment Options
• Enhanced business continuity with
replication/archival across on-premises
and cloud storage tiers (Azure Blob, S3)
• Simultaneous ship of Windows and
Linux installs
• Expand Azure support beyond Azure HDInsight to include HDP for Windows or Linux in Azure VMs
HDP 2.2
Delivering Apache Hadoop for the Enterprise
18.
Stinger.next: Enterprise SQL at Hadoop Scale
A continuation of momentum built in
Apache Hive Community to deliver
Enterprise SQL at Hadoop scale
HDP Stinger/Hive Goals:
• Speed
Deliver sub-second query response times
• Scale
The only SQL interface to Hadoop
designed for queries that scale from
Gigabytes, to Terabytes and Petabytes
• SQL
Enable transactions and SQL:2011 Analytics
Familiar three phase delivery
Stinger delivered 390,000 lines of code to Apache
Hive in 13 months from 44 companies, 145
developers
HDP 2.2 – Beyond Read Only
• Transactions with ACID, allowing insert, update & delete
• Temporary tables
• Cost Based Optimizer for star & bushy join queries
Phase 2 – Sub Second
• Sub-second queries with LLAP
• Hive-Spark Machine Learning integration
• Operational reporting w/ Hive streaming ingest & transactions
Phase 3 – Rich Analytics
• SQL:2011 Analytics
• Materialized views
• Cross-geo queries
• Workload management via YARN and
LLAP integration
HDP2.2
Security
Operations
Governance
Access
Management
YARN
19.
Apache Spark
• Apache Spark is an open source project for fast and large scale data processing.
– Simple and expressive programming model
– Machine learning, graph computation and Streaming
– in-memory compute for iterative workloads
• It does most of the processing in memory
• It supports several programming languages
– Java, Scala and Python
• It provides high-level modules
– MLlib
– GraphX
– Spark Streaming
– Spark SQL
• Cluster Manager
– YARN (recommended)
– Mesos
– Spark's own standalone manager
21.
Enterprise Ready Spark for HDP 2.2 & beyond
HDP 2.2 – Spark on YARN
• Integrated: Hive 0.13 support
• Integrated: Basic ORCfile support
Phase 2 – Spark for HDP 2.2
• Managed: Deployment best practices with YARN Node Labels
• Managed: Ambari Stack Definition:
Install/Start/Stop/Config/Quick links to Spark UI
• Security: Spark certification on Kerberized Cluster
• Security: Authentication in Spark UI against LDAP
Phase 3 - Beyond
• Managed: Enhanced workload management & improved debuggability
• Managed: Spark logs published to YARN Application Timeline
• Security: Wire Encryption and Authorization with XA/Argus
• Enhanced ORC support
Deliver a reliable and managed,
enterprise grade Apache Spark that
will run alongside other workloads in
Hadoop via YARN
HDP Spark Goals:
• Integrated
Enterprise-grade Workload Management
& Optimized multi-tenancy on YARN
• Secure
Extend comprehensive Hadoop security
policy to Spark
• Managed
Provision, manage and monitor Spark
along with other engines in Hadoop
22.
HDP 2.2 Delivers more YARN Ready Engines
Bringing more applications and services to YARN and making ISV adoption easier
• Complete work for Pig with Tez
• Cascading with Tez for Java and Scala apps
• Integration of Spark on YARN
• Kafka for inbound messaging to Storm & Spark – widest range from real-time to micro batch for internet of things
Slider introduces native YARN integration for applications with long running services
• HBase, Accumulo, Storm
• SDK for 3rd-party ISVs
[Diagram: engines on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Kafka, Others (ISV Engines), with Tez and Slider layers. A marker indicates “new to HDP” in 2.2. All engines have been updated.]
23.
Security in HDP 2.2
HDP 2.2 New Features
• Extend Authorization with Apache Ranger
• Breadth: Knox and Storm integrations
• Policy enforcement at depth:
Hive, HDFS and HBase integrations
• Documentation to support community
development and partner ecosystem
• Apache Hadoop Advances
• TP: Transparent Encryption in HDFS – HDFS-6134
• Key Management Server - HADOOP-10433
• Key Provider API - HADOOP-10141
Continue investments in central security policy for authentication, authorization, audit, and data protection
HDP Security Goals:
• Comprehensive Security
Meet all security requirements across
authentication, authorization, audit & data
protection for all HDP components.
• Central Administration
Provide central administration of security
policy and for viewing and managing audit
across the platform.
• Consistent Integration
Integrate with other security and identity
management systems, for compliance with IT
policies.
24.
Streamlining Operations in HDP 2.2
Apache Ambari 1.7.0 Delivers
• Views
A common, secure, and extensible approach for the
user interface for Operators, System Administrators,
Application Developers, Data Workers and ISVs
• Blueprints
Create and manage cluster templates for easy
deployment
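As an illustration of what a cluster template looks like, the sketch below assembles a small blueprint document as a Python dict and prints it as JSON. The layout (a `Blueprints` header plus `host_groups` with `components`) follows the general shape of Ambari blueprints, but the blueprint name, group name and component selection are illustrative assumptions, and the result has not been validated against an Ambari server.

```python
import json


def make_blueprint(name, stack_version, host_groups):
    """Return an Ambari-style blueprint document as a dict.

    host_groups maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": str(cardinality),
                "components": [{"name": c} for c in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }


# A hypothetical single-node layout; a real blueprint needs more components.
blueprint = make_blueprint(
    "single-node-demo",
    "2.2",
    {"host_group_1": (1, ["NAMENODE", "DATANODE", "RESOURCEMANAGER",
                          "NODEMANAGER", "ZOOKEEPER_SERVER"])},
)
print(json.dumps(blueprint, indent=2))
```

Keeping the template as data means the same layout can be reused to create several identical clusters.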
Apache Ambari is advancing at light
speed to enable the IT operator to
more easily manage clusters
HDP Operations Goals:
• Open
Deliver a complete set of features for
Hadoop operations, in public and with the
community.
• Integrated
Ensure Hadoop operations integrate with
existing IT tools, behind a single pane of
glass.
• Intuitive
Make Hadoop’s most complex operational
challenges easy to manage.
Ambari 2.0.0 delivers
• Ambari on Windows
• native metrics and alerts
• rolling upgrade automation
25.
Rolling Upgrades
Allow continuous operation and up-time for
applications and services on the cluster while
upgrading
• Single most critical feature for streamlining operations
• HDFS provides the ability to do this today…
remaining components need to follow
• Leverages native operating system tools and scripting
• Allow jobs in-flight to complete
• Provides support for rapid rollback
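The control flow behind these bullets can be sketched in a few lines. This is a hypothetical illustration, not how HDP or Ambari implements rolling upgrades: the `drain`, `upgrade`, `healthy` and `rollback` callables are stand-ins that the caller would supply.

```python
def rolling_upgrade(nodes, drain, upgrade, healthy, rollback):
    """Upgrade nodes one at a time; undo everything on the first failure.

    Each callable takes a node name. Returns the list of nodes left on
    the new version (empty if the upgrade was rolled back).
    """
    upgraded = []
    for node in nodes:
        drain(node)  # let in-flight work on this node finish first
        upgrade(node)
        upgraded.append(node)
        if not healthy(node):
            # Rapid rollback: revert in reverse order so the cluster
            # returns to its previous state.
            for done in reversed(upgraded):
                rollback(done)
            return []
    return upgraded


# Simulated run in which the second node fails its health check.
events = []
result = rolling_upgrade(
    ["node1", "node2", "node3"],
    drain=lambda n: events.append(("drain", n)),
    upgrade=lambda n: events.append(("upgrade", n)),
    healthy=lambda n: n != "node2",
    rollback=lambda n: events.append(("rollback", n)),
)
print(result, events)
```

Because only one node is taken out at a time, the remaining nodes keep serving applications throughout the upgrade.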
26.
Vision: Maximize Hadoop Deployment Choice
Deployment Choice
• Linux, Windows
• On-Premises, Cloud, Hybrid
“Tethered” Clusters
• Compatible services
• An explicit “connection”
Synchronized Datasets
• Efficient sharing & access
• Governance & lineage
[Diagram labels: Development & POC Cluster, Production Cluster, BI or ML Cluster, Backup & Archive Cluster, Learn]
27.
Cloudbreak with HDP
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
28.
Periscope with HDP
[Diagram: Periscope with an Autoscaling Policy over clusters for BI / Analytics (Hive), IoT Apps (Storm, HBase, Hive), Dev / Test (all HDP services) and Data Science (Spark)]
• Policies based on any Ambari metrics
• Coordinates with YARN to achieve elasticity based on the policies.
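The idea of a metric-driven scaling policy can be sketched as follows. This is a hypothetical model written for illustration: the `ScalingPolicy` fields, the `pending_containers` metric name and the `target_size` function are assumptions, not Periscope's actual API.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    metric: str        # name of a monitored metric (hypothetical)
    threshold: float   # value that triggers the policy
    adjustment: int    # nodes to add (positive) or remove (negative)
    min_nodes: int = 1
    max_nodes: int = 10


def target_size(policy, current_nodes, metrics):
    """Return the node count the policy asks for, clamped to its limits."""
    value = metrics.get(policy.metric)
    if value is None:
        return current_nodes
    # Scale-up policies fire above the threshold, scale-down policies below it.
    if policy.adjustment > 0:
        fires = value > policy.threshold
    else:
        fires = value < policy.threshold
    if not fires:
        return current_nodes
    wanted = current_nodes + policy.adjustment
    return max(policy.min_nodes, min(policy.max_nodes, wanted))


scale_up = ScalingPolicy(metric="pending_containers", threshold=50, adjustment=2)
print(target_size(scale_up, 4, {"pending_containers": 80}))
```

A real autoscaler would read the metric from the monitoring system and ask the resource manager to add or release nodes; here both sides are reduced to plain values.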
Editor's Notes
The Hortonworks Data Platform therefore addresses all of these capabilities – completely in Open Source.
YARN is the architectural center of HDP and Hadoop that not only enables multiple data access engines across batch, interactive and real-time to all work on a single set of data but also extends Hadoop to integrate with the existing systems and tools you already have in your data center.
HDP delivers on the key enterprise requirements of governance, security and operations.
And of course it is supported on the widest possible range of deployment options: from Linux, to Windows (the only Hadoop offering on Windows), appliance (from Microsoft or Teradata) or Cloud (Microsoft, Rackspace and more).
HDP is a comprehensive data management platform with one goal in mind: to enable an enterprise architecture with Hadoop.
Finally, there is only ONE Apache Hadoop. Every other package of Hadoop is a vendor derivation of the platform.
At Hortonworks, everything we package in HDP is from the very latest components at the Apache Software Foundation. This ensures that our customers have access to the very latest innovation from the community, to which we then apply enterprise software rigor in the build, test and release process to create HDP.
HDP “IS” Apache Hadoop – it is not a vendor derivative that has been forked and modified, it IS Apache Hadoop, no additions, no hold-backs.
When comparing vendors' Hadoop offerings, it is critical to understand this picture, as it makes clear where vendors are diverging from the community approach and ultimately locking customers out of community innovation.
The modern data architecture simply does not work unless it integrates with the systems and tools you already deploy. HDP enables your existing data platforms to expand the data you have under management through integration. The goal of HDP is to augment, not replace, these existing systems, as we very clearly understand that you need to reuse skills.
Further, through our work within the Hadoop community to deliver YARN, we have opened up Hadoop and unlocked innovation in the community: data center ISVs can extend their applications so that they run natively in Hadoop as just another workload operating on the single data lake. They can now function as first-class citizens alongside any other workload in Hadoop.
Cloudbreak is the infrastructure-agnostic and secure Hadoop as a Service API for multi-tenant clusters.
This technology first appeared as a beta in July 2014 and marked the first collaboration between Hortonworks and the SequenceIQ team.
We leveraged the extensibility of Apache Ambari via Blueprints to deliver this easy to use deployment technology.
Hadoop provisioning using Docker containers was also presented at Hadoop Summit 2014 in San Jose, and we observed an overwhelmingly positive reception from the open source community for these innovations.
The SequenceIQ team developed Periscope to bring policy-based autoscaling to Hadoop.
Periscope ensures that you can meet your service level agreements while running your applications.
Just like Cloudbreak, Periscope is built atop Apache Ambari and Apache Hadoop YARN, and it leverages the latest cutting edge features of these projects.