IBM Cloud Paris meetup 20180213 - Hortonworks
- 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks: Enabling the Modern Data Architecture
Our mission continues…
→ Make Hadoop an enterprise-viable data platform
→ Bring all data under management – all sources and types
→ Extend to Global Data Management
Hortonworks' consistent and continuous track record of innovation
- 3.
Global Data Management With Hortonworks

DATAPLANE SERVICE (DPS) – manage, govern, and secure multiple clusters and sources (multi/hybrid)
• Extensible services: Data Lifecycle Manager, Data Steward Studio*, Data Analytics Studio*, Cloudbreak*, IBM DSX*, ISV services
• (*not yet available, coming soon)

CONNECTED DATA PLATFORMS
• Hortonworks Data Platform (HDP®) – data-at-rest
• Hortonworks DataFlow (HDF™) – data-in-motion

MODERN DATA USE CASES
• EDW optimization, cyber security, data science, advanced analytics, partner solutions, IoT/streaming analytics

HORTONWORKS CONNECTION
• Enterprise support, premier support, educational services, professional services, community connection

HORTONWORKS PLATFORM SERVICES
• Operational services, SmartSense™
- 4.
Data Lake 3.0: Scale & Cut TCO; Deploy in Minutes
Pre-Data Lake 3.0:
• Siloed – multiple HDP deployments with low utilization
• Longer time to deployment (days)

Data Lake 3.0:
• Scale – tens of billions of files, thousands of nodes
• Data efficiency – 2x storage reduction
• Compute efficiency – Dockerized micro-services on demand (separate compute & storage)
• Generic file system – Hadoop APIs, S3 APIs, block storage
• 3rd party – Kubernetes, ISVs such as IBM DSX

HDP 3.0 (bare-metal shared services in the physical cluster: Ambari, YARN, HDFS):
• Dockerized ephemeral HDP clusters
• Dockerized multi-version Hadoop APIs
• Dockerized stateless microservices

Post HDP 3.0 (bare-metal shared services in the physical cluster: Ambari, YARN, HDFS, Ozone):
• Dockerized HDP clusters
• Dockerized Kubernetes clusters
• Dockerized stateful microservices
• CBS (Container Block Storage) – Hadoop APIs, S3 APIs
- 5.
Data Lake 3.0 Architecture
• Hortonworks Data Plane Service (DPS) user interface; deployable on EC2, Google, IBM
• Application/container registry with templatized apps (app#1, app#2, app#3)
• ISVs (modern data applications): cyber security, IoT apps
• YARN API – container scheduler for data-intensive apps in a shared cluster
  (queue/app priority, user preemption, SLA guarantees, GPU isolation/pooling)
• Platform services: service discovery, app logs, app metrics & monitoring, app security & governance
• Storage tiers:
  – Hot tier: fast SSD (3x replica)
  – Cold tier: erasure coding with 1.5x overhead, archive HDD
• GPU resources
- 7. 7 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
HDP 3.0 Focus Areas
• Business Agility
• Allow Hive / Spark / Zeppelin to innovate faster than core
• 3.0 Adoption
• Time to Market – Hadoop 3.0
• Upgradability - Painless for users to take on 3.0
• Scalability, Containerized Micro-Services & Storage Efficient Archive
• NameNode Federation/ViewFS provides a way to scale the Hadoop solution
• Containerization brings more workloads onto the platform, and data-science workloads can control their own dependencies
• Erasure Coding allows keeping more data for longer
• GPUs for AI & Deep Learning
• GPU Scheduling
• Docker Container Example with DL Libraries
• 3rd Party tools
• DSX
- 8.
HDP 3.0 – Erasure Coding Enables Active Archive
⬢ Erasure Coding in HDP 3.0
– Default: Reed-Solomon, 6 data + 3 parity
– 2x storage savings (1.5x vs. 3x with replication)
– Tolerates up to 3 node failures (continues writing with 6 data nodes)
– Supports Hive/Tez queries on data written directly to an EC zone
⬢ Other Notable Items
– Intra-node disk balancing
– Data tiering with policy
– Optimizations around slow disks/network
– Support for more than 2 NameNodes
[Diagram: storage tiering of files/directories across media and data nodes – Hot (default): N replicas on SSD (SSD: 3); Warm: N replicas (Disk: 1, Archive: 2); Cold: erasure coding on archive high-density drives (Archive: 1.5), with 6 data blocks (b1–b6) and 3 parity blocks (P1–P3) spread across data nodes]
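The storage arithmetic behind the "2x savings" claim follows directly from the block counts on the slide; a minimal sketch (function names are ours, not from the deck):

```python
# Storage overhead comparison: RS(6,3) erasure coding vs. 3x replication,
# using the figures on the slide (6 data + 3 parity blocks).

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte under N-way replication."""
    return float(replicas)

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte under Reed-Solomon EC."""
    return (data_blocks + parity_blocks) / data_blocks

rep = replication_overhead(3)   # 3.0x for default HDFS replication
ec = ec_overhead(6, 3)          # 1.5x for RS(6,3)
savings = rep / ec              # 2.0x storage savings, as stated on the slide

print(f"replication: {rep}x, EC: {ec}x, savings: {savings}x")
```

The same RS(6,3) layout also explains the fault tolerance: any 6 of the 9 blocks suffice to reconstruct the data, so up to 3 node failures can be tolerated.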
- 9.
Ozone – Why an Object Store?
• With workloads like IoT, we are looking at scaling to trillions of objects.
• Apache HDFS is designed for large objects – not for many small objects.
• Small files create memory pressure on the NameNode.
• Each small file creates a block on the DataNode.
• DataNodes send all block information to the NameNode in block reports.
• Both of these create scalability issues on the NameNode.
• Metadata in memory is the strength of the original GFS and HDFS design, but also its weakness in scaling the number of files and blocks.
• An object store has simpler semantics than a file system and is easier to scale.
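The memory-pressure point above can be made concrete with a back-of-the-envelope estimate. The ~150 bytes of NameNode heap per namespace object is a commonly cited rule of thumb, not a figure from this deck:

```python
# Rough NameNode heap estimate for a small-files workload.
# Assumption (not from the slide): each inode or block object costs
# roughly 150 bytes of NameNode heap, a commonly cited rule of thumb.
BYTES_PER_OBJECT = 150

def namenode_heap_gib(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate NameNode heap (GiB) for the given namespace size."""
    objects = num_files * (1 + blocks_per_file)   # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 2**30

# A billion small files (one block each) already needs on the order of
# 280 GiB of heap on a single server, which is why scaling to trillions
# of objects needs a different metadata design.
print(f"{namenode_heap_gib(1_000_000_000):.0f} GiB")
```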
- 10.
Ozone – Why an Object Store (continued)
• Ozone attempts to scale to trillions of objects
• Ozone is built on a distributed metadata store.
• Avoids any single server becoming a bottleneck
• More parallelism possible in both data and metadata operations
• Built on well-tested components and well-understood protocols
• Raft for consensus
• Raft is a protocol for reaching consensus among a set of machines in an unreliable environment where machines and the network may fail.
• An off-the-shelf key-value store such as RocksDB
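To illustrate why key-value metadata scales more easily than a file-system tree, here is a toy sketch (not Ozone's actual code; the class and a plain dict stand in for a RocksDB-backed store): the whole object name is one flat key, so the metadata can be sharded across many servers instead of living in one NameNode's heap.

```python
# Illustrative sketch of object-store semantics: a flat key-value
# namespace with no directory tree to maintain. A dict stands in for
# an off-the-shelf key-value store such as RocksDB.

class TinyObjectStore:
    def __init__(self):
        self._meta = {}   # flat key -> object metadata

    def put(self, volume: str, bucket: str, key: str, size: int) -> None:
        # The full object name is a single flat key, trivially shardable.
        self._meta[f"/{volume}/{bucket}/{key}"] = {"size": size}

    def head(self, volume: str, bucket: str, key: str) -> dict:
        return self._meta[f"/{volume}/{bucket}/{key}"]

store = TinyObjectStore()
store.put("vol1", "logs", "2018/02/13/app.log", size=4096)
print(store.head("vol1", "logs", "2018/02/13/app.log"))
```

There are no rename, directory-listing, or hierarchical-permission operations to coordinate, which is the "simpler semantics" the slide refers to.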
- 11.
HDP 3.0 - HDFS Federation Enables Linear Scale
[Diagram: multiple NameNodes (NN-1 … NN-k), each owning a namespace (NS1 … NSk, plus a future namespace such as Ozone or another FS), all sitting above a common block-management layer of block pools (Block Pool 1 … k … n) on shared DataNodes (DN 1, DN 2, … DN m) – namespace and block storage are separated over common storage]
⬢ Extends HDFS to multiple volumes
– The volumes share the physical storage of the DataNodes
– You still get Hadoop data locality for compute
⬢ Optionally, can mount a Hadoop-compatible FS volume
– Commonly used in the cloud for cloud storage – S3, WASB, ADLS
– Use with the upcoming OzoneFS
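On the client side, a federated namespace is typically stitched together with a ViewFS mount table in `core-site.xml`. A sketch of what that looks like – the cluster name, hosts, and paths here are illustrative placeholders, not from the slide:

```xml
<!-- Illustrative core-site.xml fragment: a client-side ViewFS mount
     table presenting two federated namespaces as one file system.
     Cluster name and NameNode hosts are placeholders. -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./user</name>
  <value>hdfs://nn1.example.com:8020/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./data</name>
  <value>hdfs://nn2.example.com:8020/data</value>
</property>
```

Each `link` maps a client-visible path to a volume owned by one NameNode, which is how adding NameNodes scales the namespace linearly without changing application paths.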
- 13.
Built-in support for long-running services in YARN
→ A native YARN framework – YARN-4692
• Abstracts a common framework (similar to Slider) to support long-running services
• A more simplified API (to manage the service lifecycle)
• Better support for long-running services
→ Recognition of long-running services
• Affects the policies for preemption, container reservation, etc.
• Auto-restart of containers
• Containers for a long-running service are retried on the same node when they hold local state
→ Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
→ Dynamic container configuration
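The framework sketched above (YARN-4692) later surfaced in Hadoop 3.x as the YARN Service framework, where a long-running service is described declaratively by a JSON spec. A sketch along the lines of the "sleeper" example in the Hadoop documentation – service and component names are placeholders, and the exact schema belongs to the later YARN Service API, not to this deck:

```json
{
  "name": "sleeper-service",
  "components": [
    {
      "name": "sleeper",
      "number_of_containers": 2,
      "launch_command": "sleep 900000",
      "resource": {
        "cpus": 1,
        "memory": "256"
      }
    }
  ]
}
```

YARN then handles the lifecycle concerns listed above – auto-restart, placement retries, and rolling upgrades of the spec – instead of each framework reimplementing them as Slider did.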
- 14.
More Powerful YARN
→ Resource Isolation
– Resource isolation support for disk and network
• YARN-2619 (disk), YARN-2140 (network)
• Containers get a fair share of disk and network resources using cgroups
– Docker support in LinuxContainerExecutor
• YARN-3611
• Support for launching Docker containers alongside process containers
• Packaging and resource isolation
• Complements YARN's support for long-running services
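Enabling the Docker runtime in LinuxContainerExecutor is a NodeManager configuration change. A hedged sketch of the relevant `yarn-site.xml` fragment – values are placeholders, and the property set grew over the Hadoop 3.x line, so consult the release documentation for the exact names in a given version:

```xml
<!-- Illustrative yarn-site.xml fragment: allowing the Docker runtime
     alongside the default process runtime (YARN-3611). -->
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
  <value>host,bridge</value>
</property>
```

Individual jobs then opt in at submit time, typically via environment variables such as `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE`, which is what lets Docker and plain process containers coexist on the same cluster.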