The document discusses Hadoop and its capabilities for real-time analytics. It describes how Hadoop provides massive storage and parallel processing. It also explains the key components of Hadoop including HDFS for storage and YARN as the data operating system. The document outlines use cases for real-time analytics in various industries and demonstrates a real-time analytics application for monitoring truck driver behavior.
So, where does Hadoop fit in the data center? This picture here is a very simple depiction of the typical data architecture in any organization.
- There are sources of data: ERP, CRM, other digital sources
- That data is then stored in a data system: a data warehouse, MPP system, etc.
- Then an application of some kind accesses that data system: a packaged application such as Excel or Tableau, a custom application written by a developer, or even another business application
This has been the foundation of the data center for years. We have always had challenges with this architecture; however, we are now seeing increased pressure to modify and improve this basic blueprint because:
A) this approach created silos of data, making it difficult to either share the data or get a single view of it
B) these systems are costly to scale
C) and they are also coupled to a very static schema. Changes to a data model are difficult if not impossible, which limits flexibility and insight.
Finally, as we digitize the world around us, NEW types of data such as clickstream and machine sensor data are emerging and growing at exponential rates. We are all becoming data-driven organizations.
In fact, the sheer volume of data is projected to grow 20X between 2013 and 2020, which puts tremendous pressure on this architecture. The old architecture is neither technologically nor commercially practical.
YARN is the element that enables the modern data architecture: it turns Hadoop into a truly multi-purpose data platform, with batch, interactive, and real-time workloads all running in a single cluster.
It enables users to:
- Create a central cluster in which data can be stored and then accessed using a range of processing engines: batch, interactive, real-time.
- It is akin to the journey with virtualization: from a single virtual server to a pool of virtual infrastructure.
It is the architectural center of Hadoop
- it provides the data operating system around which the core enterprise capabilities of security, governance and operations can be integrated
- It is the integration point into which all data processing engines integrate – from the open source community but also from the commercial vendor ecosystem
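As a concrete sketch, the "multiple engines, one cluster" idea above might look like this from an operator's shell. The cluster, file paths, class names, and job names here are illustrative assumptions, not part of the original material:

```shell
# Batch: a classic MapReduce job submitted to the YARN cluster
# (hadoop-mapreduce-examples ships with the Hadoop distribution;
# input/output paths are hypothetical)
yarn jar hadoop-mapreduce-examples.jar wordcount \
    /data/clickstream /results/wordcount

# Interactive: a Hive query; with Hive running on Tez, the query
# executes as a DAG of containers inside the same YARN cluster
hive -e "SELECT driver_id, COUNT(*) FROM events GROUP BY driver_id"

# Real-time / streaming: a Spark application whose executors are
# YARN containers running alongside the batch and interactive work
# (the class and jar are placeholders for a custom application)
spark-submit --master yarn --deploy-mode cluster \
    --class com.example.TruckMonitor truck-monitor.jar
```

All three engines request containers from the same YARN ResourceManager, which is what lets a single cluster serve batch, interactive, and real-time workloads at once rather than maintaining a separate silo per engine.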