1. INTEL CONFIDENTIAL, FOR INTERNAL USE ONLY
Evolving Hadoop for the Data Society
Open Platform for Next-Gen Analytics
vin.sharma
strategy & marketing
open source x open data
3. Virtuous cycle of data-driven innovation
[Diagram: cycle connecting CLOUD, CLIENTS, and INTELLIGENT SYSTEMS, with arrows labeled "Richer data to analyze", "Richer user experiences", and "Richer data from devices"]
• 2.8 zettabytes of data generated worldwide in 2012 (1)
• 40 zettabytes of data will be generated worldwide in 2020 (1)
Sources: (1) IDC Digital Universe 2020, (2) IDC
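The two IDC figures on this slide imply a growth rate that the slide does not state. The sketch below derives it, assuming constant year-over-year growth between 2012 and 2020; that assumption is for illustration and is not a claim from the source.

```python
# Implied compound annual growth rate between the two data-volume figures.
start_zb, start_year = 2.8, 2012    # from the slide
end_zb, end_year = 40.0, 2020       # from the slide


def cagr(start: float, end: float, years: int) -> float:
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


growth_factor = end_zb / start_zb
rate = cagr(start_zb, end_zb, end_year - start_year)

print(f"overall growth: {growth_factor:.1f}x")
print(f"implied annual growth: {rate:.1%}")
```

Under that assumption the data volume grows roughly 14-fold overall, or a little under 40% per year.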
4. Democratize data analysis
Enhance scientific understanding, drive innovation,
and accelerate medical cures
Create new data-driven business models, reduce
resource waste, improve organizational processes
Increase public safety with smart traffic and
improve energy efficiency with smart grids
6. Data-Intensive Discovery
[Diagram: three layers (Data Value, Data Analysis, Data Management) spanning three domains]
Domains: Life Sciences, Physical Sciences, Social Sciences
Data: Genome Data, EMR, Clinical Trials, Sensor Data, Images, Sim Data, Census Data, Text, A/V, Surveys
Value and analysis labels: Drug Discovery, Treatment Optimization, Hypothesis Formation, Modeling & Prediction, Astronomy, Particle Physics, Public Policy, Trend Analysis
7. Data-Intensive Discovery: Genomics
Value
• Enable researchers to discover biomarkers and drug targets by correlating genomic data sets
• 90% gain in throughput; 6X data compression
Analytics
• Provide curated data sets with pre-computed analysis (classification, correlation, biomarkers)
• Provide APIs for applications to combine and analyze public and private data sets
Data Management
• Use Hive and Hadoop for query and search
• Dynamically partition and scale HBase
• 10-node cluster / Intel Xeon E5 processors
• 10GbE network
[Diagram label: Intel Distribution]
8. Data-Driven Business
[Diagram: three layers (Data Value, Data Analysis, Data Management) spanning three industries]
Industries: Telco, Retail, FSI
Data: Content, CDR, IP Traffic, Shop, Product, Customer Behavior (twice), Transactions
Value and analysis labels: Customer Service, Network Optimization, Product Innovation, Market Insight, Business Efficiency, Behavior Modeling, Fraud Analytics, Client Engagement
9. Data-Driven Business: Customer Service
Value
• 300 million wireless subscribers
• Enable subscriber access to billing data
• 30X gain in performance; lower TCO
Analytics
• Provides real-time retrieval of 6 months of data
• Supports new BI with 15 types of queries
• Enables targeted ad serving and promotions
Data Management
• Use Hadoop/HBase for search and analysis
• 30 TB/month of billing data
• 300K reads/second; 800K inserts/second
• 133-node cluster / Intel Xeon E5 processors
[Diagram labels: CDR, Subscriber Self Service]
10. Data-Rich Communities
[Diagram: three layers (Data Value, Data Analysis, Data Management) spanning three domains]
Domains: Utilities, Police & Security, Government Services
Data: Meter Data, Infrastructure Data, Monitor Data, Behavior, ID, Demographics, ID Programs
Value and analysis labels: Customer Service, Network Optimization, Smart Grids, Safe Streets, Crime Detection, Crime Prevention, Service Agility, Waste & Fraud Analysis
11. Data-Rich Communities: Smart City
Value
• Enforce traffic laws and detect license fraud
• Monitor and predict traffic patterns
• In a city of 31 million people
Analytics
• Detect traffic law violations automatically
• Detect driver license fraud by data mining
• Forecast traffic with predictive analytics
Data Management
• 30,000 cameras
• 6Mb/s stream rate per camera
• 15 PB of images in active use
• 2 billion records in HBase
[Diagram labels: Detection, Prevention; Regional, Local]
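The camera figures on this slide can be turned into a rough ingest estimate. The sketch below assumes that "6Mb/s" means 6 megabits per second, that every camera streams continuously, and that decimal units apply (1 PB = 10^15 bytes). None of these assumptions is stated on the slide, so the results are back-of-the-envelope only.

```python
# Rough aggregate ingest for the camera network described on the slide.
CAMERAS = 30_000              # from the slide
STREAM_MBIT_PER_S = 6         # assumption: "6Mb/s" is megabits per second
ACTIVE_IMAGES_PB = 15         # from the slide
SECONDS_PER_DAY = 86_400

aggregate_gbit_per_s = CAMERAS * STREAM_MBIT_PER_S / 1_000
aggregate_gbyte_per_s = aggregate_gbit_per_s / 8
ingest_pb_per_day = aggregate_gbyte_per_s * SECONDS_PER_DAY / 1_000_000

# How many days of continuous raw streaming 15 PB would correspond to.
days_of_footage = ACTIVE_IMAGES_PB / ingest_pb_per_day

print(f"aggregate stream: {aggregate_gbit_per_s:.0f} Gb/s")
print(f"ingest: {ingest_pb_per_day:.3f} PB/day")
print(f"15 PB is about {days_of_footage:.1f} days of raw stream")
```

If the stored images are sampled frames rather than the full streams, the 15 PB would cover a much longer period than this estimate suggests.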
14. At the intersection of transformative forces
HPC: Enabling exascale computing on massive data sets
Cloud: Helping enterprises build open interoperable clouds
Open Source: Contributing code and fostering ecosystem
15. Intel® Distribution for Apache Hadoop* software
* Other names and brands may be claimed as the property of others.
Hardware-enhanced performance & security
Enables partner innovation in analytics
Strengthens Apache Hadoop* ecosystem
16. Intel® Distribution for Apache Hadoop* software, version 3.x
All external names and brands are claimed as the property of others.
Intel® Manager for Apache Hadoop software: Deployment, Configuration, Monitoring, Alerts, and Security
Components:
• HDFS 2.0.3: Hadoop Distributed File System
• YARN (MRv2): Distributed Processing Framework
• HBase 0.96.1: Columnar Store
• ZooKeeper 3.4.5: Coordination
• Flume 1.3.0: Log Collector
• Sqoop 1.4.1: Data Exchange
• Pig 0.9.2: Scripting
• Hive 0.10.0: SQL Query
• Oozie 3.3.0: Workflow
• Mahout 0.7: Machine Learning
• HCatalog: Metadata
• Connectors: Ingest, Analysis, Visual
Legend: Intel proprietary; Intel enhancements contributed to open source; open source components included without change
17. Intel® Distribution for Apache Hadoop* software, version 2.3
• File-based encryption in HDFS
• Up to 20x faster decryption with AES-NI*
• Role-based access control for Hadoop services
• Up to 8.5X faster Hive queries using HBase co-processor
• Adaptive data replication in HDFS and HBase
• Optimized for SSD with Cache Acceleration Software
• Integrated text search with Lucene
• Simplified deployment & comprehensive monitoring
• Automated configuration with Intel® Active Tuner
• Deployment of HBase across multiple datacenters
• Detailed profiling of Hadoop jobs
• Simplified design of HBase schemas (+ in 2.4)
• REST APIs for deployment and management (+ in 2.4)
*Based on internal testing
Hardware-enhanced Security
Optimized Performance
Simplified Management
18. Intel® Distribution for Apache Hadoop* software, version 3.0
• Cell-level ACLs in HBase
• Encryption support in Hive and Pig
• Secure inter-node communication with SSL
• Compression and CRC with SSE 4.2
• Up to 8.5X faster Hive queries using HBase co-processor
• Adaptive replication in HDFS and HBase
• Snapshot support in Hadoop
• SNMP support for monitoring
*Based on internal testing
• Hadoop 2.0.3 and YARN support
• Lustre support
• GlusterFS support
• HCatalog support
21. Intel Expressway protects Hadoop APIs
[Diagram labels: Authn, RBAC, Encryption, Containment]
• Enforces consistent security policies across all Hadoop services
• Serves as a trusted proxy to Hadoop, HBase, and WebHDFS APIs
• Complies with Common Criteria EAL4+, HSM, FIPS 140-2 certifications
• Deploys as software, virtual appliance, or hardware appliance
[Diagram labels: Firewall; REST APIs: HCatalog, Stargate, WebHDFS]
22. Kerberos authenticates Hadoop services
[Diagram labels: Encryption, Containment, Firewall, APIs, Authentication; KDC; steps numbered 1 to 5: request ticket, send service ticket, request service, send response, validate ticket]
Intel Manager
• Wizard enables setup of secure cluster with encrypted key exchange
• Manager generates principal and keytab for Hadoop services
• Manager enables batch upload of keytab files
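The ticket exchange named in this slide's diagram labels can be illustrated with a toy model. This is not the Kerberos wire protocol. It omits the ticket-granting step and uses an HMAC tag as a stand-in for ticket encryption. The class names, principal names, and demo key are hypothetical; the shared key plays the role that a keytab entry plays for a real service.

```python
import hashlib
import hmac
import json
import time


def _seal(key: bytes, body: bytes) -> str:
    """Integrity tag over the ticket body (stand-in for Kerberos encryption)."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


class KDC:
    """Toy key distribution center that holds each service's secret key."""

    def __init__(self):
        self._service_keys = {}

    def register_service(self, principal: str, key: bytes) -> None:
        self._service_keys[principal] = key

    def issue_ticket(self, client: str, service: str, lifetime_s: int = 300) -> dict:
        # "request ticket" / "send service ticket"
        claims = {"client": client, "service": service,
                  "expires": time.time() + lifetime_s}
        body = json.dumps(claims, sort_keys=True).encode()
        return {"body": body, "tag": _seal(self._service_keys[service], body)}


class HadoopService:
    """Toy service that accepts only tickets sealed with its own key."""

    def __init__(self, principal: str, key: bytes):
        self.principal = principal
        self._key = key

    def handle(self, ticket: dict, request: str) -> str:
        # "validate ticket"
        expected = _seal(self._key, ticket["body"])
        if not hmac.compare_digest(expected, ticket["tag"]):
            raise PermissionError("ticket was not issued by the KDC")
        claims = json.loads(ticket["body"])
        if claims["service"] != self.principal or claims["expires"] < time.time():
            raise PermissionError("ticket is for another service or has expired")
        # "send response"
        return f"{request}: ok for {claims['client']}"


key = b"demo-key-standing-in-for-a-keytab-entry"   # hypothetical
kdc = KDC()
kdc.register_service("hdfs/node1@EXAMPLE.COM", key)
service = HadoopService("hdfs/node1@EXAMPLE.COM", key)

ticket = kdc.issue_ticket("alice@EXAMPLE.COM", "hdfs/node1@EXAMPLE.COM")
print(service.handle(ticket, "list /data"))
```

The service never contacts the KDC during the request: it trusts the ticket because only the KDC and the service hold the key that seals it.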
23. Manager simplifies role-based access control
[Diagram labels: Firewall, AuthZ]
• File, table, and service-level controls
• Intel Manager pushes ACLs to each node
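One way to picture the two bullets above is a central rule set that a manager copies to every node, with each node checking requests locally. The sketch below is a hypothetical model of that idea. The class names, roles, and resources are invented for illustration and do not describe Intel Manager's actual API or ACL format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str
    resource_type: str      # "file", "table", or "service"
    resource: str
    actions: frozenset


class Node:
    """A cluster node that enforces whatever ACL it was last sent."""

    def __init__(self, name: str):
        self.name = name
        self.acl = []

    def is_allowed(self, roles, resource_type, resource, action) -> bool:
        return any(
            rule.role in roles
            and rule.resource_type == resource_type
            and rule.resource == resource
            and action in rule.actions
            for rule in self.acl
        )


class Manager:
    """Holds the rule set centrally and pushes a copy to each node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def push_acls(self, rules) -> None:
        for node in self.nodes:
            node.acl = list(rules)


nodes = [Node("node1"), Node("node2")]
manager = Manager(nodes)
manager.push_acls([
    Rule("analyst", "table", "billing", frozenset({"read"})),
    Rule("admin", "service", "hbase", frozenset({"start", "stop"})),
    Rule("etl", "file", "/data/raw", frozenset({"read", "write"})),
])

print(nodes[1].is_allowed({"analyst"}, "table", "billing", "read"))    # True
print(nodes[1].is_allowed({"analyst"}, "table", "billing", "write"))   # False
```

Because every node holds the same rules after a push, an access decision does not depend on which node receives the request.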
24. Intel Distribution provides HDFS encryption
[Diagram labels: Firewall, RBAC]
• Extends compression codec into crypto codec
• Provides an abstract API for general use
[Diagram: MapReduce pipeline over HDFS: RecordReader, Map, Combiner, Partitioner, Local Merge & Sort, Reduce, RecordWriter; stage annotations: Decrypt, Encrypt, Derivative Encrypt, Derivative Decrypt]
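The idea of extending a compression codec into a crypto codec can be sketched as two implementations of one abstract interface, so that reader and writer code does not care which one it receives. The interface and class names below are hypothetical and are not Hadoop's or Intel's API. The cipher is a toy SHA-256 keystream XOR used only because it needs no third-party library; it is not AES and is not secure.

```python
import hashlib
import zlib
from abc import ABC, abstractmethod


class Codec(ABC):
    """Abstract transform applied when records are written and read."""

    @abstractmethod
    def encode(self, data: bytes) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


class ZlibCodec(Codec):
    """Ordinary compression codec."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class ToyCryptoCodec(Codec):
    """Crypto codec behind the same interface (toy cipher, NOT AES)."""

    def __init__(self, key: bytes):
        self._key = key

    def _keystream(self, length: int) -> bytes:
        out = bytearray()
        counter = 0
        while len(out) < length:
            out += hashlib.sha256(self._key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(out[:length])

    def encode(self, data: bytes) -> bytes:
        return bytes(a ^ b for a, b in zip(data, self._keystream(len(data))))

    def decode(self, data: bytes) -> bytes:
        return self.encode(data)    # XOR with the same keystream inverts it


def write_record(codec: Codec, record: bytes) -> bytes:
    """Stand-in for a record writer: encodes on the way to storage."""
    return codec.encode(record)


def read_record(codec: Codec, stored: bytes) -> bytes:
    """Stand-in for a record reader: decodes on the way out of storage."""
    return codec.decode(stored)


record = b"subscriber=42,amount=19.99"
for codec in (ZlibCodec(), ToyCryptoCodec(b"demo key")):
    stored = write_record(codec, record)
    print(type(codec).__name__, read_record(codec, stored) == record)
```

The pipeline functions take any `Codec`, which is the property the "abstract API for general use" bullet points at: encryption plugs in where compression already does.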
25. Intel AES-NI accelerates decryption 20x
AES Encryption, speed (MB/s)
             64k    4k    1k
AES-NI       460   457   454
No AES-NI     87    87    86
AES Decryption, speed (MB/s)
             64k    4k    1k
AES-NI      1266  1259  1253
No AES-NI     64    63    63
Callouts: 20X (decryption), 6X (encryption)
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark*
and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance
of that product when combined with other products. For more information go to http://www.intel.com/performance.
• OpenSSL 1.0.1c optimized to use Intel AES-NI (7 math functions in processor accelerate AES)
• Intel Distribution crypto framework uses OpenSSL 1.0.1c
• Patch and design document released to open source (JIRA HADOOP-9331)
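The speedup callouts on this slide can be checked against its own throughput figures. The sketch below divides the AES-NI speed by the non-AES-NI speed for each of the three column sizes; the numbers are copied from the slide, and the variable names are illustrative.

```python
# Throughput in MB/s, copied from the slide; columns are 64k, 4k, 1k.
SIZES = ["64k", "4k", "1k"]
ENCRYPT = {"AES-NI": [460, 457, 454], "No AES-NI": [87, 87, 86]}
DECRYPT = {"AES-NI": [1266, 1259, 1253], "No AES-NI": [64, 63, 63]}


def speedups(results: dict) -> list:
    """Ratio of AES-NI to non-AES-NI throughput for each column."""
    return [fast / slow
            for fast, slow in zip(results["AES-NI"], results["No AES-NI"])]


enc_speedups = speedups(ENCRYPT)
dec_speedups = speedups(DECRYPT)

for size, enc, dec in zip(SIZES, enc_speedups, dec_speedups):
    print(f"{size}: encryption {enc:.1f}x, decryption {dec:.1f}x")
```

The ratios come out at about 5.3x for encryption and just under 20x for decryption, compared with the slide's 6X and 20X callouts.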
26. Learn more about Intel and Hadoop
• Unique insights that help you tune, secure, and manage your deployment, in addition to essential understanding of Apache Hadoop
• Distilled from years of Intel experience in deploying and optimizing Apache Hadoop and HBase for enterprises
• Based on Intel expertise in optimizing the full Hadoop stack, from Hive on Hadoop through Java to Linux on x86 hardware
http://hadoop.intel.com
http://www.intel.com/bigdata
Intel Training and Certification Case Studies and Resources
28. Savanna: Hadoop on OpenStack
Ilya Elterman
Senior Director Cloud Services
29. Hadoop on OpenStack Use Cases
• Dev and QA teams: fast cluster provisioning
• Data scientists and analysts: an API to run analytic jobs, with infrastructure provisioning happening under the hood
• Administrators: centralized cluster management and monitoring
30. Savanna Key Principles
The goal is to create a native OpenStack component to provision and operate Hadoop clusters on top of OpenStack. Key characteristics:
• Open source
• Native to OpenStack
• Support for different Hadoop distributions
• Makes resources dedicated to the IaaS cloud available for Hadoop workloads
32. Savanna Roadmap
Phase 1 – Completed, April 13th
Basic cluster provisioning with “pre-built” images
Phase 2 – In Progress, July 15th
Pluggable mechanism of integration with vendor tooling
and cluster operations support
Phase 3 – Scoping, 2-3 months
"Analytics as a service": job execution framework, support for
different scripting languages
33. Learn more about Savanna
• All code and documentation open source
• Latest version 0.1.2 from 05/13
• Launchpad home page
o https://launchpad.net/savanna
• Code on stackforge
o Integrated with OpenStack CI/CD
o https://github.com/stackforge/savanna
• Active community
o https://lists.launchpad.net/savanna-all/