2. EMC CONFIDENTIAL—INTERNAL USE ONLY
Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division
- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon
Phone: +49-176-34434460
E-Mail: Stefan.Radtke@emc.com
LinkedIn: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com
3.
System Availability
Uptime                    Downtime (per year)
99.999% (AKA 5 nines)     5.26 minutes
99.99% (AKA 4 nines)      52.6 minutes
99.5%                     1.83 days
99% (AKA 2 nines)         3.65 days
95%                       18.25 days
What is your data warehouse’s uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
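The downtime column follows from simple arithmetic: the fraction of time a system is allowed to be down, multiplied by the length of a year. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from the slides):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(uptime_percent):
    """Return the yearly downtime, in hours, permitted by an uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


for uptime in (99.999, 99.99, 99.5, 99.0, 95.0):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}% uptime: {hours * 60:.2f} minutes = {hours / 24:.2f} days per year")
```

Each additional nine divides the permitted downtime by ten, which is why the step from 99% to 99.99% is an operational change rather than a tuning exercise.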
4.
We have good Hadoop Outcomes
Smart Grid
Fraud / Broken Devices & Grid Traffic Projections
Fraud
Healthcare research
Genomes and Healthcare – BRCA
Connected Car - Tesla
5.
Hadoop Takes on DB-like Features
• Newly Added Features in Hadoop 3.0
– Erasure Coding (HDFS-EC / HDFS-7285) is being introduced to Hadoop
– Additional Standby NameNodes for increased resiliency (HDFS-6440)
• Future Features
– Random read support from Indexed NameNode (HDFS-8555)
– Disaster Recovery (HDFS-5442)
6.
So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database Features
AND
• IF Successful Outcomes are becoming more common...
Why do Hadoop operations and uptime / SLAs seem like such an afterthought on most clusters?
7.
KPIs
• Why do companies that have VERY successful Data Warehouses, ETL processes, and KPI Dashboards have so little of THOSE for their Hadoop instance, which is now generating all their Machine Learning and Data & Analytics?
8.
What can go wrong?
• Forbes: “…haven’t taken into account
some long-term or ongoing cost associated
with the project…”
• Information Week: “…Unanticipated
problems beyond the big data
technology…”
• Computerworld: “…there are enterprises
that underestimated the paradigm shift…”
9.
An Intervention
• Why does the concept of 99.99% uptime seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when is a backup copy of your primary analytics data no longer needed?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?
10.
Why wouldn’t you want:
• Two fully addressable clusters, located in separate geographies, with data replication
• Data re-silvering when additional capacity is added
• Complete fault tolerance in the environment, not just data / node redundancy, to allow four-nines availability
• Operational scale that allows 24 x 7 support
[Figure: nodes labeled EMPTY, FULL, and BALANCED]
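Why two replicated clusters help reach four nines can be shown with a short calculation. This is a sketch under a strong assumption that is not from the slides: the clusters fail independently and either one can serve the full workload.

```python
def combined_availability(single_cluster_availability, clusters=2):
    """Availability when any one of several independent clusters can serve requests.

    Assumption: failures are independent and failover is instantaneous.
    """
    all_down = (1 - single_cluster_availability) ** clusters
    return 1 - all_down


# Two clusters at 99% each
print(f"{combined_availability(0.99, 2):.4%}")
```

In practice shared dependencies (network, power, operator error, replication lag) break the independence assumption, which is why the slide asks for complete fault tolerance in the environment and not only data or node redundancy.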
11.
What is my Idea - 1
• Separation of compute and storage.
– Why do you think cloud Hadoop is able to offer better SLAs than on-premises Hadoop? It isn’t because of a ton of single-point-of-failure compute boxes. They separate compute and storage.
• Look at Infrastructure / Big Data as a Service centralization
– Instead of trying to staff 25 Hadoop clusters for 24 x 7, centralize the team and provide QoS back to the applications
12.
Data Gravity
• Data sets get bigger over time, and moving them becomes
increasingly difficult
– This leads to switching costs and lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central – build around it
– Applications tend to migrate toward the data
– Apply advanced analytics to the data “in-place”
14.
THE PROBLEM OF DATA MOVEMENT
• To get statistically relevant results, a typical minimal required data set is about 100 TB.
• That’s also the recommended minimal Hadoop cluster size.
• Copying 100 TB over a dedicated 10 GbE link takes about 24 hours.
You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics)
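The copy-time estimate can be checked with back-of-the-envelope arithmetic. A minimal sketch; the `efficiency` parameter is an assumption added here to model protocol overhead and is not a figure from the slides:

```python
def transfer_hours(data_terabytes, link_gbit_per_s, efficiency=1.0):
    """Hours needed to copy a data set over a network link.

    efficiency: assumed fraction of the line rate actually achieved (1.0 = ideal).
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbit/s = 1e9 bits/s.
    """
    data_bits = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_gbit_per_s * 1e9 * efficiency
    return data_bits / usable_bits_per_second / 3600


print(f"ideal line rate: {transfer_hours(100, 10):.1f} h")
print(f"at 90% efficiency: {transfer_hours(100, 10, efficiency=0.9):.1f} h")
```

Even at the ideal line rate the copy takes roughly 22 hours, so the figure of about 24 hours is consistent once modest overhead is included.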
15.
EMC DATA LAKE
[Diagram: Isilon Data Lake with attached servers and applications: Finance, Marketing, Operations, Sales; ERP, SCM, CRM; Analytics + Mobile Applications]
17.
Isilon Data Lake Architecture
[Diagram: clients on the LAN (GB/10GB Ethernet) connect to Isilon nodes (SAS); the nodes are interconnected via InfiniBand; scale out]
Scale-out Data Lake
OneFS integrates RAID, Volume Manager and Filesystem.
Uses internal disks and spans a single filesystem across them
Development started in the 2000s
Extremely mature, based on FreeBSD
Supports many access protocols
…
18.
HDFS Implementation as a Protocol
• Multi-threaded daemon runs on all nodes
– Services both NN and DN protocols
– Translates HDFS RPCs to POSIX system calls
– Stateless, underlying FS handles coherency
[Diagram: OneFS node; an isi_hdfs_d thread passes each request through the VFS to OneFS as a syscall and returns the response]
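The idea of translating HDFS RPCs into POSIX system calls without keeping state can be illustrated with a toy dispatcher. This is a sketch only and not how isi_hdfs_d is implemented: `handle_request` is a hypothetical name, and the operation names are merely modelled on HDFS client operations.

```python
import os
import tempfile


def handle_request(root, op, path):
    """Serve one HDFS-style request using POSIX-style calls.

    Hypothetical helper: no state is kept between calls, so consistency
    is left entirely to the underlying file system.
    """
    full_path = os.path.join(root, path.lstrip("/"))
    if op == "mkdirs":
        os.makedirs(full_path, exist_ok=True)   # mkdir(2)
        return True
    if op == "getFileInfo":
        info = os.stat(full_path)               # stat(2)
        return {"length": info.st_size, "is_dir": os.path.isdir(full_path)}
    if op == "getListing":
        return sorted(os.listdir(full_path))    # readdir(3)
    if op == "delete":
        os.remove(full_path)                    # unlink(2)
        return True
    raise ValueError(f"unsupported operation: {op}")


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    handle_request(demo_root, "mkdirs", "/data/logs")
    print(handle_request(demo_root, "getListing", "/data"))
```

Because the handler holds no state, any node running it could answer any request, which is the property the slide attributes to the daemon.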
19.
HDFS IMPLEMENTED LIKE A NAS PROTOCOL
OneFS runs a daemon that speaks NameNode and DataNode natively
[Diagram: OneFS Clustered FileSystem; every OneFS node acts as NameNode and DataNode; a DFSClient on a Hadoop node issues:]
1) Request(“/file”)
2) Response (block locations)
3) GetBlock(block)
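The three-step read flow can be mimicked with an in-memory toy. It is a sketch under stated assumptions: `ClusterSketch` and `read_file` are hypothetical names, the round-robin choice of node is invented for illustration, and no real HDFS wire protocol is involved.

```python
import itertools


class ClusterSketch:
    """Toy stand-in for a clustered file system shared by all nodes."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.files = {}    # path -> ordered list of block ids
        self.blocks = {}   # block id -> bytes
        self._next_node = itertools.cycle(self.node_names)

    def write(self, path, data, block_size=4):
        block_ids = []
        for offset in range(0, len(data), block_size):
            block_id = f"{path}#{offset // block_size}"
            self.blocks[block_id] = data[offset:offset + block_size]
            block_ids.append(block_id)
        self.files[path] = block_ids

    def get_block_locations(self, path):
        """NameNode role: since every node sees all data, any node may be named."""
        return [(block_id, next(self._next_node)) for block_id in self.files[path]]

    def get_block(self, node_name, block_id):
        """DataNode role: the addressed node reads from the shared file system."""
        if node_name not in self.node_names:
            raise KeyError(node_name)
        return self.blocks[block_id]


def read_file(cluster, path):
    locations = cluster.get_block_locations(path)        # steps 1 and 2
    return b"".join(cluster.get_block(node, block_id)    # step 3
                    for block_id, node in locations)
```

A short usage example: create `ClusterSketch(["node1", "node2"])`, call `write("/file", b"some data")`, then `read_file(cluster, "/file")` returns the original bytes regardless of which node served each block.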
23.
CLOUDPOOLS
CLOUD ENABLED DATA LAKE
[Diagram labels: data center, cloud provider, apps & users, access time]
24.
Parallel Replication
Designed ground-up for scale-out storage
Aggregate throughput scales with capacity
Maintain consistent RPO over growing data sets
Underlying FS knowledge
– Snapshot integration
– Block-level deltas
– Rich meta-data transfer
Automated Data Failover/Failback
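What a block-level delta means can be shown with a small sketch: compare two versions of a file block by block and ship only the blocks that changed. This illustrates the general technique, not the product's replication engine; the function names and the tiny block size are assumptions for the example.

```python
import hashlib


def block_hashes(data, block_size):
    """Hash each fixed-size block of a byte string."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old, new, block_size=4):
    """Return {block index: new block bytes} for blocks that differ or are new."""
    old_hashes = block_hashes(old, block_size)
    delta = {}
    for index, new_hash in enumerate(block_hashes(new, block_size)):
        if index >= len(old_hashes) or old_hashes[index] != new_hash:
            delta[index] = new[index * block_size:(index + 1) * block_size]
    return delta


def apply_delta(old, delta, new_length, block_size=4):
    """Rebuild the new version on the target from the old copy plus the delta."""
    blocks = [old[i:i + block_size] for i in range(0, len(old), block_size)]
    for index in sorted(delta):
        if index < len(blocks):
            blocks[index] = delta[index]
        else:
            blocks.append(delta[index])
    return b"".join(blocks)[:new_length]
```

Only the changed blocks cross the wire, so the replication cost tracks the change rate rather than the total data set size, which is what keeps the RPO consistent as data grows.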
25.
Storage Considerations
STANDARD HADOOP CLUSTER
• 100 Nodes, Compute + DAS
• 24 TB per Node
• /3 for Hadoop Copies
• 800 TB Usable, but rarely achieved
• 5+ Cabinets
• Spill space for ingestion and extraction
HADOOP USING EMC ISILON DATA LAKE
• 20 Nodes Compute + 800 TB Isilon
• Single Copy with Erasure Coding
• 800 TB Usable
• 1 Cabinet
• It is NAS
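The 800 TB figure for the standard cluster is raw capacity divided by the number of copies. A minimal sketch of that arithmetic follows; the function names are illustrative, and the 80% utilization used for the erasure-coded case is an assumption borrowed from the "over 80% storage utilization" claim elsewhere in this deck, not a sizing rule.

```python
def usable_tb_with_replication(nodes, raw_tb_per_node, copies=3):
    """Usable capacity when every block is stored `copies` times."""
    return nodes * raw_tb_per_node / copies


def raw_tb_needed(usable_tb, utilization):
    """Raw capacity needed for a usable target at a given storage utilization."""
    return usable_tb / utilization


print(usable_tb_with_replication(nodes=100, raw_tb_per_node=24))  # 3 copies
print(raw_tb_needed(usable_tb=800, utilization=0.8))              # assumed 80%
```

Under these assumptions, three-way replication needs 2,400 TB raw for 800 TB usable, while an erasure-coded single copy at 80% utilization needs about 1,000 TB raw.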
26.
What is my Idea - 2
• Build a fully functioning cost model that includes all items you think are “free” but whose costs stop when you change the architecture.
– Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (e.g., 24 x 7 support for each cluster, all-in costs that appear free but are sunk)
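The shape of such a cost model can be sketched in a few lines. Everything here is hypothetical: the function names, the cost categories, and the numbers are placeholders in arbitrary units, chosen only to show where per-cluster support costs enter the comparison.

```python
def project_funded_cost(clusters, hardware_per_cluster, support_per_cluster):
    """Total cost when every project runs and staffs its own cluster."""
    return clusters * (hardware_per_cluster + support_per_cluster)


def centralized_cost(clusters, hardware_per_cluster, shared_support):
    """Total cost when one central team supports all clusters."""
    return clusters * hardware_per_cluster + shared_support


# Placeholder inputs in arbitrary units, not real figures
decentral = project_funded_cost(clusters=25, hardware_per_cluster=10, support_per_cluster=4)
central = centralized_cost(clusters=25, hardware_per_cluster=10, shared_support=20)
print(decentral, central)
```

The point of the model is that the support term only becomes visible when it is counted per cluster; with project-based funding it tends to be hidden in each project's budget.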
27.
What is my Idea - 3
• Think about “build it all yourself” vs. “buy”
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed, and tested solutions reduces complexity and increases reliability.
28.
EMC BIG DATA PORTFOLIO
• Data Lake
• Data Lake Extensions
• Cloud Enabled
• Vblock
• VxRack
• VxRail
• Federation Business Data Lake
29.
HIGH PERFORMANCE
PREDICTABLE, LOW LATENCY
[Diagram: three I/O paths compared]
Traditional: Hadoop (HDFS) → Kernel (Filesystem, Buffer Cache, Device Driver) → Motherboard (PCIe, SATA Controller) → Disk; HDD 10 ms
PCIe SSD: Hadoop (HDFS) → Kernel (Filesystem, Buffer Cache, Device Driver) → Motherboard (PCIe) → PCIe SSD; SSD 1000-2000 µs
DSSD: HDFS → PCIe → DSSD; <100 µs
DSSD Hadoop Plugin accesses flash directly
• 10X Throughput
• 1/13th Latency
• No Application Changes Required
30.
[Diagram: EMC Business Data Lake stack]
PIVOTAL BIG DATA SUITE
VMWARE VCLOUD SUITE
EMC DATA LAKE FOUNDATION: ISILON + ECS
VCE VBLOCK | XTREMIO | DATA DOMAIN
OPEN ANALYTICS TOOLBOX
DATA AND ANALYTICS CATALOG
ADVANCED ANALYTICS
APPLICATIONS AT SCALE
DATA PROCESSING
GREENPLUM DATABASE
HAWQ
SPRING XD
PIVOTAL HD
SPARK
REDIS
RABBITMQ
GEMFIRE
BDS ON PIVOTAL CLOUD FOUNDRY
HADOOP
PLATFORM MANAGER
DATA GOVERNOR
DATA MANAGER
INGEST MANAGER
ANALYTICS MANAGER
EMC Business Data Lake
See demos at http://www.fbdldemo.com/
31.
Thursday, April 14th, 15:00 UTC
Watch out for:
• Hadoop Everywhere: Geo-Distributed Storage for Big Data
Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC
33.
A Remark on data locality
• U.C. Berkeley’s AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC.
• Gartner has declared Hadoop dead due to its limits
• Hadoop will only grow, and dependency on it will increase going forward.
• A catalyst may come by the next time I see you, when uptime for Hadoop is your main concern.
34.
Simple to manage
Single file system, single volume, global namespace
Massively scalable
Scales from 16 TB to over 50 PB in a single cluster
200GB/s throughput, 3.75M IOPS
Unmatched efficiency
Over 80% storage utilization, automated tiering and SmartDedupe
Enterprise data protection
Efficient backup and disaster recovery, and N+1 through N+4 redundancy
Robust security and compliance options
RBAC, Access Zones, WORM data security, File System Auditing
Data At Rest Encryption with SEDs, STIG hardening
CAC/PIV Smartcard authentication, FIPS OpenSSL support
Operational flexibility
Multi-protocol support including NFS, SMB, HTTP, FTP and HDFS
Object and Cloud computing including OpenStack Swift
Isilon Scale-Out NAS
35.
Geo-Scale
Geo-Replicated and Distributed to multiple locations
Massively scalable
Scales to billions of objects in a single namespace
Support for all file sizes
Support for individual files of any size.
Multi-Tenant
Efficient backup and disaster recovery, and N+1 through N+4 redundancy
HDFS Compatible
Hortonworks Certified HDFS Compatible File System
Swift Compatible
Natively supports OpenStack storage
Native Cloud Interface
Natively works with existing cloud protocols like S3 and Azure.
Elastic Cloud Storage (ECS)