The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
5. Every Organization Can Gain Insights
• 60% are generating revenue from data
• 69% are starting new business units developing data-related products / services
• 83% have used data to make existing products / services more profitable
Every organization is a data organization.
Source: "The Business of Data", Economist Intelligence report, published Jan 2016
6. The Data Lake: Bringing Compute to Data
[Diagram: sources (ERP, CRM, RDBMS, machines, files, images, video, logs, clickstreams, external data sources) feed the data lake, consolidating EDWs, marts, storage, search, servers, documents, and archives]
1. Active archive
• Full fidelity original data
• Indefinite time, any source
• Lowest cost storage
2. Data management, transformations
• One source of data for all analytics
• Persisted state of transformed data
• Significantly faster & cheaper
3. Self-service exploratory BI
• Simple search + BI tools
• "Schema on read" agility
• Reduce BI user backlog requests
4. Multi-workload analytic platform
• Bring applications to data
• Combine different workloads on common data (i.e. SQL + Search)
• True BI agility
7. TO SUCCEED, SIMPLIFY TECHNOLOGY SO YOU CAN SHIFT FOCUS TO BUSINESS OUTCOMES
KEY CAPABILITIES TO LOOK FOR IN A COMPREHENSIVE BIG DATA SOLUTION:
• INGEST: Capture data from a wide range of sources, traditional and new
• STORE: Store everything in one environment for cross-data analysis
• ANALYZE: Use advanced algorithms to discover new, predictive patterns
• SURFACE: Share insights with business domain experts
• ACT: Build data-driven applications to meet business needs
9. The Data Lake High-Level Vision
Self-Service BI and Analytics
• Business-led, cross-functional methodology focused on short, iterative release cycles
• Functional distinction between Data Preparation (IT) and Data Usage (Business)
• Enabling on-demand services: BI and analytics sandboxes, tools, and data
Data as a Service
• The provisioning of data and services to the business independent of data end usage
• A key foundation of Self-Service BI (Data Preparation)
• Services can include publication, profiling, archiving, metadata, alerts, and notifications
Business Data Lake
• Alternative to traditional data warehousing focused on agility, flexibility, and time to value
• Land data 'as-is' and transform on demand ('schema on read')
• Scale-out architecture that is adaptive to business cost/performance constraints
Underpinned by process, technology, and data governance.
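The 'schema on read' idea above can be illustrated with a minimal sketch: raw records are landed as-is, and each consumer supplies column names and type casts at read time. The data and schema below are invented for illustration.

```python
import csv
import io

# Raw data landed "as-is" in the lake: no schema enforced at write time.
RAW = """2016-01-04,ACME,1200.50
2016-01-05,ACME,987.25
"""

def read_with_schema(raw, schema):
    """Apply a schema on read: column names and type casts are supplied
    by the consumer, not baked into the stored file."""
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, rec)})
    return rows

# One consumer's view of the same bytes; another consumer could read the
# identical file with different column names or types.
sales_schema = [("date", str), ("customer", str), ("amount", float)]
rows = read_with_schema(RAW, sales_schema)
print(rows[0]["amount"])  # -> 1200.5
```

Because the cast happens on read, a schema change is a code change in the consumer, not a reload of the stored data.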
10. Data Lake implementation strategy needs to…
• Combine different data sources
• Minimize data movement
• Leverage the Apache ecosystem
• Evolve seamlessly
• Serve the enterprise
[Diagram: the DATA LAKE fed by production data, web logs, public data, sales, billing, CRM, SCM, social media, location, clickstreams, and sensor data]
12. System Availability
Uptime | Downtime (per year)
99.999% (aka "five nines") | 5.26 minutes
99.99% (aka "four nines") | 52.6 minutes
99.5% | 1.83 days
99% (aka "two nines") | 3.65 days
95% | 18.25 days
What is your data warehouse's uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
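Downtime per year follows directly from the uptime percentage; a quick sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def downtime_minutes(uptime_pct):
    """Yearly downtime implied by an uptime SLA percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.999, 99.99, 99.5, 99.0, 95.0):
    mins = downtime_minutes(pct)
    print(f"{pct}% uptime -> {mins:.2f} min/year ({mins / (60 * 24):.2f} days)")
```

Note that each lost "nine" multiplies the allowed downtime by ten, which is why data warehouse and Hadoop SLAs that look numerically close can be operationally very different.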
13. Apache Ecosystem Trends
• Virtualization becoming more common
• Enterprise data management, protection, security
• SQL on Hadoop the norm
• Spark exploding
  – Generally a Lambda architecture, not Spark vs. MapReduce
• Non-HDFS app data integration
  – ELK, MongoDB, Cassandra, etc.
• High-performance / ACID / in-memory DBs with an HDFS backend
• IoT data collection considerations (HWX Onyara/NiFi)
14. Traditional Hadoop for the Data Lake?
Traditional Hadoop:
• Direct-attached storage
• Stand-alone servers
• Single purpose
• All-commodity environment
Enterprise challenges:
• Efficiency, agility, SLAs
• Rapid deployment
• Purpose-built silos
• Operational complexity
Reintroduces challenges that enterprise IT solved years ago.
15. Hadoop Has Multiple Workloads
A "one size fits all" approach to Hadoop infrastructure does not scale for diverse production workloads: Hadoop archive, Spark, HBase, SQL-on-Hadoop, Hive/Tez, MapReduce, and geo-distributed Hadoop.
16. COLLECT, STORE, ANALYZE & USE
Traditional and Emerging Sources
Traditional: enterprise file data, machine data, video archive
Emerging: social networks / user-generated content, public records, location data, Internet of Things
17. COLLECT, STORE, ANALYZE & USE
Traditional and Emerging Sources
Traditional storage: DAS, SAN, NAS, TAPE
Emerging storage: CLOUD, OBJECT
19. EMC Isilon Next-Gen Access Methods
• One instance of the file system services all dependent workloads simultaneously
20. Access Zones
• An access zone is:
  – A way to carve the cluster into smaller clusters
  – A way to control access based on individual authentication
  – OneFS's multi-tenancy solution
• Supports NFS, SMB, HDFS, and OpenStack Swift
[Diagram: a System Zone plus Access Zone-1 and Access Zone-2, each bound to its own authentication providers (Kerberos, domain controller, LDAP, NIS, group database)]
21. Data Sharing Across Access Zones
• The same files can be accessed by clients in different access zones
• Best for:
  – Multi-group collaboration with untrusted Active Directories
  – Multi-group data access governed by IP subnet
  – Hadoop analytics over data spanning multiple access zones
• Uniquely solves the collaboration challenge; saves time and money
22. THE BIG DATA LANDSCAPE
[Diagram: data sources (RDBMS, machine, IoT, applications, 3rd party, email, social media) feed a stream/CEP layer (near real-time, seconds), the data lake (Hadoop) with transform, organize, and manage/catalog functions, and a data warehouse. Analytics span statistical modeling/NLP and visualization (models may take hours or days), BI and SQL on Hadoop (queries may return in seconds or minutes), and search/index for enterprise log analysis.]
24. A Next Gen Data Lake Architecture
Existing sources (current data flow): ERP, CRM, data marts, data management
New sources (new data flow): clickstream, web & social, geolocation, sensor & machine, server logs
Hadoop platform: Hadoop core plus data services and operational services, running on commodity compute with Isilon storage (NFS, SMB, HTTP, Swift), serving business analytics, visualization & dashboards, and IT applications.
Key capabilities:
1. ETL/ELT offload
2. Active archive
3. Enrich with new data types
4. Multi-protocol access
5. Enterprise-grade data management
33. Use Cases & Requirements
• As we evaluated the business use cases the platform would support, we determined that we had a variety of workloads with different impacts on the platform
Use Cases
• Enterprise data hub that can consolidate disparate data sources (and data types) onto a common platform
• Migrate Enterprise Data Warehouse (EDW) transient data to a lower-cost storage platform
• Enable data enrichment services for in-record validation, data standardization, and analytic processing
• Integrate and provision data to target systems using Hadoop ecosystem components (e.g., Pig, Hive)
Requirements
• Ensure that the platform meets both availability and recoverability targets
• Align technology to internal skills and competencies
• Enable existing systems to interoperate with the platform using native protocols or services
• Ability to test and certify commercial products via a multi-distribution environment
• Enable co-resident processing of products to optimize the use of deployed infrastructure
• Ability to provide data protection and isolation of client data within a single instance of the platform (i.e., sub-tenancy)
34. Solution Approach
• Support for a variety of acquisition channels
• Common method for data types and formats
• Orchestration framework that manages all job execution
• Capabilities around data catalog, file validation, and schema evolution
• Data integration and provisioning framework
• Support for relational stores and exploration tools
35. Platform Approach
As we better defined and understood the use cases and requirements (data warehouse offload, data integration, enterprise data hub, enrichment validation and quality), they led us down a different path from a platform perspective:
✓ The ability to independently scale storage and compute
✓ Provide data protection for critical business information
✓ Support backup and disaster recovery
✓ Centrally managed via an intuitive user interface
✓ Leverage existing assets deployed in the enterprise
36. Example: Multi-protocol Support
One of our deployed use cases is multi-protocol support. This enables us to leverage existing assets and talent in the enterprise while still leveraging the compute paradigm of Hadoop.
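Multi-protocol support means a file written over NFS or SMB is immediately readable by Hadoop clients over HDFS. A minimal sketch of the HDFS side, building the WebHDFS read URL for such a file; the hostname and path are invented for illustration, 50070 is the stock Hadoop 2 NameNode HTTP port, and the port an Isilon cluster exposes may differ.

```python
from urllib.parse import quote

def webhdfs_url(host, path, op="OPEN", port=50070):
    """Build a WebHDFS REST URL (GET /webhdfs/v1/<path>?op=OPEN) for a
    file that other clients see over NFS or SMB."""
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?op={op}"

# The same bytes, three access paths (hostnames and paths are illustrative):
#   NFS : /mnt/datalake/landing/clicks/2016-01-04.log
#   SMB : \\datalake\landing\clicks\2016-01-04.log
#   HDFS: the URL below, readable by any WebHDFS client
print(webhdfs_url("datalake.example.com", "/landing/clicks/2016-01-04.log"))
```

Because all three protocols address one copy of the data, the usual "ingest into HDFS" step disappears.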
37. Example: Multi-distribution Support
Our organization sells a number of products to the market. Many of these deployments are on-premises due to concerns around data privacy or control, data transfer considerations, etc. To support this need, a multi-distribution platform was needed that could be used for product certification across similar data sets.
38. The Isilon Advantage for Hadoop
In-place analytics
• No data ingest necessary; Isilon provides shared multi-protocol access
• Native integration speeds time to insight
Enterprise data protection
• Fast snapshots, backup, and data recovery
• Simple, efficient data replication for disaster recovery
Lower costs
• Eliminates the need for dedicated Hadoop infrastructure
• Eliminates 3x mirroring for data protection
• Much more efficient than a DAS-based approach
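The "eliminates 3x mirroring" point can be made concrete with a back-of-the-envelope comparison. The ~20% erasure-coding overhead below is an assumed illustrative figure; actual OneFS overhead depends on the protection level configured.

```python
def raw_capacity_needed(usable_tb, overhead_factor):
    """Raw capacity required to deliver a given usable capacity."""
    return usable_tb * overhead_factor

usable = 1000  # 1 PB usable, illustrative
hdfs_3x = raw_capacity_needed(usable, 3.0)      # HDFS triple replication
isilon_fec = raw_capacity_needed(usable, 1.2)   # ~20% erasure-coding overhead (assumed)
print(f"3x replication: {hdfs_3x:.0f} TB raw; erasure coding: {isilon_fec:.0f} TB raw")
```

Under these assumptions the erasure-coded cluster needs well under half the raw disk of a triple-replicated one for the same usable capacity.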
Increased flexibility
• Simultaneous support for any Apache-compliant Hadoop distribution
• Collaborative engineering efforts with Cloudera, Hortonworks, and Pivotal
• Ambari integration for management, monitoring, and provisioning
Scale-out storage with native Hadoop integration