The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
5. Every Organization Can Gain Insights
• 60% are generating revenue from data
• 69% are starting new business units developing data-related products / services
• 83% have used data to make existing products / services more profitable
Every organization is a data organization.
Source: "The Business of Data", Economist Intelligence report, published Jan 2016
6. The Data Lake: Bringing Compute to Data
[Diagram: sources (ERP, CRM, RDBMS, machines, files, images, video, logs, clickstreams, external data sources) feed the data lake, consolidating EDWs, marts, storage, search, servers, documents, and archives]
1. Active archive
• Full fidelity original data
• Indefinite time, any source
• Lowest cost storage
2. Data management, transformations
• One source of data for all analytics
• Persisted state of transformed data
• Significantly faster & cheaper
3. Self-service exploratory BI
• Simple search + BI tools
• "Schema on read" agility
• Reduce BI user backlog requests
4. Multi-workload analytic platform
• Bring applications to data
• Combine different workloads on common data (i.e. SQL + Search)
• True BI agility
7. TO SUCCEED, SIMPLIFY TECHNOLOGY SO YOU CAN SHIFT FOCUS TO BUSINESS OUTCOMES
KEY CAPABILITIES TO LOOK FOR IN A COMPREHENSIVE BIG DATA SOLUTION:
• INGEST: Capture data from a wide range of sources, traditional and new
• STORE: Store everything in one environment for cross-data analysis
• ANALYZE: Use advanced algorithms to discover new, predictive patterns
• SURFACE: Share insights with business domain experts
• ACT: Build data-driven applications to meet business needs
9. The Data Lake High-Level Vision
Self-Service BI and Analytics
• Business-led, cross-functional methodology focused on short, iterative release cycles
• Functional distinction between Data Preparation (IT) and Data Usage (Business)
• Enabling on-demand services: BI and analytics sandboxes, tools, and data
Data as a Service
• The provisioning of data and services to the business independent of data end usage
• A key foundation of Self-Service BI (Data Preparation)
• Services can include publication, profiling, archiving, metadata, alerts, and notifications
Business Data Lake
• Alternative to traditional data warehousing focused on agility, flexibility, and time to value
• Land data 'as-is' and transform on demand ('schema on read')
• Scale-out architecture that is adaptive to business cost/performance constraints
Underpinned by process, technology, and data governance.
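The 'schema on read' idea above can be illustrated with a minimal sketch: raw records are landed as-is, and each consumer supplies column names and type casts at read time. The data and schema below are invented for illustration.

```python
import csv
import io

# Raw data landed "as-is" in the lake: no schema enforced at write time.
RAW = """2016-01-04,ACME,1200.50
2016-01-05,ACME,987.25
"""

def read_with_schema(raw, schema):
    """Apply a schema on read: column names and type casts are supplied
    by the consumer, not baked into the stored file."""
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, rec)})
    return rows

# One consumer's view of the same bytes; another consumer could read the
# identical file with different column names or types.
sales_schema = [("date", str), ("customer", str), ("amount", float)]
rows = read_with_schema(RAW, sales_schema)
print(rows[0]["amount"])  # -> 1200.5
```

Because the cast happens on read, a schema change is a code change in the consumer, not a reload of the stored data.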
10. Data Lake implementation strategy needs to…
• Combine different data sources
• Minimize data movement
• Leverage the Apache ecosystem
• Evolve seamlessly
• Serve the enterprise
[Diagram: the DATA LAKE fed by production data, web logs, public data, sales, billing, CRM, SCM, social media, location, clickstreams, and sensor data]
12. System Availability
Uptime | Downtime (per year)
99.999% (aka "five nines") | 5.26 minutes
99.99% (aka "four nines") | 52.6 minutes
99.5% | 1.83 days
99% (aka "two nines") | 3.65 days
95% | 18.25 days
What is your data warehouse's uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
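Downtime per year follows directly from the uptime percentage; a quick sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def downtime_minutes(uptime_pct):
    """Yearly downtime implied by an uptime SLA percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.999, 99.99, 99.5, 99.0, 95.0):
    mins = downtime_minutes(pct)
    print(f"{pct}% uptime -> {mins:.2f} min/year ({mins / (60 * 24):.2f} days)")
```

Note that each lost "nine" multiplies the allowed downtime by ten, which is why data warehouse and Hadoop SLAs that look numerically close can be operationally very different.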
13. Apache Ecosystem Trends
• Virtualization becoming more common
• Enterprise data management, protection, security
• SQL on Hadoop the norm
• Spark exploding
  – Generally a Lambda architecture, not Spark vs. MapReduce
• Non-HDFS app data integration
  – ELK, MongoDB, Cassandra, etc.
• High-performance / ACID / in-memory DBs with an HDFS backend
• IoT data collection considerations (HWX Onyara/NiFi)
14. Traditional Hadoop for the Data Lake?
Traditional Hadoop:
• Direct-attached storage
• Stand-alone servers
• Single purpose
• All-commodity environment
Enterprise challenges:
• Efficiency, agility, SLAs
• Rapid deployment
• Purpose-built silos
• Operational complexity
Reintroduces challenges that enterprise IT solved years ago.
15. Hadoop Has Multiple Workloads
A "one size fits all" approach to Hadoop infrastructure does not scale for diverse production workloads: Hadoop archive, Spark, HBase, SQL-on-Hadoop, Hive/Tez, MapReduce, and geo-distributed Hadoop.
16. COLLECT, STORE, ANALYZE & USE
Traditional and Emerging Sources
Traditional: enterprise file data, machine data, video archive
Emerging: social networks / user-generated content, public records, location data, Internet of Things
17. COLLECT, STORE, ANALYZE & USE
Traditional and Emerging Sources
Traditional storage: DAS, SAN, NAS, TAPE
Emerging storage: CLOUD, OBJECT
19. EMC Isilon Next-Gen Access Methods
• One instance of the file system services all dependent workloads simultaneously
20. Access Zones
• An access zone is:
  – A way to carve the cluster into smaller clusters
  – A way to control access based on individual authentication
  – OneFS's multi-tenancy solution
• Supports NFS, SMB, HDFS, and OpenStack Swift
[Diagram: a System Zone plus Access Zone-1 and Access Zone-2, each bound to its own authentication providers (Kerberos, domain controller, LDAP, NIS, group database)]
21. Data Sharing Across Access Zones
• The same files can be accessed by clients in different access zones
• Best for:
  – Multi-group collaboration with untrusted Active Directories
  – Multi-group data access governed by IP subnet
  – Hadoop analytics over data spanning multiple access zones
• Uniquely solves the collaboration challenge; saves time and money
22. THE BIG DATA LANDSCAPE
[Diagram: data sources (RDBMS, machine, IoT, applications, 3rd party, email, social media) feed a stream/CEP layer (near real-time, seconds), the data lake (Hadoop) with transform, organize, and manage/catalog functions, and a data warehouse. Analytics span statistical modeling/NLP and visualization (models may take hours or days), BI and SQL on Hadoop (queries may return in seconds or minutes), and search/index for enterprise log analysis.]
24. A Next Gen Data Lake Architecture
Existing sources (current data flow): ERP, CRM, data marts, data management
New sources (new data flow): clickstream, web & social, geolocation, sensor & machine, server logs
Hadoop platform: Hadoop core plus data services and operational services, running on commodity compute with Isilon storage (NFS, SMB, HTTP, Swift), serving business analytics, visualization & dashboards, and IT applications.
Key capabilities:
1. ETL/ELT offload
2. Active archive
3. Enrich with new data types
4. Multi-protocol access
5. Enterprise-grade data management
33. Use Cases & Requirements
• As we evaluated the business use cases the platform would support, we determined that we had a variety of workloads with different impacts on the platform
Use Cases
• Enterprise data hub that can consolidate disparate data sources (and data types) onto a common platform
• Migrate Enterprise Data Warehouse (EDW) transient data to a lower-cost storage platform
• Enable data enrichment services for in-record validation, data standardization, and analytic processing
• Integrate and provision data to target systems using Hadoop ecosystem components (e.g., Pig, Hive)
Requirements
• Ensure that the platform meets both availability and recoverability targets
• Align technology to internal skills and competencies
• Enable existing systems to interoperate with the platform using native protocols or services
• Ability to test and certify commercial products via a multi-distribution environment
• Enable co-resident processing of products to optimize the use of deployed infrastructure
• Ability to provide data protection and isolation of client data within a single instance of the platform (i.e., sub-tenancy)
34. Solution Approach
• Support for a variety of acquisition channels
• Common method for data types and formats
• Orchestration framework that manages all job execution
• Capabilities around data catalog, file validation, and schema evolution
• Data integration and provisioning framework
• Support for relational stores and exploration tools
35. Platform Approach
As we better defined and understood the use cases and requirements (data warehouse offload, data integration, enterprise data hub, enrichment validation and quality), they led us down a different path from a platform perspective:
✓ The ability to independently scale storage and compute
✓ Provide data protection for critical business information
✓ Support backup and disaster recovery
✓ Centrally managed via an intuitive user interface
✓ Leverage existing assets deployed in the enterprise
36. Example: Multi-protocol Support
One of our deployed use cases is multi-protocol support. This enables us to leverage existing assets and talent in the enterprise while still leveraging the compute paradigm of Hadoop.
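Multi-protocol support means a file written over NFS or SMB is immediately readable by Hadoop clients over HDFS. A minimal sketch of the HDFS side, building the WebHDFS read URL for such a file; the hostname and path are invented for illustration, 50070 is the stock Hadoop 2 NameNode HTTP port, and the port an Isilon cluster exposes may differ.

```python
from urllib.parse import quote

def webhdfs_url(host, path, op="OPEN", port=50070):
    """Build a WebHDFS REST URL (GET /webhdfs/v1/<path>?op=OPEN) for a
    file that other clients see over NFS or SMB."""
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?op={op}"

# The same bytes, three access paths (hostnames and paths are illustrative):
#   NFS : /mnt/datalake/landing/clicks/2016-01-04.log
#   SMB : \\datalake\landing\clicks\2016-01-04.log
#   HDFS: the URL below, readable by any WebHDFS client
print(webhdfs_url("datalake.example.com", "/landing/clicks/2016-01-04.log"))
```

Because all three protocols address one copy of the data, the usual "ingest into HDFS" step disappears.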
37. Example: Multi-distribution Support
Our organization sells a number of products to the market. Many of these deployments are on-premises due to concerns around data privacy or control, data transfer considerations, etc. To support this need, a multi-distribution platform was needed that could be used for product certification across similar data sets.
38. The Isilon Advantage for Hadoop
In-place analytics
• No data ingest necessary; Isilon provides shared multi-protocol access
• Native integration speeds time to insight
Enterprise data protection
• Fast snapshots, backup, and data recovery
• Simple, efficient data replication for disaster recovery
Lower costs
• Eliminates the need for dedicated Hadoop infrastructure
• Eliminates 3x mirroring for data protection
• Much more efficient than a DAS-based approach
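The "eliminates 3x mirroring" point can be made concrete with a back-of-the-envelope comparison. The ~20% erasure-coding overhead below is an assumed illustrative figure; actual OneFS overhead depends on the protection level configured.

```python
def raw_capacity_needed(usable_tb, overhead_factor):
    """Raw capacity required to deliver a given usable capacity."""
    return usable_tb * overhead_factor

usable = 1000  # 1 PB usable, illustrative
hdfs_3x = raw_capacity_needed(usable, 3.0)      # HDFS triple replication
isilon_fec = raw_capacity_needed(usable, 1.2)   # ~20% erasure-coding overhead (assumed)
print(f"3x replication: {hdfs_3x:.0f} TB raw; erasure coding: {isilon_fec:.0f} TB raw")
```

Under these assumptions the erasure-coded cluster needs well under half the raw disk of a triple-replicated one for the same usable capacity.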
Increased flexibility
• Simultaneous support for any Apache-compliant Hadoop distribution
• Collaborative engineering efforts with Cloudera, Hortonworks, and Pivotal
• Ambari integration for management, monitoring, and provisioning
Scale-out storage with native Hadoop integration