Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
This session will detail best practices for architecting, building, operating and managing an Analytics Data Lake platform. Key topics will include:

1) Defining next-generation Data Lake architectures. The de facto standard has been commodity DAS servers with HDFS, but there are now multiple solutions aimed at separating compute and storage, virtualizing or containerizing Hadoop applications, and utilizing Hadoop-compatible or embedded HDFS filesystems. This portion will explore the options available, and the pros and cons of each.

2) Data Ingest. There are many ways to load data into a Data Lake, including standardized Apache tools (Sqoop, Flume, Kafka, Storm, Spark, NiFi), standard file and object protocols (SFTP, NFS, REST, WebHDFS), and proprietary tools (e.g., Zaloni Bedrock, DataTorrent). This section will explore these options in the context of best fit to workflows; it will also look at key gaps and challenges, particularly in the areas of data formats and integration with metadata/cataloging tools.

3) Metadata & Cataloguing. One of the biggest inhibitors of successful Data Lake deployments is Data Governance, particularly in the areas of indexing, cataloguing and metadata management. It is nearly impossible to run analytics on top of a Data Lake and get meaningful, timely results without solving these problems. This portion will explore both emerging open standards (Apache Atlas, HCatalog) and proprietary tools (Cloudera Navigator, Zaloni Bedrock/Mica, Informatica Metadata Manager), and weigh the pros, cons and gaps of each.

4) Security & Access Controls. Solving these challenges is key for adoption in regulation-driven industries like Healthcare & Financial Services. There are multiple Apache projects and proprietary tools to address this, but the challenge is making security and access controls consistent across the entire application and infrastructure stack and over the data lifecycle, and being able to audit this in the face of legal challenges. This portion will explore available options and best practices.

5) Provisioning & Workflow Management. The real promise of the Data Lake is integrating Analytics workflows and tools on converged infrastructure with shared data, and building “As A Service” architectures oriented towards self-service data exploration and Analytics for end users. This is an emerging and immature area, but this session will explore some potential concepts, tools and options to achieve this.

This will be a moderately technical session, with the above topics being illustrated by real world examples. Attendees should have basic familiarity with Hadoop and the associated Apache projects.
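
Of the ingest paths listed under topic 2, WebHDFS is the easiest to illustrate without a cluster, because it is plain HTTP against URLs of the form `http://<host>:<port>/webhdfs/v1/<path>?op=...`. The sketch below only builds such URLs and sends nothing over the network. The host name, the port (50070 was the usual NameNode HTTP port in Hadoop 2.x) and the file path are assumptions made for illustration, and `webhdfs_url` is a hypothetical helper, not part of any Hadoop client library.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL: http://host:port/webhdfs/v1/<path>?op=<OP>&..."""
    if not path.startswith("/"):
        raise ValueError("path must be an absolute HDFS path")
    # 'op' names the file system operation, e.g. CREATE, OPEN, LISTSTATUS, MKDIRS.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and landing path, for illustration only.
create_url = webhdfs_url(
    "namenode.example.com", 50070,
    "/landing/clicks/2016-07-27.log", "CREATE", overwrite="false",
)
list_url = webhdfs_url("namenode.example.com", 50070, "/landing/clicks", "LISTSTATUS")
print(create_url)
print(list_url)
```

In the WebHDFS API, CREATE is issued as an HTTP PUT and LISTSTATUS as a GET; a real client would also need to follow the redirect that the NameNode returns for writes and handle authentication.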

Published in: Technology

  1. 1. © Copyright 2016 EMC Corporation. All rights reserved. EMC EMERGING TECHNOLOGIES ROBERT HOUT - ADVISORY SYSTEMS ENGINEER @rob_hout ACCELERATING ANALYTICS VALUE WITH A DATA LAKE STAMPEDECON 2016
  2. 2. What is a Data Lake?
  3. 3. 2020: A NEW DIGITAL WORLD. CONNECTED PEOPLE: 2.3B (2015) to 7B (2020), 3X. CONNECTED DEVICES: 4.9B (2015) to 30B (2020), 6X. DATA ON PLANET: 8ZB (2015) to 44ZB (2020), 5X.
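
The growth multiples on this slide are simple ratios of the 2020 projection to the 2015 figure. A minimal sketch of that arithmetic, using only the numbers printed on the slide (the dictionary keys and the `growth_multiple` name are illustrative):

```python
# (2015 value, 2020 projection) as printed on the slide.
PROJECTIONS = {
    "connected people (billions)": (2.3, 7.0),
    "connected devices (billions)": (4.9, 30.0),
    "data on planet (zettabytes)": (8.0, 44.0),
}


def growth_multiple(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


for label, (y2015, y2020) in PROJECTIONS.items():
    print(f"{label}: {growth_multiple(y2015, y2020):.1f}x")
```

The data ratio is 44 / 8 = 5.5, so the slide's "5X" is the truncated rather than the rounded value.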
  4. 4. Why Ingest All The Data?
  5. 5. 69% 83% Source: “The Business of Data” and Economist Intelligence Report, Published Jan 2016 Every Organization Can Gain Insights 60% Generating revenue from data Starting new BU developing data-related products / services Used data to make existing products / services more profitable Every Organization is A Data Organization
  6. 6. The Data Lake: Bringing Compute to Data. Diagram labels: EDWs, Marts, Storage, Search, Servers, Documents, Archives; ERP, CRM, RDBMS, Machines; Files, Images, Video, Logs, Clickstreams; External Data Sources. (1) Active archive •  Full fidelity original data •  Indefinite time, any source •  Lowest cost storage (2) Data management, transformations •  One source of data for all analytics •  Persisted state of transformed data •  Significantly faster & cheaper (3) Self-service exploratory BI •  Simple search + BI tools •  “Schema on read” agility •  Reduce BI user backlog requests (4) Multi-workload analytic platform •  Bring applications to data •  Combine different workloads on common data (e.g. SQL + Search) •  True BI agility
  7. 7. TO SUCCEED, SIMPLIFY TECHNOLOGY SO YOU CAN SHIFT FOCUS TO BUSINESS OUTCOMES. KEY CAPABILITIES TO LOOK FOR IN A COMPREHENSIVE BIG DATA SOLUTION: INGEST: Capture data from a wide range of sources, traditional and new. STORE: Store everything in one environment for cross data analysis. ANALYZE: Use advanced algorithms to discover new, predictive patterns. SURFACE: Share insights with business domain experts. ACT: Build data-driven applications to meet business needs.
  8. 8. © Copyright 2014 EMC Corporation. All rights reserved. Why a Data Lake It delivers comprehensive data services not a point solution •  Our traditional IT customers solve the most pressing issues first, e.g. building a physical Hadoop cluster •  Customers are very good at building parts of a data lake that don’t always align to one another •  Customers struggle with integrating, managing, and deploying the various platforms needed for business analytics •  Customers have little or no overall data governance, but they need it in order to establish a fully functional data lake
  9. 9. The Data Lake High-Level Vision. Self-Service BI and Analytics: • Business-led, cross-functional methodology focused on short, iterative release cycles • Functional distinction between Data Preparation (IT) and Data Usage (Business) • Enabling on-demand services - BI and Analytics sandboxes, tools, and data. Data as a Service: • The provisioning of data and services to the business independent of data end usage • A key foundation of Self-Service BI (Data Preparation) • Services can include publication, profiling, archiving, metadata, alerts, and notifications. Business Data Lake: • Alternative to traditional data warehousing focused on agility, flexibility and time to value • Land data ‘as-is’ and transform on demand (‘schema on read’) • Scale-out architecture that is adaptive to business cost/performance constraints. Process, Technology, Data Governance
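
"Land data as-is and transform on demand" can be made concrete with a few lines of code. The sketch below is an illustration only: the raw records, the field names and the `read_with_schema` helper are all hypothetical, and a real data lake would do this through an engine such as Hive or Spark rather than plain Python. The point it shows is that the raw lines are stored untouched and a schema is applied only when a consumer reads them.

```python
import json

# Raw events landed exactly as received; records need not share the same fields.
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "ts": "2016-07-27T10:00:00"}',
    '{"user": "b42", "amount": "5", "channel": "web"}',
]

# A schema chosen by one consumer at read time: field name -> type to cast to.
SALES_SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project/cast it to `schema` at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }


rows = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
print(rows)
```

A second consumer could read the same raw lines with a different schema (for example one that keeps `channel`), without any change to what was stored.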
  10. 10. Data Lake implementation strategy needs to… •  Combine different data sources •  Minimize data movement •  Leverage the Apache ecosystem •  Evolve seamlessly •  Serve the Enterprise. Diagram labels: Production Data Web Logs Public Sales Billing CRM SCM Social Media Location Click Streams Sensor Data DATA LAKE
  11. 11. PRODUCTION HADOOP HAS SEVERAL CHALLENGES: Security, Business Continuity, Compliance, Tools & Apps, Business Units, Data Migration, Scalability
  12. 12. System Availability
      Uptime                   Downtime (per year)
      99.999% (AKA 5 nines)    5.26 minutes
      99.99% (AKA 4 nines)     52.6 minutes
      99.5%                    1.83 days
      99% (AKA 2 nines)        3.65 days
      95%                      18.25 days
      What is your Data Warehouse’s uptime SLA? What is your Hadoop uptime SLA? Why are they different?
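
The downtime column of an availability table like this one follows from a single formula: the fraction of the year not covered by the SLA, multiplied by the length of a year. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime SLA."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * MINUTES_PER_YEAR


for sla in (99.999, 99.99, 99.5, 99.0, 95.0):
    minutes = downtime_minutes_per_year(sla)
    days = minutes / MINUTES_PER_DAY
    print(f"{sla}% uptime -> {minutes:.2f} minutes ({days:.2f} days) per year")
```

Running the same function against both the data warehouse SLA and the Hadoop SLA gives a direct way to compare the two downtime budgets the slide asks about.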
  13. 13. APACHE Ecosystem Trends •  Virtualization becoming more common •  Enterprise data management, protection, security •  SQL on Hadoop the norm •  Spark exploding –  Generally Lambda architecture, not Spark vs. M/R •  Non-HDFS App Data Integration –  ELK, MongoDB, Cassandra, etc. •  High performance/ACID/Mem DBs with HDFS Backend •  IoT data collection considerations (HWX Onyara/NiFi)
  14. 14. Traditional Hadoop For The Data Lake? Traditional Hadoop: Direct-attached storage; Stand-alone Servers; Single purpose; All commodity environment. Enterprise Challenges: Efficiency, Agility, SLAs; Rapid deployment; Purpose Built Silos; Operational Complexity. Reintroduces challenges that Enterprise IT solved years ago
  15. 15. Hadoop HAS MULTIPLE WORKLOADS “One size fits all” approach to Hadoop Infrastructure does not scale for diverse production workloads Hadoop Archive Spark HBase SQL-on-Hadoop Hive/Tez MapReduce Geo-Dist Hadoop
  16. 16. COLLECT, STORE, ANALYZE & USE Traditional and Emerging Sources Social Networks, User Generated Content Public records Location Data Internet Of Things Emerging Enterprise File Data Machine Data Traditional Video Archive
  17. 17. COLLECT, STORE, ANALYZE & USE Traditional and Emerging Sources Emerging Traditional DAS CLOUD OBJECT TAPE SAN NAS
  18. 18. Isilon Scale-Out Data Lake Data Silos vs Consolidated Data Lake
  19. 19. EMC Isilon Next-Gen Access Methods •  One instance of the file services all dependent workloads simultaneously
  20. 20. •  An access zone is: –  A way to carve the cluster into smaller clusters –  A way to control access based on individual authentication –  OneFS’s Multi-Tenancy solution NFS, SMB, HDFS and OpenStack Swift Access Zones Chez NFSAccess Zone-1 System Zone Access Zone-2 Kerberos-1 Domain Controller-2 LDAP-1 NIS - 1 Group Database - 1 Kerberos-2 Domain Controller-1 Group Database - 2
  21. 21. Data Sharing Across Access Zones •  Same files can be accessed by different access zone clients •  Best for: –  multi-group collaboration w/ untrusted Active Directories –  multi-group data access governed by IP subnet –  Hadoop analytics over multi access zone data •  Uniquely solves the collaboration challenge; saves time and money
  22. 22. THE BIG DATA LANDSCAPE: DATA LAKE (HADOOP) RDBMS MACHINE IOT STATISTICAL MODELING/NLP VISUALIZATION TRANSFORM BI ORGANIZE MANAGE/CATALOG DATA WAREHOUSE STREAM CEP NEAR REAL-TIME MODELS MAY TAKE HOURS OR DAYS QUERIES MAY RETURN IN SECONDS OR MINUTES SECONDS SEARCH/INDEX ENTERPRISE LOG ANALYSIS APPLICATIONS 3rd PARTY EMAIL SOCIAL MEDIA SQL ON HADOOP
  23. 23. THE BIG DATA LANDSCAPE: DATA LAKE (HADOOP) RDBMS MACHINE IOT STATISTICAL MODELING/NLP VISUALIZATION TRANSFORM BI ORGANIZE MANAGE/CATALOG DATA WAREHOUSE STREAM CEP NEAR REAL-TIME MODELS MAY TAKE HOURS OR DAYS QUERIES MAY RETURN IN SECONDS OR MINUTES SECONDS SEARCH/INDEX ENTERPRISE LOG ANALYSIS APPLICATIONS 3rd PARTY EMAIL SOCIAL MEDIA SQL ON HADOOP
  24. 24. A Next Gen Data Lake Architecture Clickstream Web & Social Geolocation Sensor & Machine Server Logs EXISTING SOURCES ERP CRM Commodity Compute DATA SERVICES OPERATIONAL SERVICES Hadoop Platform HADOOP CORE Business Analytics Business Analytics Visualization & Dashboards Visualization & Dashboards IT Applications NEW SOURCES 2 3 1 Data Marts Data Management ETL/ELT OFFLOAD ACTIVE ARCHIVE ENRICH WITH NEW DATA TYPES MULTI-PROTOCOL ACCESS ENTERPRISE-GRADE DATA MANAGEMENT 5 NFS, SMB, HTTP, Swift 1 2 3 4 5 Isilon 4 New Data Flow Current Data Flow Legend OFFLOAD
  25. 25. The Data Lake Vision Storage Layer Data Store Manager 3rd Party INGEST MANAGER STREAM Exploratory Analytics Isilon XtremIO ECS/ViPR DSSD DATA GOVERNOR SECURITY INDEXING CATALOGING POLICY Modeling Correlations SQL NSQL BATCH Interactive Analytics Aggregates OLAP SQL NSQL Realtime Analytics Modeling Scoring SQL NSQL In MEM Shared Store(s) Private Store(s) FILE COLUMN DB RELATIONAL DB GRAPH DB KEY VALUE DOCUMENT LOGS FILE BATCH SQL ETL MARKETPLACE MANAGER DATA SERVICES PORTAL VNX APPLICATIONS USERS Analytics Platform Manager VMware Openstack Docker Evo:Rails
  26. 26. Ingest Manager Rapid collection of data from unlimited sources Application Services Portal User Services Portal Application Platform Manager Data Ingest Manager Ingest Application Provisioning Ingest Management and Control Catalog Connector Locality Manager Indexing Connector Security Manager Data Governor
  27. 27. The Data Governor Enabling comprehensive data management Data Catalog Management Security Management Security and Roles LDAP AD BUILT-IN Policy Management Data Types Shared Private Data Sources Consumer Access Rights Compliant Data Sets Encryption/Location Reqs. Lineage Requirements Index Management Licensing Resource Policies Usage Limits Index Management Index Type Index Usage Indexing Resources Index Engine(s) Data Catalog Types Public Private Catalog Type Catalog Usage Catalog Resources Catalog Engine(s) Catalog Security and Roles Catalog Operations Collection Scavenging
  28. 28. Data Store Manager Manage and provision storage for a variety of uses Application Platform Manager Data Store Manager Storage Manager Shared Stores Data Governor Private Stores Compliant Stores Temporary Stores 3rd Party Isilon XtremIO ECS/ViPR DSSD VNX Storage Configuration and Provisioning Manager
  29. 29. Application Provisioning Application Platform Manager Rapidly and seamlessly deploy applications and resources Application Services Portal User Services Portal Application Platform Manager Application Platform Manager Application Provisioning Compute Resource Provisioning Platform Provisioning VM Workflows Provisioning Rules Networking Optimizations Application Deployment Management Package Manager App Store Manager Data Store Manager Data Governor
  30. 30. An example analytic workflow From idea to action using the BDL Application Provisioning User Services Portal Application Platform Manager Platform Provisioning Data Store Manager Data Governor Data Catalog Management Security Management Policy Management Index Management Optimization Engine Recommendation Engine
  31. 31. WOULD YOU RATHER INTEGRATE OR INNOVATE?
  32. 32. THE DATA LAKE One Customer’s Journey with Hadoop
  33. 33. Use Cases & Requirements • As we evaluated the business use cases the platform would need to support, we determined that we had a variety of workloads with different impacts on the platform. Use Cases •  Enterprise Data Hub that can consolidate disparate data sources onto a common platform (i.e. data types) •  Migrate Enterprise Data Warehouse (EDW) transient data to a lower-cost storage platform •  Provide data enrichment services to enable in-record validation, data standardization and analytic processing •  Integrate and provision data to target systems using Hadoop ecosystem components (e.g. Pig, Hive) Requirements •  Ensure that the platform meets both availability and recoverability targets •  Align technology to internal skills and competencies •  Enable existing systems to interoperate with the platform using native protocols or services •  Ability to test and certify commercial products via a multi-distribution environment •  Enable co-resident processing of products to optimize the use of deployed infrastructure •  Ability to provide data protection and isolation of client data within a single instance of the platform (i.e. sub-tenancy)
  34. 34. Solution Approach •  Support for a variety of acquisition channels •  Common method for data types and formats •  Orchestration framework that manages all job execution •  Includes capabilities around data catalog, file validation, and schema evolution •  Data integration and provisioning framework •  Support for relational stores and exploration tools
  35. 35. Platform Approach As we better defined and understood use cases and requirements, it led us down a different path from a platform perspective Data Warehouse Offload Data Integration Enterprise Data Hub Enrichment Validation and Quality ✓ The ability to independently scale storage and compute ✓ Provide data protection for critical business information ✓ Support backup and disaster recovery ✓ Centrally managed via intuitive user interface ✓ Leverage existing assets deployed in the enterprise
  36. 36. Example: Multi-protocol Support One of our deployed use cases is multi-protocol support. This enables us to leverage existing assets and talent in the enterprise while still leveraging the compute paradigm of Hadoop
  37. 37. Example: Multi-distribution Support Our organization sells a number of products to the market. Many of these deployments are on-premises due to concerns around data privacy or control, data transfer considerations, etc. To support this, we needed a multi-distribution platform that could be used for product certification across similar data sets
  38. 38. The Isilon Advantage for Hadoop In-place analytics •  No data ingest necessary, Isilon provides shared multi-protocol access •  Native integration speeds time to insight Enterprise data protection •  Fast snapshots, backup, and data recovery •  Simple, efficient data replication for disaster recovery Lower costs •  Eliminates the need for dedicated Hadoop infrastructure •  Eliminates 3x mirroring for data protection •  Much more efficient than DAS-based approach Increase flexibility •  Simultaneous support for any Apache-compliant Hadoop distribution •  Collaborative engineering efforts with Cloudera, Hortonworks, and Pivotal •  Ambari integration for management, monitoring, and provisioning Scale-out storage with native Hadoop integration
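
The "eliminates 3x mirroring" point is a capacity argument, and a short calculation makes it concrete. HDFS keeps three copies of each block by default, so every usable terabyte costs three raw terabytes. The sketch below compares that with a parity-style protection scheme; the 25% overhead figure is an assumption chosen purely for illustration and is not a published Isilon number, since the real overhead depends on cluster size and the protection level configured.

```python
def raw_capacity_needed(usable_tb: float, overhead_ratio: float) -> float:
    """Raw TB required when each usable TB costs (1 + overhead_ratio) raw TB."""
    if usable_tb < 0 or overhead_ratio < 0:
        raise ValueError("inputs must be non-negative")
    return usable_tb * (1 + overhead_ratio)


usable_tb = 100.0

# HDFS default replication factor of 3: two extra copies, i.e. 200% overhead.
replicated_raw = raw_capacity_needed(usable_tb, 2.0)

# Hypothetical parity-based protection with 25% overhead (illustrative only).
parity_raw = raw_capacity_needed(usable_tb, 0.25)

print(f"3x replication: {replicated_raw:.0f} TB raw for {usable_tb:.0f} TB usable")
print(f"parity (assumed 25%): {parity_raw:.0f} TB raw for {usable_tb:.0f} TB usable")
```

Substituting the overhead of an actual protection policy for the assumed 25% turns this into a rough sizing comparison between the two approaches.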
