One Large Data Lake, Hold the Hype
Rocky Mountain DataCon 2016
Jared Winick
Senior Data/Solutions Engineer, Koverse
2
Outline
• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion
3
4
Just because “Data Lake” is overused, misused, and abused
doesn’t mean the concept is wrong
5
The Concept of a Data Lake
We can all agree that a Data Lake is a centralized (at least
logically) repository for all forms of data within an organization.
6
https://www.wired.com/2013/04/desktop-cluttered-help/
7
The Concept of a Data Lake
…but there must be more to it than putting all your data in HDFS
or S3.
8
Defining the Key Characteristics
1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions
A Data Lake has a platform-application duality
9
Indexing / Search Across All Data
• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data
10
Indexing / Search Across All Data
Search: A18923
Employees Data Set:
{
  employeeId: “A18923”,
  email: “jaredwinick@koverse.com”,
  firstName: “Jared”,
  …
}
Network Events Data Set:
{
  id: “a18923”,
  eventType: “login”,
  time: 1478557775010
  …
}
Find data across data sets. Understand its format and structure.
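A minimal sketch of the idea on this slide, assuming a simple in-memory inverted index rather than any particular product's search engine. The names `DATA_SETS`, `build_index`, and `search` are hypothetical; the sample records are modeled on the two records shown above, with the elided fields left out. Values are lowercased so that the single term A18923 matches both records.

```python
from collections import defaultdict

# Hypothetical sample records modeled on the slide (elided fields omitted).
DATA_SETS = {
    "Employees": [
        {"employeeId": "A18923", "email": "jaredwinick@koverse.com", "firstName": "Jared"},
    ],
    "Network Events": [
        {"id": "a18923", "eventType": "login", "time": 1478557775010},
    ],
}


def build_index(data_sets):
    """Map each lowercased field value to the places where it occurs."""
    index = defaultdict(list)
    for name, records in data_sets.items():
        for position, record in enumerate(records):
            for field_name, value in record.items():
                # Each posting is (data set name, record position, field name).
                index[str(value).lower()].append((name, position, field_name))
    return index


def search(index, term):
    """Return every posting whose value equals the term, ignoring case."""
    return index.get(term.lower(), [])


INDEX = build_index(DATA_SETS)
hits = search(INDEX, "A18923")
for data_set, position, field_name in hits:
    print(f"{data_set}[{position}].{field_name}")
```

Because each posting carries the data set and field name, a hit also tells the user where the value lives, which is what makes search useful for understanding format and structure.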
11
Interactive Access for Everyone
• A Data Lake is strategic and should serve many different types
of users.
• Should have self-service features.
• Together, these requirements mean supporting an interactive, multi-user load.
12
Multi-Level Access Control
• Every organization has data access control requirements these
days.
• Different levels of granularity for different environments/use
cases.
– Data Set Level
– Column/Field Level
– Row/Record Level
• Far easier to engineer up front than add on later.
13
Multi-Level Access Control
[Figure: a table with columns Name, DeptId, DOB, and email, illustrating access control at the Data Set, Column, and Row levels]
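The three levels can be sketched as successive filters on a read. This is an illustrative sketch only: `Policy`, `read`, and the sample records are hypothetical and do not represent the access control API of any specific system. The column names come from the slide's example table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Record = Dict[str, object]


def allow_all(record: Record) -> bool:
    """Default row filter: every record is visible."""
    return True


@dataclass
class Policy:
    """Hypothetical per-user policy covering all three levels."""
    allowed_data_sets: Set[str]                               # data set level
    hidden_columns: Set[str] = field(default_factory=set)     # column/field level
    row_filter: Callable[[Record], bool] = allow_all          # row/record level


def read(data_set_name: str, records: List[Record], policy: Policy) -> List[Record]:
    """Return only the records and fields this policy permits."""
    if data_set_name not in policy.allowed_data_sets:
        raise PermissionError(f"no access to data set {data_set_name!r}")
    visible = []
    for record in records:
        if policy.row_filter(record):
            visible.append(
                {k: v for k, v in record.items() if k not in policy.hidden_columns}
            )
    return visible


employees = [
    {"Name": "Ann", "DeptId": 10, "DOB": "1980-01-01", "email": "ann@example.com"},
    {"Name": "Bob", "DeptId": 20, "DOB": "1975-06-15", "email": "bob@example.com"},
]

# A manager of department 10 who may not see dates of birth.
manager = Policy(
    allowed_data_sets={"Employees"},
    hidden_columns={"DOB"},
    row_filter=lambda record: record["DeptId"] == 10,
)
result = read("Employees", employees, manager)
```

Enforcing all three checks inside one read path is one way to reflect the point that access control is easier to engineer up front than to add on later.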
14
Integration With Data Science Tools
• The ultimate point of a Data Lake is to “monetize” data
– For a corporation this is making or saving money
– For a government this is better serving your citizens
– For a research organization this is solving new problems/answering
previously unknown questions
• Need to be able to analyze and transform data sets into new
data sets
• From BI queries to text analytics to machine learning.
15
Integration With Data Science Tools
A Data Lake needs to support multiple internal analytic
“customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists
16
Abstractions
• Provide a level of abstraction over your data
– Data Sets / Collections
– Records / Rows
– Transformations
• Enables a consistent API for interacting with any data
regardless of its shape, size, and content
– Reusability
– Increased development speed
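A rough sketch of what such abstractions might look like, assuming records are plain dictionaries and a transformation is any function from records to records. `DataSet`, `only_logins`, and `count_by` are hypothetical names invented for this illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Transformation = Callable[[Iterable[Record]], Iterable[Record]]


@dataclass
class DataSet:
    """A named collection of records, whatever their shape."""
    name: str
    records: List[Record]

    def transform(self, new_name: str, fn: Transformation) -> "DataSet":
        """Apply a transformation and return the result as a new data set."""
        return DataSet(new_name, list(fn(self.records)))


def only_logins(records: Iterable[Record]) -> Iterable[Record]:
    """Keep only records whose eventType is 'login'."""
    return (r for r in records if r.get("eventType") == "login")


def count_by(field_name: str) -> Transformation:
    """Build a reusable transformation that counts records per field value."""
    def fn(records: Iterable[Record]) -> List[Record]:
        counts: Dict[object, int] = {}
        for r in records:
            key = r.get(field_name)
            counts[key] = counts.get(key, 0) + 1
        return [{field_name: k, "count": v} for k, v in counts.items()]
    return fn


events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},
    {"id": "b20417", "eventType": "login"},
    {"id": "a18923", "eventType": "login"},
])

logins_per_user = (
    events.transform("Logins", only_logins)
          .transform("Logins Per User", count_by("id"))
)
```

Since `count_by` knows nothing about network events, the same transformation could be reused on any data set, which is the reusability benefit listed above.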
17
The Koverse Data Lake
18
Architecture – High Level
[Diagram components: HDFS, Zookeeper, Accumulo, Spark, Koverse]
A distributed key/value store like Apache Accumulo enables storage of very large
volumes of data while maintaining low latency access.
19
Architecture – Distributed Key/Value Store Benefits
These benefits apply to Apache Accumulo, but also likely to Apache HBase,
Cassandra, and other similar systems.
1. Easily scale to trillions of key/values
2. Distributed storage
   – Parallel processing in Hadoop MapReduce or Spark
   – Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
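To illustrate why sorted keys make range scans efficient, here is a single-process sketch that keeps keys in sorted order and uses binary search to find the start and end of a range. It is an assumption-laden toy, not the Accumulo API: `SortedKeyValueStore` and the `user:<id>:<timestamp>` key layout are hypothetical.

```python
import bisect


class SortedKeyValueStore:
    """Toy in-memory store that keeps its keys sorted, as a stand-in for a
    distributed sorted key/value store."""

    def __init__(self):
        self._keys = []      # sorted list of keys
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)   # insert while preserving order
        self._values[key] = value

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyValueStore()
store.put("user:a18923:1478557775010", "login")
store.put("user:b20417:1478557775999", "login")
store.put("user:a18923:1478557790000", "logout")

# Prefix scan for one user: ';' is the character immediately after ':',
# so this range covers every key that starts with "user:a18923:".
rows = list(store.scan("user:a18923:", "user:a18923;"))
```

Locating the range costs two binary searches, and the scan then touches only the matching keys, regardless of how many other keys the store holds.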
20
Architecture - Details
[Diagram: Accumulo holds a Record Table, an Index Table, and a Statistics/Aggregations Table. Koverse: low latency R/W. Spark: efficient range scans. Users/Apps use REST.]
21
Discussion
