One Large Data Lake, Hold the Hype

Rocky Mountain DataCon 2016
Jared Winick
Senior Data/Solutions Engineer, Koverse

2
Outline
• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion

4
Just because “Data Lake” is
overused
misused
abused
doesn’t mean the concept is wrong

5
The Concept of a Data Lake
We can all agree that a Data Lake is a centralized (at least
logically) repository for all forms of data within an organization.

6
https://www.wired.com/2013/04/desktop-cluttered-help/

7
The Concept of a Data Lake
…but there must be more to it than putting all your data in HDFS
or S3.

8
Defining the Key Characteristics
1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions
A Data Lake has a platform-application duality

9
Indexing / Search Across All Data
• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data

10
Indexing / Search Across All Data
Search: A18923
Employees Data Set
{
  employeeId: "A18923",
  email: "jaredwinick@koverse.com",
  firstName: "Jared",
  …
}
Network Events Data Set
{
  id: "a18923",
  eventType: "login",
  time: 1478557775010
  …
}
Find data across data sets. Understand its format and structure.
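A minimal sketch of what searching across data sets could look like, assuming a simple in-memory inverted index. The function names `build_index` and `search` are hypothetical, the elided fields are left out, and matching is assumed to be case-insensitive because the query "A18923" and the event id "a18923" differ only in case.

```python
from collections import defaultdict

# Sample records mirroring the two data sets on the slide (elided fields omitted).
DATA_SETS = {
    "Employees": [
        {"employeeId": "A18923", "email": "jaredwinick@koverse.com", "firstName": "Jared"},
    ],
    "Network Events": [
        {"id": "a18923", "eventType": "login", "time": 1478557775010},
    ],
}


def build_index(data_sets):
    """Map each lowercased field value to the (data set, position, field) where it occurs."""
    index = defaultdict(list)
    for name, records in data_sets.items():
        for position, record in enumerate(records):
            for field, value in record.items():
                index[str(value).lower()].append((name, position, field))
    return index


def search(index, term):
    """Return every location whose value equals the term, ignoring case."""
    return index.get(term.lower(), [])


index = build_index(DATA_SETS)
hits = search(index, "A18923")
for data_set, position, field in hits:
    print(f"{data_set}[{position}].{field}")
```

One query returns hits from both data sets along with the field each hit came from, which is what lets a user discover how the same identifier is named and formatted in different sources.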

11
Interactive Access for Everyone
• A Data Lake is strategic and should serve many different types
of users.
• It should offer self-service features.
• Together, these require supporting an interactive, multi-user load.

12
Multi-Level Access Control
• Every organization has data access control requirements these
days.
• Different environments/use cases call for different levels of
granularity:
– Data Set Level
– Column/Field Level
– Row/Record Level
• Far easier to engineer up front than to add on later.
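The three levels of granularity can be sketched as a single read path that applies each check in turn. This is an illustration only: the `Policy` class, the `read` function, and the sample employees are hypothetical and do not describe any particular product's access control model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Policy:
    """Hypothetical per-user policy covering the three levels of granularity."""
    data_sets: set                                           # data set level
    hidden_columns: set = field(default_factory=set)         # column/field level
    row_filter: Optional[Callable[[dict], bool]] = None      # row/record level


def read(data_set_name, records, policy):
    """Return only the rows and columns the policy allows."""
    if data_set_name not in policy.data_sets:
        raise PermissionError(f"no access to data set {data_set_name!r}")
    visible = []
    for record in records:
        if policy.row_filter is not None and not policy.row_filter(record):
            continue  # row/record level: skip rows the user may not see
        visible.append(
            {k: v for k, v in record.items() if k not in policy.hidden_columns}
        )
    return visible


# Made-up sample rows for illustration.
employees = [
    {"Name": "Ann", "DeptId": 10, "DOB": "1980-01-01", "email": "ann@example.com"},
    {"Name": "Bob", "DeptId": 20, "DOB": "1975-06-30", "email": "bob@example.com"},
]

analyst = Policy(
    data_sets={"Employees"},
    hidden_columns={"DOB"},
    row_filter=lambda r: r["DeptId"] == 10,
)
result = read("Employees", employees, analyst)
print(result)
```

Putting all three checks in the one function every read goes through is a small-scale version of the "engineer up front" point: retrofitting the row and column filters onto code that already reads records directly is much harder.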

13
Multi-Level Access Control
Example table columns: Name, DeptId, DOB, email
Access control levels shown: Data Set, Column, Row

14
Integration With Data Science Tools
• The ultimate point of a Data Lake is to “monetize” data
– For a corporation this is making or saving money
– For a government this is better serving your citizens
– For a research organization this is solving new problems/answering
previously unanswered questions
• Need to be able to analyze and transform data sets into new
data sets
• Analyses range from BI queries to text analytics to machine learning.

15
Integration With Data Science Tools
A Data Lake needs to support multiple internal analytic
“customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists
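As a rough illustration of the SQL path for analysts, the sketch below loads a few event records into an in-memory SQLite database and runs an aggregate query whose result becomes a new data set. SQLite is only a stand-in chosen to keep the example self-contained; the table name, the second user id, and the extra events are made up.

```python
import sqlite3

# Hypothetical event records; only the field names come from the deck's example.
events = [
    {"id": "a18923", "eventType": "login", "time": 1478557775010},
    {"id": "a18923", "eventType": "logout", "time": 1478557779010},
    {"id": "b20017", "eventType": "login", "time": 1478557781010},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE network_events (id TEXT, eventType TEXT, time INTEGER)")
conn.executemany(
    "INSERT INTO network_events VALUES (:id, :eventType, :time)", events
)

# A BI-style aggregate whose result is itself a new, smaller data set.
rows = conn.execute(
    "SELECT id, COUNT(*) AS events FROM network_events GROUP BY id ORDER BY id"
).fetchall()
events_per_user = [{"id": user_id, "events": count} for user_id, count in rows]
print(events_per_user)
conn.close()
```

The same records could equally be handed to Spark or a notebook; the point is that each kind of user reaches the data through the tool they already know.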

16
Abstractions
• Provide a level of abstraction over your data
– Data Sets / Collections
– Records / Rows
– Transformations
• Enables a consistent API for interacting with any data
regardless of its shape, size, and content
– Reusability
– Increased development speed
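The three abstractions named above might be modeled along these lines. This is a sketch under assumptions: the class names `DataSet` and `Transformation`, the `apply` method, and the `logins_only` function are invented for illustration and are not an actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]  # a record is just a mapping of field name to value


@dataclass
class DataSet:
    """A named collection of records, whatever their shape."""
    name: str
    records: List[Record] = field(default_factory=list)


@dataclass
class Transformation:
    """A reusable function from input records to output records."""
    name: str
    fn: Callable[[Iterable[Record]], Iterable[Record]]

    def apply(self, source: DataSet, output_name: str) -> DataSet:
        # The result is a new data set; the source is left unchanged.
        return DataSet(output_name, list(self.fn(source.records)))


def logins_only(records):
    """Keep only records whose eventType is 'login'."""
    return (r for r in records if r.get("eventType") == "login")


events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login", "time": 1478557775010},
    {"id": "a18923", "eventType": "logout", "time": 1478557779010},
])
logins = Transformation("logins_only", logins_only).apply(events, "Logins")
print(logins.name, len(logins.records))
```

Because `Transformation` only depends on the `Record` and `DataSet` shapes, the same transformation can be reused on any data set that carries an `eventType` field.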

18
Architecture – High Level
Components: HDFS, Zookeeper, Accumulo, Spark, Koverse
A distributed key/value store like Apache Accumulo enables storage of very large
volumes of data while maintaining low-latency access.

19
Architecture – Distributed Key/Value Store Benefits
These benefits apply to Apache Accumulo, and likely also to Apache HBase,
Cassandra, and other similar systems
1. Easily scale to trillions of key/values
2. Distributed storage
– Parallel processing in Hadoop MapReduce or Spark
– Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
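Efficient range scans follow from keeping keys in sorted order: a scan is a binary search to the start key followed by a sequential read. The toy class below shows the idea in a single process. It is a hypothetical stand-in, not how Accumulo is implemented, and the key naming scheme is made up.

```python
import bisect


class SortedKeyValueStore:
    """Toy single-process stand-in for a sorted, distributed key/value store."""

    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyValueStore()
for key in ["user:b20017", "event:0002", "user:a18923", "event:0001"]:
    store.put(key, key.upper())

# ';' is the character right after ':' so this range covers every "user:" key.
user_rows = list(store.scan("user:", "user;"))
print(user_rows)
```

In a distributed store the sorted key space is split into contiguous ranges held by different servers, which is what allows the same scan pattern to scale out and to feed parallel jobs.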

20
Architecture - Details
Accumulo tables: Record Table, Index Table, Statistics/Aggregations Table
Koverse: low latency R/W
Spark: efficient range scans
Apps: Users/Apps use REST
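The deck names the three tables but does not describe how they are laid out, so the following is only a guess at how they might cooperate: every write stores the record, indexes its values, and updates simple per-field counts. All names and the table layouts are assumptions for illustration.

```python
from collections import Counter, defaultdict

record_table = {}                 # record id -> record
index_table = defaultdict(set)    # lowercased value -> ids of records containing it
stats_table = Counter()           # field name -> number of records containing it


def write(record_id, record):
    """Store a record and keep the index and statistics tables in step.

    Simplification: overwriting an existing id is not handled.
    """
    record_table[record_id] = record
    for field_name, value in record.items():
        index_table[str(value).lower()].add(record_id)
        stats_table[field_name] += 1


def lookup(term):
    """Resolve a search term to full records via the index table."""
    record_ids = sorted(index_table.get(term.lower(), ()))
    return [record_table[rid] for rid in record_ids]


write("r1", {"id": "a18923", "eventType": "login"})
write("r2", {"id": "a18923", "eventType": "logout"})

print(lookup("A18923"))
print(stats_table["eventType"])
```

Maintaining the index and statistics at write time is what would let search and summary views be answered with small lookups instead of scans over the full record table.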
