Are you spending your summer down by the Data Lake? If so, you want to make certain that the lake is clean and that you pick the best place to swim. The Data Lake is the new analytical paradise that many organizations are banking on to become the answer to improved insights, so you need to prevent the lake from turning swampy.
In this month’s RWDG webinar, Bob Seiner and a special guest will focus on how to govern the data in your Data Lake. Bob’s interaction with his guests is always lively and fact-filled, and this month they will help you swim past the major barriers to providing an effective and valuable data resource.
In this webinar, Bob and his guest will discuss:
- The relationship between Data Lakes and Data Governance
- Preventing your Data Lake from becoming a Data Swamp
- Governing the Metadata associated with your Data Lake
- Leveraging governed data to provide trustworthy Analytics
- Measuring the value of a governed Data Lake
3. 4 big trends driving the need for a new architecture
- Separation of Compute & Storage
- Hybrid – Multi-cloud environments
- Self-service data across the enterprise
- Rise of the object store
4. Data Ecosystem - Beta Data Ecosystem 1.0
[Diagram: COMPUTE and STORAGE blocks]
5. Data Orchestration Framework
- Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, FUSE Interface
- Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
6. Alluxio’s Approach to Big Data Federation
- Unified Access - Acts as a “virtual data lake.” Files are accessed in Alluxio’s global namespace as if they resided in a single system
- Performant - Provides fast local access to important and frequently used data, without maintaining a permanent copy of all data
- Modern, flexible architecture - Promotes separation of compute from storage
- Storage Cost Optimization - Transparently reads and writes data directly from the source system, and so does not need to create a permanent copy of the data
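The "global namespace" idea above can be illustrated with a small sketch. This is not Alluxio's API: the `MountTable` class, its method names, and the example bucket and host names are all hypothetical, chosen only to show how one virtual path space might be mapped onto several underlying storage systems by longest-prefix matching.

```python
from typing import Dict, Optional


class MountTable:
    """Hypothetical mapping from one virtual namespace to several storage URIs."""

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, virtual_prefix: str, storage_uri: str) -> None:
        # Store prefixes without a trailing slash so matching is uniform.
        self._mounts[virtual_prefix.rstrip("/")] = storage_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        # The longest mounted prefix that covers the path wins.
        best: Optional[str] = None
        for prefix in self._mounts:
            covers = virtual_path == prefix or virtual_path.startswith(prefix + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount covers {virtual_path}")
        return self._mounts[best] + virtual_path[len(best):]


# Illustrative mounts; the bucket and host names are made up.
table = MountTable()
table.mount("/sales", "s3://example-bucket/sales")
table.mount("/logs", "hdfs://namenode:8020/var/logs")
resolved = table.resolve("/sales/2019/q1.parquet")
```

In this sketch, a client only ever sees paths such as `/sales/2019/q1.parquet`; which system actually holds the bytes is decided by the mount table.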
7. Key Innovations of the Data Orchestration Layer
- Data Elasticity with a unified namespace: Abstract data silos & storage systems to independently scale data with compute
- Data Accessibility for popular APIs & API translation: Run Spark, Hive, Presto, ML workloads on your data located anywhere
- Data Locality with Intelligent Multi-tiering: Accelerate big data workloads with transparent tiered local data
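To make "transparent tiered local data" more concrete, here is a minimal sketch of a two-tier read-through cache. It is an assumption-laden illustration, not how Alluxio implements multi-tiering: the `TieredCache` class, the entry-count capacities, and the LRU demotion policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple


class TieredCache:
    """Hypothetical read-through cache with a fast tier and a slower tier."""

    def __init__(self, fetch: Callable[[str], bytes],
                 fast_capacity: int, slow_capacity: int) -> None:
        self._fetch = fetch  # reads from the source system on a miss
        # Each tier is (entries in LRU order, capacity counted in entries).
        self._tiers: List[Tuple[OrderedDict, int]] = [
            (OrderedDict(), fast_capacity),
            (OrderedDict(), slow_capacity),
        ]

    def read(self, key: str) -> bytes:
        for tier, _ in self._tiers:
            if key in tier:
                value = tier.pop(key)
                self._put(0, key, value)  # promote hot data to the fast tier
                return value
        value = self._fetch(key)  # miss in every tier: go to the source
        self._put(0, key, value)
        return value

    def _put(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self._tiers):
            return  # evicted from the cache; the source still holds the data
        tier, capacity = self._tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > capacity:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._put(level + 1, old_key, old_value)  # demote one tier down
```

The caller only invokes `read`; promotion, demotion, and eviction happen behind that call, and no permanent copy is kept because evicted entries can always be fetched again from the source.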
8. Use Cases Data Orchestration Enables
- Run big data workloads in hybrid cloud environments
- Accelerate big data frameworks on the public cloud
- Enable big data on object stores across single or multiple clouds
[Diagram labels: Hive, Spark, Presto, Alluxio; On premise; Any Cloud / Multi Cloud; Same instance / container; Same data center / region; Standalone]
9. Incredible Open Source Momentum with growing community
- 900+ contributors & growing
- 3760+ Git Stars
- Apache 2.0 Licensed
- Hundreds of thousands of downloads
- Join the conversation on Slack: alluxio.org/slack