Setting Up the Data Lake

Joe Caserta
President
Caserta Concepts
@joe_Caserta
Philadelphia

Launched Data Science
Data Interaction and Cloud practices
Awarded for getting data out of SAP
for enterprise data analytics
Top 20 Most Powerful
Big Data Companies
Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The Data
Warehouse ETL Toolkit (Wiley)
Caserta Concepts founded
Web log analytics solution published in Intelligent
Enterprise
Partnered with Big Data vendors Cloudera,
Hortonworks, IBM, Cisco, Datameer, Basho, and more…
Launched Training practice, teaching and mentoring
data warehousing concepts world-wide
Laser focus on extending Data Warehouses with Big
Data solutions
2001
2010
2004
2012
2009
2014
Launched Big Data Warehousing (BDW)
Meetup - NYC 3,000+ Members
2013
2015
Established best practices for big data ecosystem
implementation – Healthcare, Finance, Insurance
Dedicated to Data Governance Techniques
on Big Data (Innovation)
America’s Fastest Growing Private
Companies - Ranked #740
1996 – Dedicated to Dimensional Data Warehousing
1986 – 1996 OLTP Data Modeling and Reporting.

About Caserta Concepts
• Consulting firm focused on Data Innovation, Modern Data Engineering to solve
highly complex business data challenges
• Award-winning company
• Internationally recognized work force
• Mentoring, Training, Knowledge Transfer
• Strategy, Architecture, Implementation
• Innovation Partner
• Transformative Data Strategies
• Modern Data Engineering
• Advanced Architecture
• Leader in architecting and implementing enterprise data solutions
• Data Warehousing
• Business Intelligence
• Big Data Analytics
• Data Science
• Data on the Cloud
• Data Interaction & Visualization
• Strategic Consulting
• Technical Design
• Build & Deploy Solutions

Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance, Healthcare
& Insurance

Partners

Awards & Recognition

The Future of Data is Today
As a Mindful Cyborg, Chris
Dancy utilizes up to
700 sensors, devices,
applications, and services to
track, analyze, and optimize as
many areas of his existence.
Data quantification enables
him to see the connections of
otherwise invisible data,
resulting in dramatic upgrades
to his health, productivity, and
quality of life.

The Progression of Data Analytics
• Descriptive Analytics: What happened? (Reports)
• Diagnostic Analytics: Why did it happen? (Correlations)
• Predictive Analytics: What will happen? (Predictions)
• Prescriptive Analytics: How can we make it happen? (Recommendations)
Chart axes: Data Analytics Sophistication, Business Value
Source: Gartner

The Progression of Data Analytics
Source: Gartner
Reports → Correlations → Predictions → Recommendations
Cognitive Computing / Cognitive Data Analytics

Traditional Data Warehousing
• Design – Top Down, Bottom Up
  • Customer Interviews and requirements gathering
  • Data Profiling
• Create Data Models
  • Facts and Dimensions
• Extract Transform Load (ETL)
  • Copy data from sources to data warehouse
• Data Governance
  • Stewardship, business rules, data quality
• Put a BI Tool on Top
  • Design semantic layer
  • Develop reports

A Day in the Life
• Onboarding new data is difficult!
• Rigid Structures and Data Governance
• Disconnected/removed from business requirements:
“Hey – I need to analyze some new data”
  • IT conforms and profiles the data
  • Loads it into dimensional models
  • Builds a semantic layer nobody is going to use
  • Creates a dashboard we hope someone will notice
…and then you can access your data 3-6 months later to see if it has value!

Houston, we have a Problem: Data Sprawl
• There is one application for every 5-10 employees generating copies of
the same files leading to massive amounts of duplicate idle data strewn all
across the enterprise. - Michael Vizard, ITBusinessEdge.com
• Employees spend 35% of their work time searching for information...
finding what they seek 50% of the time or less.
- “The High Cost of Not Finding Information,” IDC

The Paradigm Shift
Big Data is not the problem
It’s the Change Agent
OLD WAY:
• Structure → Ingest → Analyze
• Fixed Capacity
• Monolithic
NEW WAY:
• Ingest → Analyze → Structure
• Dynamic Capacity
• Ecosystem
RECIPE:
• Cloud
• Data Lake
• Polyglot Warehouse
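
The "new way" on this slide (ingest first, structure last) is often called schema-on-read. A minimal sketch, assuming hypothetical field names (`member_id`, `claim_amount`) and a hypothetical helper `read_with_schema`; raw lines are stored untouched and a structure is applied only when an analysis needs it:

```python
import json

# Raw events are ingested exactly as they arrive; no schema is enforced up front.
RAW_LINES = [
    '{"member_id": "M1", "claim_amount": "120.50", "note": "first visit"}',
    '{"member_id": "M2", "claim_amount": "80"}',
    '{"member_id": "M3"}',  # an incomplete record is still ingested
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields and cast them, None if missing."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Structure is decided only when an analysis needs it (hypothetical schema).
CLAIMS_SCHEMA = {"member_id": str, "claim_amount": float}
claims = read_with_schema(RAW_LINES, CLAIMS_SCHEMA)
total = sum(r["claim_amount"] for r in claims if r["claim_amount"] is not None)
print(total)
```

A different analysis could read the same raw lines with a different schema (for example, one that keeps `note`) without reloading anything.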

The Evolution of Modern Data Engineering
[Diagram. Source systems: Enrollments, Claims, Finance, Others… Traditional path: ETL into the Traditional EDW, serving Traditional BI (Ad-Hoc Query, Canned Reporting). Big data path: ETL into the Data Lake, a Horizontally Scalable Environment - Optimized for Analytics, built on the Hadoop Distributed File System (HDFS) across nodes N1 to N5 with Spark, MapReduce, and Pig/Hive, alongside NoSQL Databases; it serves Big Data Analytics, Data Science, and Ad-Hoc/Canned Reporting.]

Innovation is the only sustainable competitive advantage a company can have
Innovations may fail, but companies that don’t innovate will fail

Technology:
• Scalable distributed storage → Hadoop, S3
• Pluggable fit-for-purpose processing → Spark, EMR
Functional Capabilities:
• Remove barriers from data ingestion and analysis
• Storage and processing for all data
• Tunable Governance

Data Governance for the Data Lake
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
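
The Information Lifecycle Management item (retention, purge schedule, archiving) can be sketched as a small policy function. The 90-day and 7-year thresholds and the names `RetentionPolicy` and `lifecycle_action` are assumptions made for illustration, not values from this deck:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    archive_after_days: int  # move to cheaper storage after this age
    purge_after_days: int    # delete after this age


def lifecycle_action(created: date, today: date, policy: RetentionPolicy) -> str:
    """Return 'keep', 'archive', or 'purge' based on the age of a dataset."""
    age_days = (today - created).days
    if age_days >= policy.purge_after_days:
        return "purge"
    if age_days >= policy.archive_after_days:
        return "archive"
    return "keep"


# Hypothetical policy: archive after 90 days, purge after roughly 7 years.
policy = RetentionPolicy(archive_after_days=90, purge_after_days=7 * 365)
print(lifecycle_action(date(2015, 1, 1), date(2015, 6, 1), policy))
```

In practice a scheduler would run such a function over catalog entries; keeping the policy as data makes it easy to tune per data set.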

Data Governance for the Data Lake
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home-grown; Drools?)
• Quality checks are not only SQL: machine learning, Pig, and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
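
"Data detection and masking on unstructured data upon ingest" could look roughly like the sketch below. The two regular expressions (US Social Security number and e-mail address) and the function name `mask_on_ingest` are illustrative assumptions; real detection needs far broader coverage than two patterns:

```python
import re

# Hypothetical detection patterns; real deployments need far broader coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def mask_on_ingest(text):
    """Return (masked_text, sorted list of sensitive data types detected)."""
    found = []
    for label, pattern in PATTERNS.items():
        # Replace every match with a placeholder such as <SSN> or <EMAIL>.
        text, count = pattern.subn(f"<{label.upper()}>", text)
        if count:
            found.append(label)
    return text, sorted(found)


note = "Caller 123-45-6789 asked us to reply to jane.doe@example.com today."
masked, kinds = mask_on_ingest(note)
print(masked)
print(kinds)
```

Returning the detected types alongside the masked text lets the ingest process tag the catalog entry as sensitive at the same time.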

The Big Data Pyramid
[Figure: pyramid of data layers annotated with two columns, Usage Pattern and Data Governance.]
• Usage Pattern: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Arbitrary/Ad-hoc Queries and Reporting
• Data Governance: Metadata, ILM, Security (shown on two layers); Data Catalog; Data Integration; Data Quality and Monitoring; Fully Governed (trusted)

Peeling back the layers… The Landing Area
• Source data in its full fidelity
• Programmatically Loaded
• Partitioned for data processing
• No governance other than catalog and ILM (Security and Retention)
Consumers: ETL Processes, Applications
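
A minimal sketch of a programmatic, partitioned load into the Landing Area. The directory layout, the `load_date=` partition key, and the names `landing_path` and `land_file` are assumptions for illustration; the local filesystem stands in for HDFS or S3. The payload is written byte for byte, and a small catalog entry is returned:

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def landing_path(root: Path, source: str, dataset: str, load_date: date) -> Path:
    """Build a partitioned landing directory (Hive-style key=value folder)."""
    return root / source / dataset / f"load_date={load_date.isoformat()}"


def land_file(root, source, dataset, load_date, name, payload: bytes) -> dict:
    """Write the payload unchanged and return a minimal catalog entry."""
    target_dir = landing_path(root, source, dataset, load_date)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # full fidelity: no parsing, no transformation
    return {
        "path": str(target.relative_to(root)),
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


with tempfile.TemporaryDirectory() as tmp:
    entry = land_file(Path(tmp), "claims_system", "claims", date(2015, 3, 1),
                      "claims_0001.csv", b"id,amount\n1,10.0\n")
    stored = (Path(tmp) / entry["path"]).read_bytes()
    print(entry["path"])
```

The checksum in the catalog entry gives downstream consumers a cheap way to confirm that what they read is what the source delivered.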

Data Lake
• Enriched, lightly integrated
• Data is accessible in the Hive Metastore
  • Either processed into tabular relations
  • Or via Hive SerDes directly upon Raw Data
• Partitioned for data access
• Governance additionally includes a guarantee of completeness
Consumers: Data Scientists, ETL Processes,
Applications, Data Analysts
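
One way to back a "guarantee of completeness" is to reconcile control totals from the source against what was actually loaded. A minimal sketch, assuming the source supplies expected row counts per partition; the function name and the sample counts are hypothetical:

```python
def completeness_report(expected_counts, loaded_counts):
    """Compare expected vs. loaded row counts per partition.

    Returns a dict of partition -> (expected, loaded) for every mismatch,
    including partitions that are missing or unexpected.
    """
    mismatches = {}
    for partition in sorted(set(expected_counts) | set(loaded_counts)):
        expected = expected_counts.get(partition, 0)
        loaded = loaded_counts.get(partition, 0)
        if expected != loaded:
            mismatches[partition] = (expected, loaded)
    return mismatches


# Hypothetical control totals supplied by the source system vs. rows loaded.
expected = {"2015-03-01": 1000, "2015-03-02": 1200, "2015-03-03": 900}
loaded = {"2015-03-01": 1000, "2015-03-02": 1150}
report = completeness_report(expected, loaded)
is_complete = not report
print(report)
```

An empty report means every partition reconciles; anything else can block publication of the data set to consumers.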

Data Science Workspace
• No barrier for onboarding and analysis of new data
• Blending of new data with entire Data Lake, including the Big Data
Warehouse
• Data Scientists enrich data with insight
Consumers: Data Scientists

Big Data Warehouse
• Data is Fully Governed
• Data is Structured
• Partitioned/tuned for data access
• Governance includes a guarantee of completeness and
accuracy
Consumers: Data Scientists, ETL Processes, Applications,
Data Analysts, and Business Users (the masses)

The Refinery
[Diagram: “Cool new data” enters at the Landing Area; the layers shown are the Landing Area, Data Lake, Data Science Workspace, and BDW; the output is “New Insights”.]
• The feedback loop between Data Science and Data Warehouse is critical
• Successful work products of science must graduate into the appropriate layers of the Data Lake
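
Graduation between layers can be sketched as a check against the governance each layer requires. The requirements below paraphrase the layer descriptions in this deck (catalog and ILM for landing, plus completeness for the Data Lake, plus accuracy for the Big Data Warehouse); the check names and function names are hypothetical labels:

```python
# Governance each layer requires, paraphrased from the layer descriptions in
# this deck; the check names themselves are hypothetical labels.
ZONE_REQUIREMENTS = {
    "landing_area": {"catalog", "ilm"},
    "data_lake": {"catalog", "ilm", "completeness"},
    "big_data_warehouse": {"catalog", "ilm", "completeness", "accuracy"},
}


def missing_for(zone, passed_checks):
    """Return the governance checks still missing before promotion to zone."""
    return sorted(ZONE_REQUIREMENTS[zone] - set(passed_checks))


def can_graduate(zone, passed_checks):
    """True when every check the target zone requires has been passed."""
    return not missing_for(zone, passed_checks)


# A hypothetical data science work product and the checks it has passed so far.
model_scores = {"catalog", "ilm", "completeness"}
print(can_graduate("data_lake", model_scores))
print(missing_for("big_data_warehouse", model_scores))
```

Reporting what is missing, rather than only yes or no, tells the data scientist what work remains before the product can move up.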

Polyglot Warehouse
We promote the concept that the Big Data Warehouse may live in one or
more platforms
• Full Hadoop Solutions
• Hadoop plus MPP or Relational
Supplemental technologies:
• NoSQL: Columnar, Key-value, Time series, Graph
• Search Technologies

Hadoop is the Data Warehouse?
• Hadoop can be the entire data pyramid platform including
landing, data lake and the Big Data Warehouse
• Especially serves as the Data Lake and “Refinery”
• Query engines such as Hive and Impala provide SQL support

Define and Find Your Data
• Data Classification
  • Import/Define business taxonomy
  • Capture/Automate relationships between data sets
  • Integrate metadata with other systems
• Centralized Auditing
  • Security access information for every application with data
  • Operational information for execution
• Search & Lineage (Browse)
  • Predefined navigation paths to explore data
  • Text-based search for data elements across data ecosystem
  • Browse visualization of data lineage
• Security & Policy Engine
  • Rationalize compliance policy at run-time
  • Prevent data derivation based on classification (re-classification)
Key Requirements
• Automatic data discovery
• Metadata tagging
• Classification

Caution: Assembly Required
• Some of the most promising tools are brand new or in incubation!
• Enterprise big data implementations typically combine products with custom-built components
Tools: Data Integration, Data Catalog & Governance, Emerging Solutions
People, processes, and business commitment are still critical!

Sample Architecture
[Diagram labels]
• Business Glossary (Collibra API): Terms, Policies, Workflows
• API/Exchange Connector, MDM, Power Center, Data Quality, Metadata Manager, Active VOS
• System of Records (Data Sources): Salesforce, SAP, Workday, Oracle JDE
• Analytics: ODS, Data Science, Data Lake, DW
• MDM Domains: Vendor, COA, HR, Customer, Product
• Developer Portal, API Management, Security Monitoring & Analytics, SLA Management
• Data Catalog: API, Linked/Federated Data, Self Service Portal, Search/Visualization, Security & Entitlements, Publishing Workflows
Numbered relationships in the diagram:
1. Data sources are managed through the MDM
2. Business glossary terms are mapped to data sources
3. Business glossary describes API attributes
4. Data source models are used to develop the APIs
5. All access from the Data Catalog is through APIs
6. Data catalog utilizes the business glossary to describe the data elements
7. Data catalog uses MDM for lineage
8. Data catalog sources are defined through and connected via APIs

Think Ecosystem, Not Tech Stack
“…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler

Move to the Cloud?
Existing On-Premise Solution
• Challenges with operations of Hadoop servers in Data Center
• Increasing infrastructure complexity
• Keeping up with data growth
Cloud Advantages
• Reduced upfront capital investment
• Faster speed to value
• Elasticity
“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS

Data Analytics on the Cloud
AWS and other cloud providers present a very powerful design
pattern:
• S3 serves as the storage layer for the Data Lake
• EMR (Elastic Hadoop) provides the Refinery; most clusters can be ephemeral
• The Active Set is stored in Redshift (MPP) or relational platforms
Eliminate massive on-premise appliance footprint

A Candidate Future Landscape
[Diagram labels: Data Providers (Near Real-time, Batch), API, Queue, Landing, Data Lake, BDW, Data Science, Data Science Clusters, EDW, Graph, RDS, Metastore.]

Come out and Play
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices, experiences
• 3,000+ Members
http://www.meetup.com/Big-Data-Warehousing/
Examples of Previous Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring
with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/Claudia
Perlich & Revolution Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN

Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

The Data Scientist Winning Trifecta
• Modern Data Engineering/Data Preparation
• Domain Knowledge/Business Expertise
• Advanced Mathematics/Statistics

Electronic Medical Records (EMR) Analytics
[Diagram: on the Edge Node, about 100k files (variants 1..n) are packed by Forqlift into Sequence Files and loaded with HDFS Put into the Hadoop Data Lake. There a Pig EMR Processor with a UDF Library, run through a Python Wrapper, produces parquet tables: Provider, Member, and 15 more entities. Sqoop connects the Provider and Member tables, plus more dimensions and facts, to the Netezza DW.]
• Receive Electronic Medical Records from various providers in various formats
• Address Hadoop ‘small file’ problem
• No barrier for onboarding and analysis of new data
• Blend new data with Data Lake and Big Data Warehouse
• Machine Learning
• Text Analytics
• Natural Language Processing
• Reporting
• Ad-hoc queries
• File ingestion
• Information Lifecycle Mgmt
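
The Hadoop "small file" problem is usually handled by packing many small files into a few large container files. The diagram on this slide names Forqlift and sequence files for that step; the sketch below swaps in a tar archive from Python's standard library purely to illustrate the idea, and the file names and contents are hypothetical:

```python
import io
import tarfile


def pack_small_files(files):
    """Pack {name: bytes} into one in-memory tar archive and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack(archive_bytes):
    """Read every member back as {name: bytes}."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive.getmembers()}


# Three hypothetical small record files become one container file.
small_files = {f"record_{i}.xml": f"<record id='{i}'/>".encode() for i in range(3)}
packed = pack_small_files(small_files)
restored = unpack(packed)
print(len(restored), "files restored from one archive")
```

The container keeps each original file name as a key, so individual records stay addressable after packing.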
