The 20th annual Enterprise Data World (EDW) Conference took place in San Diego, April 17-21. It is recognized as the most comprehensive educational conference on data management in the world.
Joe Caserta was a featured presenter. His session, "Evolving from the Data Warehouse to Big Data Analytics - the Emerging Role of the Data Lake," highlighted the challenges and steps needed to become a data-driven organization.
Joe also participated in two panel discussions during the show:
• "Data Lake or Data Warehouse?"
• "Big Data Investments Have Been Made, But What's Next?"
For more information on Caserta Concepts, visit our website at http://casertaconcepts.com/.
3. #EDW16 @joe_Caserta
About Caserta Concepts
• Consulting on Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized workforce
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
The Future of Data is Today
As a Mindful Cyborg, Chris Dancy utilizes up to 700 sensors, devices, applications, and services to track, analyze, and optimize as many areas of his existence. Data quantification enables him to see the connections of otherwise invisible data, resulting in dramatic upgrades to his health, productivity, and quality of life.
The Evolution of Data Analytics
• Descriptive Analytics – What happened?
• Diagnostic Analytics – Why did it happen?
• Predictive Analytics – What will happen?
• Prescriptive Analytics – How can we make it happen?
(Axes: Data Analytics Sophistication vs. Business Value. Source: Gartner)
The Evolution of Data Analytics
Reports → Correlations → Predictions → Recommendations (increasing data analytics sophistication)
Cognitive Computing / Cognitive Data Analytics
(Source: Gartner)
Traditional Data Analytics Methods
• Design – Top Down, Bottom Up
• Customer Interviews and requirements gathering
• Data Profiling
• Create Data Models
• Facts and Dimensions
• Extract Transform Load (ETL)
• Copy data from sources to data warehouse
• Data Governance
• Stewardship, business rules, data quality
• Put a BI Tool on Top
• Design semantic layer
• Develop reports
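The traditional ETL flow above can be sketched end to end; a minimal toy example in plain Python (all table and column names are hypothetical illustrations), conforming a dimension and loading a fact keyed by its surrogate key:

```python
# Toy dimensional ETL: profile/conform source rows, maintain a dimension,
# then load a fact table keyed by the dimension's surrogate key.
# Table and column names are hypothetical.

source_rows = [
    {"customer": " ACME corp ", "amount": "120.50"},
    {"customer": "Acme Corp", "amount": "75.00"},
]

# Extract + Transform: conform the natural key and type the measure
def conform(row):
    return {"customer": row["customer"].strip().title(),
            "amount": float(row["amount"])}

dim_customer = {}   # natural key -> surrogate key
fact_sales = []     # fact table rows: (customer_sk, amount)

# Load: surrogate-key lookup with insert-if-missing
for row in map(conform, source_rows):
    sk = dim_customer.setdefault(row["customer"], len(dim_customer) + 1)
    fact_sales.append((sk, row["amount"]))

print(dim_customer)  # {'Acme Corp': 1}
print(fact_sales)    # [(1, 120.5), (1, 75.0)]
```

Note how conforming the natural key first is what lets both source spellings land on one dimension row, which is the point of the profiling and modeling steps.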
A Day in the Life
• Onboarding new data is difficult!
• Rigid structures and data governance
• Disconnected/removed from business requirements:
"Hey – I need to analyze some new data"
IT conforms and profiles the data
Loads it into dimensional models
Builds a semantic layer nobody is going to use
Creates a dashboard we hope someone will notice
…and then you can access your data 3-6 months later to see if it has value!
Houston, we have a Problem: Data Sprawl
• There is one application for every 5-10 employees generating copies of the same files, leading to massive amounts of duplicate, idle data strewn all across the enterprise. – Michael Vizard, ITBusinessEdge.com
• Employees spend 35% of their work time searching for information... finding what they seek 50% of the time or less. – "The High Cost of Not Finding Information," IDC
The Paradigm Shift
Big Data is not the problem, it's the change agent.
OLD WAY:
• Structure → Ingest → Analyze
• Fixed capacity
• Monolithic
NEW WAY:
• Ingest → Analyze → Structure
• Dynamic capacity
• Ecosystem
RECIPE:
• Data Lake
• Cloud
• Polyglot Data Landscape
The Evolution of Data Analytics
[Diagram: sources (Enrollments, Claims, Finance, Others) feed via ETL into a traditional EDW (ad-hoc/canned reporting, traditional BI) and into a Data Lake — a horizontally scalable environment optimized for analytics — built on the Hadoop Distributed File System (HDFS, nodes N1-N5) with Spark, MapReduce, and Pig/Hive, plus NoSQL databases, supporting ad-hoc query, canned reporting, big data analytics, and data science.]
Innovation is the only sustainable competitive advantage a company can have.
Innovations may fail, but companies that don't innovate will fail.
Data Munging Versus Reporting
[Chart: data governance requirement (minimum to maximum) vs. availability requirement (slow to fast).]
Does data munging in a data science lab need the same restrictive governance as enterprise reporting?
• Organization – this is the 'people' part: establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata – definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security – identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring – data must be complete and correct; measure, improve, certify
• Business Process Integration – policies around data frequency, source availability, etc.
• Master Data Management – ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM) – data retention, purge schedule, storage/archiving
Data Governance for Big Data
• Add Big Data to the overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home-grown; Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
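The distributed data quality checks described above boil down to applying rule functions over records and counting violations; a minimal sketch in plain Python (the rules and record fields are hypothetical; at scale the same logic would run as a Spark or MapReduce job):

```python
# Minimal rule-based data quality check, of the kind that would be
# distributed via Spark/MapReduce at scale. Rules and fields are
# hypothetical illustrations.

rules = {
    "member_id_present": lambda r: bool(r.get("member_id")),
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
}

records = [
    {"member_id": "M1", "amount": 10.0},
    {"member_id": "",   "amount": -5.0},
]

# Measure: count rule violations across the dataset
violations = {name: sum(0 if rule(r) else 1 for r in records)
              for name, rule in rules.items()}
print(violations)  # {'member_id_present': 1, 'amount_non_negative': 1}
```

Because each rule is a pure function of one record, the per-record checks partition trivially across workers; only the final violation counts need a reduce step.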
The Big Data Pyramid
[Diagram: usage patterns from bottom to top — Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Arbitrary/Ad-hoc Queries and Reporting. Data governance deepens correspondingly — Metadata, ILM, Security; Data Catalog; Data Quality and Monitoring; Data Integration; Fully Governed (trusted).]
The Data Refinery
• The feedback loop between Data Science, Data Warehouse and Data Lake is critical
• Ephemeral Data Science Workbench
• Successful work products of science must graduate into the appropriate layers of the Data Lake
[Diagram: cool new data passes through the governance refinery and emerges as new insights.]
Define and Find Your Data
• Data Classification
• Import/Define business taxonomy
• Capture/Automate relationships between data sets
• Integrate metadata with other systems
• Centralized Auditing
• Security access information for every application with data
• Operational information for execution
• Search & Lineage (Browse)
• Predefined navigation paths to explore data
• Text-based search for data elements across data ecosystem
• Browse visualization of data lineage
• Security & Policy Engine
• Rationalize compliance policy at run-time
• Prevent data derivation based on classification (re-classification)
Key Requirements
• Automatic data discovery
• Metadata tagging
• Classification
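The classification and policy-engine requirements can be illustrated with a toy catalog; a minimal sketch in plain Python (the dataset names, tags, and policy are all hypothetical, not any particular product's API):

```python
# Sketch of metadata tagging plus a run-time policy check, as a data
# catalog / policy engine might apply it. Entries and tags are
# hypothetical illustrations.

catalog = {
    "claims.member_ssn":  {"classification": "PII",      "lineage": "claims_feed"},
    "claims.paid_amount": {"classification": "internal", "lineage": "claims_feed"},
}

def allowed(dataset, user_clearances):
    """Policy rationalized at run-time: PII columns require PII clearance."""
    needed = catalog[dataset]["classification"]
    return needed != "PII" or "PII" in user_clearances

print(allowed("claims.member_ssn", {"internal"}))   # False
print(allowed("claims.paid_amount", {"internal"}))  # True
```

Evaluating the classification at query time, rather than baking grants into each copy of the data, is what lets a re-classification take effect everywhere at once.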
Caution: Assembly Required
Some of the most promising tools are brand new or in incubation!
Enterprise big data implementations typically combine products with custom-built components.
People, processes and business commitment are still critical!
Tool categories: Data Integration, Data Catalog & Governance, Emerging Solutions
Why Spark
"Big Box" MDM tools vs. ROI?
• Prohibitively expensive, limited by licensing $$$
• Typically limited to the scalability of a single server
We ❤ Spark!
• Development, local or distributed, is identical
• Beautiful high-level APIs
• Databricks cloud is so easy
• Full universe of Python modules
• Open source and free**
• Blazing fast!
Spark has become our default processing engine for a myriad of engineering problems.
Spark map operations
Cleansing, transformation, and standardization of both interaction and customer data.
Amazing universe of Python modules:
• Address Parsing: usaddress, postal-address, etc.
• Name Hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
And all the goodies of the standard library!
We can now parallelize our workload against a large number of machines:
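The code sample that followed on the slide did not survive extraction; as a stand-in, here is a minimal sketch of a map-style standardization function in plain Python. Since development local or distributed is identical, the same function could be handed to `rdd.map()` in PySpark (the record shape is a hypothetical illustration):

```python
# Sketch of a Spark-style map operation for record standardization.
# The same pure function runs locally via map() or distributed via
# rdd.map(); the (name, address, state) record shape is hypothetical.

def standardize(record):
    """Cleanse one (name, address, state) tuple."""
    name, address, state = record
    return (name.strip().title(), address.strip(), state.strip().upper())

records = [
    (" jane doe ", "123 Main St ", " ny"),
    ("JOHN SMITH ", " 9 Elm Ave", "ca "),
]
cleaned = list(map(standardize, records))
# In PySpark: cleaned = sc.parallelize(records).map(standardize).collect()
print(cleaned[0])  # ('Jane Doe', '123 Main St', 'NY')
```

In practice the body of `standardize` is where modules like usaddress or fuzzy would be called, one record at a time, which is exactly why the workload parallelizes cleanly.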
Move to the Cloud?
Existing On-Premise Solution
• Challenges with operations of data servers in the data center
• Increasing infrastructure complexity
• Keeping up with data growth
Cloud Advantages
• Reduced upfront capital investment
• Faster speed to value
• Elasticity
"Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on." – Matt Wood, AWS
Think Ecosystem, Not Tech Stack
"…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." – Martin Fowler
Knowledge Sharing
• CIL – Caserta Innovations Lab
• Big Data Warehousing Meetup: meet monthly to share data best practices and experiences; 3,700+ members
http://www.meetup.com/Big-Data-Warehousing/
http://www.slideshare.net/CasertaConcepts/
Examples of Previous Topics
• Data Governance, Compliance & Security in Hadoop w/Cloudera
• Real Time Trade Data Monitoring with Storm & Cassandra
• Predictive Analytics
• Exploring Big Data Analytics Techniques w/Datameer
• Using a Graph DB for MDM & Relationship Mgmt
• Data Science w/Claudia Perlich & Revolution Analytics
• Processing 1.4 Trillion Events in Hadoop
• Building a Relevance Engine using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed ETL & SQL w/Hadoop
• Intro to NoSQL w/10gen
The Data Scientist Winning Trifecta
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
High Volume Real-time Analytics Architecture
[Diagram: trade data is ingested via Kafka into a Storm cluster with a data quality rules engine; atomic data and aggregates feed d3.js real-time analytics and event monitors, with a Hadoop cluster alongside for higher-latency analytics.]
• The Kafka messaging system is used for ingestion
• Storm is used for real-time ETL and outputs atomic data and derived data needed for analytics
• Redis is used as a reference data lookup cache
• Real-time analytics are produced from the aggregated data
• Higher-latency ad-hoc analytics are done in Hadoop using Pig and Hive
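The real-time ETL stage described above amounts to a per-event transform plus incremental aggregation; a minimal sketch in plain Python with Kafka, Storm, and Redis replaced by in-memory stand-ins (the event shape is a hypothetical illustration):

```python
from collections import defaultdict

# Stand-in for the real-time ETL stage: per-event enrichment plus
# incremental aggregation. In production, Kafka feeds a Storm topology
# and Redis caches reference data; here both are in-memory stand-ins.

ref_cache = {"AAPL": "Apple Inc."}   # Redis stand-in: reference data lookup
aggregates = defaultdict(float)      # symbol -> running total notional

def on_trade(event):
    """Process one trade event: enrich, emit atomic record, aggregate."""
    symbol = event["symbol"]
    atomic = {"symbol": symbol,
              "name": ref_cache.get(symbol, "UNKNOWN"),
              "notional": event["price"] * event["qty"]}
    aggregates[symbol] += atomic["notional"]   # feeds real-time analytics
    return atomic                              # atomic data kept for ad-hoc

stream = [{"symbol": "AAPL", "price": 100.0, "qty": 10},
          {"symbol": "AAPL", "price": 101.0, "qty": 5}]
for ev in stream:
    on_trade(ev)
print(aggregates["AAPL"])  # 1505.0
```

Keeping both outputs — the atomic record and the running aggregate — mirrors the split in the architecture: aggregates drive the d3.js dashboards while atomic data lands in Hadoop for Pig/Hive analysis.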
Electronic Medical Records (EMR) Analytics
[Diagram: an edge node packs ~100k files (variants 1..n) into sequence files with Forqlift and pushes them into the Hadoop Data Lake via HDFS put; a Pig EMR processor with a UDF library and Python wrapper produces Provider and Member tables plus 15 more entities (Parquet); Sqoop loads further dimensions and facts into a Netezza DW.]
• Receive Electronic Medical Records from various providers in various formats
• Address Hadoop ‘small file’ problem
• No barrier for onboarding and analysis of new data
• Blend new data with Data Lake and Big Data Warehouse
• Machine Learning
• Text Analytics
• Natural Language Processing
• Reporting
• Ad-hoc queries
• File ingestion
• Information Lifecycle Mgmt
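Addressing the 'small file' problem generally means packing many tiny files into a few large containers (the deck uses Forqlift to build SequenceFiles); a sketch of the packing idea in plain Python, with a tar archive standing in for a SequenceFile and hypothetical file contents:

```python
import io
import tarfile

# Sketch of the Hadoop 'small file' fix: pack many small files into one
# container (SequenceFiles via Forqlift in the deck; a tar archive here
# as a stand-in) so HDFS stores a few large objects, not 100k tiny ones.

small_files = {f"record_{i}.xml": f"<emr id='{i}'/>".encode() for i in range(3)}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as archive:
    for name, data in small_files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

# One archive now holds all the small files, keyed by original name
with tarfile.open(fileobj=io.BytesIO(buf.getvalue())) as archive:
    names = archive.getnames()
print(names)  # ['record_0.xml', 'record_1.xml', 'record_2.xml']
```

As with SequenceFiles, the original filename survives as the key, so downstream Pig/UDF processing can still address each record individually while the NameNode tracks only one object.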
Data Landscape for Digital Music
[Diagram: data providers feed a landing queue — near real-time and batch — into the Data Lake, BDW, and data science clusters; the wider ecosystem includes an EDW, a graph database, RDS, a metastore, and a Data Science API.]