Building a modern data architecture
March 31, 2016
Ben Sharma | CEO and Founder
ben@zaloni.com
Delivering on the business of big data
•  Award-winning provider of enterprise data lake management solutions:
§  Integrated data lake management platform
§  Self-service data preparation
•  Data Lake Design and Implementation Services
•  Data Science Professional Services
Funded by top-tier technology investors:
Zaloni Proprietary
Data lakes will be central to the modern data architecture
Agility Insight Scalability
The promise of a data lake: All data is welcome…
•  Store all types of data: structured and unstructured
•  Store raw data in its original form for an extended period of time
•  Use various tools to correlate, enrich and query the data for insights
•  Provide democratized access via a single unified view across the enterprise
Data architecture modernization
Traditional: Sources, ETL, EDW; consumed through Data Discovery, Analytics, BI
New: Various Sources, Streaming, Unstructured Data; Data Lake with Derived (Transformed) data and a Discovery Sandbox; EDW; consumed through Data Discovery, Analytics, BI, Data Science
Data lake challenges and complications
•  Ingestion
•  Lack of Visibility
•  Privacy and Compliance
•  Quality Issues
•  Reliance on IT
•  Reusability
•  Rate of Change
•  Skills Gap
•  Complexity
Building: Managing: Delivering:
Engage the business
• Discover
• Enrich
• Provision
Govern the data in the lake
• Cleanse
• Secure
• Operationalize
Enable the data lake
• Ingest
• Organize
• Catalog
Data lake reference architecture
Diagram labels:
•  Source System: File Data, DB Data, ETL Extracts, Streaming, OLTP or ODS, Enterprise Data Warehouse, Logs (or other unstructured data), Cloud Services
•  Data Lake zones: Transient Loading Zone, Raw Data, Trusted Data, Refined Data, Discovery Sandbox, Consumption Zone
•  Zone annotations: Original unaltered data attributes, Tokenized Data, Data Validation, Data Cleansing, Reference Data, Master Data, Integrate to common format, Aggregations, Data Wrangling, Data Discovery, Exploratory Analytics, APIs
•  Supporting capabilities: Metadata, Data Quality, Data Catalog, Security
•  Consumers: Business Analysts, Researchers, Data Scientists
Data lake management platform
Unified Data Management
Managed Ingestion
Data Reliability
Data Visibility
Data Security and Privacy
Integrated Data Lake Management
First things first… managed ingestion
•  Ability to ingest vast amounts of data
•  Ability to handle a wide variety of formats (streaming, files, custom)
•  Ability to handle a wide variety of sources
•  Capture operational metadata implicitly as new data arrives
•  Build in repeatability through automation to pick up incoming data and apply pre-defined processing
Diagram labels: Various Sources, Streaming, Unstructured Data
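The managed ingestion points above can be illustrated with a minimal sketch. It assumes a file-based landing directory and newline-delimited records; the function names `ingest_file` and `ingest_all`, the sidecar `.meta.json` convention and the metadata field names are hypothetical choices for illustration, not part of any Zaloni product.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def ingest_file(src: Path, raw_dir: Path) -> Dict[str, object]:
    """Copy one landed file into the raw zone and record operational metadata."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / src.name
    shutil.copy2(src, target)  # raw data is stored unaltered

    data = target.read_bytes()
    metadata = {
        "source": str(src),
        "target": str(target),
        "size_bytes": len(data),
        # Assumption: newline-delimited records; other formats need their own counter.
        "record_count": len(data.splitlines()),
        "sha256": hashlib.sha256(data).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # Hypothetical convention: a sidecar file keeps metadata next to the data.
    sidecar = raw_dir / (src.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return metadata


def ingest_all(landing_dir: Path, raw_dir: Path) -> List[Dict[str, object]]:
    """Pick up every landed file that has not been ingested yet."""
    results = []
    for src in sorted(landing_dir.iterdir()):
        if src.is_file() and not (raw_dir / src.name).exists():
            results.append(ingest_file(src, raw_dir))
    return results
```

Running `ingest_all` on a schedule gives the repeatability the slide describes: files that are already in the raw directory are skipped, and metadata is captured as a side effect of ingestion rather than as a separate manual step.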
Capture metadata to improve data visibility and reliability
•  Reduced time to insight for analytics
•  File and record level watermarking provides data lineage
Type of Metadata | Description | Example
Technical | Captures the form and structure of each data set | Type of data (text, JSON, Avro), structure of the data (fields and their types)
Operational | Captures lineage, quality, profile and provenance of the data | Source and target locations of data, size, number of records, lineage
Business | Captures what it all means to the user | Business names, descriptions, tags, quality and masking rules
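As a rough illustration of the three metadata types, the sketch below builds a single catalog entry for a CSV data set. The `CatalogEntry` class, the `describe_csv` helper, the type-inference rule and the sample claims data are all assumptions made for this example; a real catalog would carry far richer detail.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Return 'int', 'float' or 'string' for a column of raw text values."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except (ValueError, TypeError):
            continue
    return "string"


@dataclass
class CatalogEntry:
    technical: Dict[str, str]          # field name -> inferred type
    operational: Dict[str, object]     # source, size, record count
    business: Dict[str, object] = field(default_factory=dict)  # names, tags, rules


def describe_csv(text: str, source: str) -> CatalogEntry:
    """Derive technical and operational metadata from CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    technical = {c: infer_type([r[c] for r in rows]) for c in columns}
    operational = {
        "source": source,
        "record_count": len(rows),
        "size_bytes": len(text.encode("utf-8")),
    }
    return CatalogEntry(technical, operational)


# Hypothetical sample data; business metadata is added by a person, not inferred.
entry = describe_csv(
    "member_id,claim_amount,state\n1001,250.75,NC\n1002,80,SC\n",
    source="claims.csv",
)
entry.business = {
    "name": "Member claims",
    "tags": ["claims"],
    "masking_rules": {"member_id": "tokenize"},
}
```

Technical and operational metadata can be derived automatically from the data, while business metadata has to be supplied by someone who knows what the data means, which is why it is assigned separately here.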
Data ready: Data preparation required for actionable data
•  Interactive data preparation to address errors, corrupted formats, duplicates
•  Data enrichment to go from raw to refined
•  Self service to prepare data without an IT request or SQL knowledge
Diagram labels: Raw Data; Data Preparation (Explore, Transform, Reusable Transformations, Automation, Orchestrate and automate workflows); Refined Data; BI Reports, Enterprise Data Integrations, Data Science, Data Discovery, Analytics
Diagram derived from Gartner report on Self Service Data Preparation
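The idea of reusable transformations that take data from raw to refined can be sketched as a list of small steps applied in order. Everything named here (`strip_whitespace`, `normalize_date`, `drop_duplicates`, `run_pipeline`, the accepted date formats) is a hypothetical example rather than the tooling shown on the slide.

```python
from datetime import datetime
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def strip_whitespace(rows: List[Record]) -> List[Record]:
    """Fix a common formatting error: stray spaces around values."""
    return [{k: v.strip() for k, v in r.items()} for r in rows]


def normalize_date(name: str, formats: Sequence[str] = ("%Y-%m-%d", "%m/%d/%Y")) -> Step:
    """Build a step that rewrites a date field as ISO 8601."""
    def step(rows: List[Record]) -> List[Record]:
        out = []
        for r in rows:
            for fmt in formats:
                try:
                    parsed = datetime.strptime(r[name], fmt)
                except ValueError:
                    continue
                out.append({**r, name: parsed.strftime("%Y-%m-%d")})
                break
            # Rows whose date matches no known format are treated as corrupted and dropped.
        return out
    return step


def drop_duplicates(rows: List[Record]) -> List[Record]:
    """Keep the first occurrence of each identical record."""
    seen = set()
    out = []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def run_pipeline(rows: List[Record], steps: Sequence[Step]) -> List[Record]:
    """Apply each transformation in order; the same step list can be reused on new data."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [
    {"id": " 1 ", "date": "2016-03-31"},
    {"id": "1", "date": "03/31/2016"},
    {"id": "2", "date": "31 March"},
]
refined = run_pipeline(raw, [strip_whitespace, normalize_date("date"), drop_duplicates])
```

Because each step is an ordinary function over a list of records, a step list defined once interactively can later be scheduled and rerun unchanged, which is the link between self-service preparation and automation.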
Protect sensitive data
•  Data lakes enable multiple groups to share access to centrally stored data
•  Differing permissions require enhanced data security
§  Mask or tokenize data before it is published in the lake for consumption
§  Policy-based security
•  Metadata management enables audit and traceability
•  End result: more open and democratized access to data in the lake for those with permission
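A minimal sketch of masking and tokenizing under a per-field policy follows. It assumes a keyed hash (HMAC-SHA256, truncated) as the token, which is one-way; production tokenization often relies on a token vault or format-preserving encryption instead. The policy vocabulary (`tokenize`, `mask`, `drop`, `keep`) and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: equal inputs give equal tokens, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def apply_policy(record: Dict[str, str], policy: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Return a copy of the record with each field treated as the policy dictates."""
    out = {}
    for name, value in record.items():
        action = policy.get(name, "keep")  # fields without a rule pass through
        if action == "tokenize":
            out[name] = tokenize(value, key)
        elif action == "mask":
            out[name] = mask(value)
        elif action == "drop":
            continue
        else:
            out[name] = value
    return out


# Hypothetical policy and record, for illustration only.
policy = {"member_id": "tokenize", "card_number": "mask", "notes": "drop"}
record = {"member_id": "1001", "card_number": "4111111111111111", "notes": "call back", "state": "NC"}
published = apply_policy(record, policy, key=b"demo-key-not-for-production")
```

Keeping the policy as data rather than code means the same rules could be stored as business metadata alongside the data set and applied before anything is published for consumption.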
Discover, Enrich, Provision
Self Service Data Preparation for Analytics: Catalog, Wrangling, Collaboration
•  See what data is available across your enterprise
•  Blend data in the lake without a costly IT project
•  Perform interactive data-driven transformations
•  Collaborate and share data assets and transformations with peers
EXPLORE PREPARE OPERATIONALIZE
Catalog with KPIs
“Ground to Cloud” hybrid architectures
•  Seeing a rapid increase of big data in the cloud
•  Leverage cloud platforms as complementary to on-premises
•  Support sensitive data on premises and external data in the cloud (e.g. client data, machine-generated)
Key data challenges for hybrid environments:
•  VISIBILITY: Need enterprise-wide data catalog (logical data lake)
•  GOVERNANCE: Need consistent data governance requirements for hybrid platforms
Manage the complete data pipeline
•  INGEST: Manage data ingestion so you know what is in your Hadoop data lake
•  ORGANIZE: Define and capture metadata for ease of searching and browsing
•  ENRICH: Orchestrate and manage the data preparation process
•  ENGAGE: Data visibility and self-service data preparation
Network Data Lake architecture
Diagram labels:
•  Network Data: CDR, DPI, IPFIX, SNMP, RADIUS
•  Enterprise Data: CRM, Billing, Inventory
•  Enterprise Data Warehouse
•  Network Data Lake: Manage Ingestion; Manage Metadata; Manage, Monitor, Schedule
•  Components: Operations and Metadata Store, Data Quality & Rules Engine, Transformation Engine, Workflow Executor
•  Consumption: BI Tools, Custom Apps, Data Warehouse
•  Custom Applications: Subscriber Usage; Network Usage Exploration & Ad-hoc Analytics
Managed data lake for healthcare payers
Diagram labels:
•  Data Sources: Relational, Streaming, Files
•  Data Sets: Claims, EMR, Lab/Pathology, Pharmacy, Member, Social, Enterprise Data
•  Edge Node: Batch Ingestion, Streaming Ingestion, Change Data Capture
•  Data Lake Management: Configure Ingestion; Administer Metadata; Manage, Monitor, Schedule
•  Components: Operations and Metadata Store, Data Quality & Rules Engine, Transformation Engine, Workflow Executor
•  Consumers: Analytical Applications, Enterprise Data Warehouse
•  Applications: HEDIS Reporting, Bundle Payments, Medical Benefits Management, Scorecards, Enterprise Reports
Data Lake for BCBS239 Compliance (RDARR)
Diagram labels:
•  Source Systems: RDBMS, Mainframes, Flat files, Binary files
•  Data Acquisition Automation: Data Ingestion, Data Quality and Validation, Layout Standardization, Operational Metadata Generation
•  Data at Rest
•  Metadata: Metadata repositories, Metadata Management solution, Register/update metadata, Extract/read metadata
•  Automated Data Acquisition Framework providing timeliness of data
•  Capture Metadata in all phases: Ingestion, Transformation
•  Integration with Enterprise Metadata Management
•  Integrated Data Quality Analysis
Getting Started
•  1 Questions: Business drivers and business questions (Where is fraud occurring? How to optimize inventory?)
•  2 Inputs: Data, Use Cases, Platform (diagram labels: Subject areas; Source system; Capabilities, Process; Ingest, Organize, Enrich, Explore)
•  3 Outcomes: Roadmap, Prototype, Analytics Strategy
Stop by booth #1335
and ask for a copy of
our new book and a
free t-shirt!
DON’T GO IN THE DATA
LAKE WITHOUT US

Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World San Jose 2016
