Informatica Intelligent Data Lake
Self Service for Data Analysts
February 2017
Sören Eickhoff
Sales Consultant Central Europe
SEickhoff@informatica.com
Data Security
Cloud Data
Management
Big Data
Management
Data
Integration
Master Data
Management
Data Quality
#1 in 6 Data Categories …
Data Platform
Data Lake
Use Case: Data Lake / Data Platform Reference Architecture
Landing Zone
Structured and unstructured enterprise and external data is landed in its raw form,
normalized and ready for use
Data Analyst, Data Scientist, Business Data Steward, Data Modeler, Data Engineer
Discovery Zone
User sandbox for self-serve access to data for exploration, data blending, hypothesis
testing, analytics, and collaboration
Production Zone
Sanitized transactional, master, and reference data & enriched data models certified for
enterprise use
Data sources: Machine/Device, Cloud; Documents and Emails; Relational, Mainframe; Social Media, Web Logs
Business outcomes: Improve Predictive Maintenance; Increase Operational Efficiency; Increase Customer Loyalty; Reduce Security Risk; Improve Fraud Detection
• Can’t easily find trusted data
• Limited access to the data
• Frustrated by slow response from IT
due to long backlog
• Constrained by disparate desktop
tools, manual steps
• No way to collaborate, share, and
update curated datasets
• Can’t cope with growing demand
from the business
• No visibility into what the business
is doing with the data
• Struggling to deliver value to the
business
• Losing the ability to govern and
manage data as an asset
Challenges Faced by the Business and IT Today
Data Analysts / IT
Informatica Data Lake Management
Data Lake Management
Enterprise
Information
Catalog
Intelligent
Data Lake
Secure@Source
TITAN
Blaze
Big Data
Management
Intelligent
Streaming
Live Data Map
(metadata integration)
Big Data Management
(data integration)
Data Architect /
Steward
Data Scientist /
Analyst
InfoSec Analyst Data Engineer
Unified view into enterprise information assets
• Business-user oriented solution
• Semantic search with dynamic facets
• Detailed Lineage and Impact Analysis
• Business Glossary Integration
• Relationships discovery
• High level data profiling
• Automatic Classifications with Data domains
• Business classifications with Custom Attributes
• Broad metadata source connectivity
• Big data scale
Enterprise Information Catalog
Self-service data preparation with collaborative data governance
• Collaborative project workspaces
• Automated data ingestion
• Search data asset catalog
• Rapid blend of datasets
• Crowd-sourced data asset tagging & data
sharing
• Automated data asset discovery &
recommendations
• Rapid ‘industrialization’ of preparation
steps into re-usable workflows
• Complete tracking of usage, lineage, and
security
• Easily support Data Discovery Platforms
Intelligent Data Lake
Enterprise-wide visibility into sensitive data risks
• Sensitive data classification & discovery
• Sensitive data proliferation analysis
• Who has access to sensitive data
• User activity on sensitive data
• Sensitive Data policy-based alerting
• Multi-factor risk scoring
• Identification of highest risk areas
• Integrates data security information from 3rd parties:
- Data stores, owner, classification
- Protection status
- User access info (LDAP, IAM) and activity logs
(DB, Hadoop, Salesforce, DAM)
Secure@Source
Easily integrate more data faster from more data sources
Big Data Management
Informatica Big Data Management – Smart Executor
ETL/DI Servers: Informatica Data Transformation Engine on dedicated DI servers
Capabilities: Data Connectivity, Data Integration, Data Masking, Data Quality, Data Governance
Hadoop Cluster: YARN, HDFS, with cluster-aware execution engines – MapReduce, Hive on MapReduce, Hive on Tez, Tez, Spark, Blaze
• Visual development interface accelerates
developer productivity
• Near universal data connectivity
• Complex data parsing on Hadoop
• Data profiling on Hadoop
• High-speed data ingestion and extraction
• Process and deliver data at scale on
Hadoop
• Dynamic schemas and mapping
templates
• Data Quality and Data Governance on
Hadoop
Take Big Data Management to the Next Level
Improve developer productivity – Dynamic Mappings, re-use of PowerCenter & SQL logic
Automatically benefit from new technologies and choose the best option – Smart Optimizer
MapReduce
Spark
Blaze
Generic source → Rule-based logic → Generic target
Informatica Intelligent Streaming
• Streaming analytics capability
into the Intelligent Data Platform
• Unified UI with multiple engines
underneath the covers
• Frictionless conversion/extension
of batch mappings into a
streaming context
• Abstracted from runtime
framework
Collect, ingest, and process data in real time and streaming modes
Realtime source → Window transformation → Realtime target
(Spark Streaming code generated)
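Conceptually, a window transformation buckets an unbounded stream into finite groups before aggregating. As a rough illustration (a plain-Python stand-in, not the Spark Streaming code the product generates; the event names are hypothetical), a tumbling-window count might look like:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Bucket (timestamp, key) events into fixed, non-overlapping windows
    and count occurrences per key in each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)   # align to window boundary
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(0, "sensor_a"), (3, "sensor_a"), (7, "sensor_b"), (11, "sensor_a")]
print(tumbling_window_counts(events, 10))
# {0: {'sensor_a': 2, 'sensor_b': 1}, 10: {'sensor_a': 1}}
```

A streaming engine does the same grouping incrementally as events arrive, rather than over a finished list.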
Intelligent Data Lake – Deep Dive
Data
Analyst / Scientist
Who?
Prepare & Publish
Search & Discover
Share and Collaborate
Intelligent Data Lake
How?
Applications &
Databases
Internet of Things
3rd Party Data
Data Modeling Tools, BI Tools, Cloud, Custom
Data Access & Metadata Connectivity
Intelligent Metadata Foundation: Catalog, Classify, Index
Data Lineage, Data Relationships, Smart Domains, Data Profile
Data Discovery & Analysis Process: Discover, Recommend, Prepare, Collaborate, Publish, Operationalize/Monitor
Data Analyst / Scientist
Intelligent Data Lake
Data Asset
- Data you work with as a unit
Project
- A project contains
data assets and worksheets.
Recipe
- The steps taken to prepare
data in a worksheet.
Data Publication
- The process of making prepared
data available in the data lake.
Data Preparation
- The process of combining, cleansing,
transforming, and structuring data from one
or more data assets so that it is ready
for analysis.
Terminology
Intelligent Data Lake
Search and Discovery
Data discovery through a powerful search engine to find relevant data
Semantic
search
Facet filtering by
asset, resource type,
latest, size, custom
attributes…
Data Asset Overview
Overview with asset attributes and integrated profiling stats
Asset attributes
collected from the
source system
Asset attributes
enriched by users to
add business context
Column profiling stats
including
Null/Unique/Duplicate
percentages, Inferred
data types and data
domains.
Details stats include
value and pattern
distributions
Add a data asset
to a project from
any exploration
view
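As a rough sketch of the kind of computation behind those profiling stats (a hypothetical helper, not Informatica code), per-column null/unique/duplicate percentages and a simple inferred type might be computed like this:

```python
from collections import Counter

def profile_column(values):
    """Compute the kinds of column stats shown in the asset overview:
    null, unique, and duplicate percentages plus an inferred data type."""
    n = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    unique = sum(1 for c in counts.values() if c == 1)   # values appearing exactly once
    duplicate = len(non_null) - unique                   # rows that share their value

    def looks_numeric(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    return {
        "null_pct": round(100 * (n - len(non_null)) / n, 1),
        "unique_pct": round(100 * unique / n, 1),
        "duplicate_pct": round(100 * duplicate / n, 1),
        "inferred_type": "numeric" if non_null and all(looks_numeric(v) for v in non_null) else "string",
    }

print(profile_column(["42", "17", None, "42"]))
# {'null_pct': 25.0, 'unique_pct': 25.0, 'duplicate_pct': 50.0, 'inferred_type': 'numeric'}
```

The product computes such stats at scale and also matches values against data domains; this sketch only shows the basic percentages.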
Business Glossary Integration
View Business
Glossary Assets
like Terms,
Policies and
Categories in the
Catalog
View and
navigate
to related
technical
and
business
assets in
the
catalog
Data Lineage
Interactively trace data origin through summarized lineage views for analysts
Use Lineage and Impact Sliders to drill down to
desired lineage levels on either side of the seed
object.
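The lineage and impact sliders are essentially a depth limit on a graph walk from the seed object. A minimal sketch (a hypothetical `lineage` helper in plain Python; asset names are made up) of depth-limited upstream/downstream traversal:

```python
from collections import deque

def lineage(graph, seed, direction, max_levels):
    """Depth-limited breadth-first walk of a lineage graph, mimicking the
    lineage/impact sliders. `graph` maps asset -> list of downstream assets."""
    if direction == "upstream":              # invert edges to walk toward origins
        inverted = {}
        for src, dsts in graph.items():
            for dst in dsts:
                inverted.setdefault(dst, []).append(src)
        graph = inverted
    seen, frontier = set(), deque([(seed, 0)])
    while frontier:
        node, level = frontier.popleft()
        if level == max_levels:              # slider position caps the walk
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, level + 1))
    return seen

g = {"raw_orders": ["stg_orders"],
     "stg_orders": ["orders_mart"],
     "orders_mart": ["sales_report"]}
print(lineage(g, "orders_mart", "upstream", 1))    # {'stg_orders'}
print(lineage(g, "orders_mart", "downstream", 2))  # {'sales_report'}
```

Moving a slider one notch corresponds to raising `max_levels` by one on that side of the seed.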
Relationship View
Shows ecosystem of the asset in the enterprise based on association to other assets
Get a 360 Degree View
of data asset using the
relationship view.
Includes related tables,
views, domains and
reports, users etc.
Ability to Zoom, find specific assets
in the view and filter by asset types
Expand relationship
circles to get more
details on relationship
types and objects.
Data Preparation continued…
Excel-like data preparation on sample data
New formula
definition with
type-ahead
Large number of
functions available
for all data types:
string, numeric,
date, statistical,
math, etc.
Advanced
functionality
such as Join,
Merge,
Aggregate,
Filter, Sort etc.
New values are
calculated and
shown right
away
Data Preparation continued…
Excel-like data preparation on sample data
Column
level
summary
Column value
distributions
Column level
Suggestions
Data
preparation
steps
captured as
“Recipe”
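A recipe is, in essence, an ordered list of recorded steps that can be replayed. A minimal illustration (a hypothetical `Recipe` class, not the product's implementation; the step functions are made up):

```python
class Recipe:
    """Records preparation steps applied to sample data so the same steps
    can be replayed later, e.g. on the full dataset."""
    def __init__(self):
        self.steps = []                      # ordered list of (label, function)

    def add(self, label, fn):
        self.steps.append((label, fn))
        return self                          # allow chaining

    def apply(self, rows):
        for _, fn in self.steps:
            rows = fn(rows)
        return rows

# Hypothetical steps: filter out non-positive amounts, derive a converted column.
recipe = (Recipe()
          .add("filter: amount > 0",
               lambda rows: [r for r in rows if r["amount"] > 0])
          .add("derive: amount_eur",
               lambda rows: [{**r, "amount_eur": round(r["amount"] * 0.93, 2)} for r in rows]))

print(recipe.apply([{"amount": 10}, {"amount": -2}]))
# [{'amount': 10, 'amount_eur': 9.3}]
```

Because only the steps are recorded, not the sample results, the same recipe can later be run against the full source.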
Data Publication
Execution of data preparation steps on actual data using an Informatica mapping
Publish the output of
data preparation steps
back to the lake
Recipe steps are
translated into
Informatica mapping
Informatica mapping is
handed over to BDM
platform for execution on
actual data sources
BDM platform uses
MapReduce, Blaze, or
Spark to execute the
mapping
Mapping is available to
the ETL specialists to
open in Informatica
Developer tool to
operationalize
User credentials are
used to access the
underlying database.
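The publication flow above can be sketched as: replay the recorded recipe steps against the full source and write the result back to the lake. A toy stand-in (plain Python; in the product the steps compile to an Informatica mapping executed on Hadoop, and the functions below are hypothetical):

```python
def publish(recipe_steps, read_full_source, write_to_lake):
    """Toy publication flow: apply the recorded recipe steps to the full
    dataset (not just the sample) and write the result back to the lake.
    A plain loop stands in for the BDM execution engine here."""
    rows = read_full_source()
    for step in recipe_steps:
        rows = step(rows)
    write_to_lake(rows)
    return len(rows)

published = []                                          # stand-in for a Hive table
n = publish(
    [lambda rows: [r for r in rows if r % 2 == 0]],     # hypothetical recipe step
    lambda: list(range(6)),                             # stand-in for reading the source
    published.extend,                                   # stand-in for the table write
)
print(n, published)
# 3 [0, 2, 4]
```

Separating the recorded steps from the read/write endpoints is what lets the same logic run on a sample in the browser and on full data in the cluster.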
Organizations need ONE solution that helps them…
Easily Find &
Catalog Data &
Discover
Relationships
Rapidly Prepare &
Share Data Exactly
When it is Needed
Get Instant Access to
Trusted & Secure
Data for Advanced
Analytics
Ingest, Cleanse, Integrate & protect data at scale
Forrester Wave™: Big Data Fabric, Q4 ’16
Questions ?

Editor's Notes

  • #3 If your customer thinks of Informatica as an ETL company, this is a chance to change their perception. We are the #1 leader in 6 important data categories: First, cloud data management – we have a full portfolio of data management services for all the major cloud ecosystems – either cloud only or hybrid Data integration – our bread and butter – we have been the best at it for a long time and we continue to set the bar Big Data Management – we are the leader in data management for Big Data platforms. We work closely with all the major Hadoop, NoSQL ecosystems and with all the latest Big Data technologies like Spark Master Data Management – we are the leader in MDM for customer data and any other data that is important to their business. Our secret sauce is our matching engine, ability to discover relationships, and scalability. We can do this on any data platform, either on-premise or in the Cloud. Data Quality – we are setting the bar in DQ. Whether it is for stand-alone initiatives like data governance or for embedding data quality into their business processes Data Security – we are pioneering a new approach in security. Security remains an unsolved problem, and we can address it at the data level
• #4 Most organizations are building out some version of a data lake or enterprise data hub concept. Really, they are looking to get all their data into one place for the next generation of analytics and to give everyone access to information. Data lakes are usually divided up into multiple types of zones.
• #6 To serve these market trends best, Informatica developed a Big Data solution that addresses each of the trends. The EIC module helps people understand the data they are looking at by providing context. The IDL module allows the business to be more self-service by providing self-service data preparation capabilities, yet also helps IT operationalize the data preparation steps at scale in a managed and governed way. Secure@Source gives insight into potential risks around privacy-sensitive data by showing where this data is located, how it is proliferated across the Data Lake (and surrounding applications), and what the associated risks are. Big Data Management helps customers ingest, parse, cleanse, integrate, and deliver big data at scale. Intelligent Streaming, finally, allows processes to use realtime and streaming data sources. All this functionality is built as part of the Intelligent Data Platform, where we try to use as many open source tools as possible, leveraging the power of the ecosystem. We use HBase to store different types of metadata and Titan as a graph database to store the relationship information between data assets. We use Spark (incl. Spark Streaming) and Blaze to process data at scale, Kafka as a high-speed data transfer mechanism, and finally Solr to index metadata so it can be searched using a Google-like search interface.
• #7 The Enterprise Information Catalog (EIC) application allows business users to quickly find all information around the collection of data assets in their data lake. Since EIC can leverage the metadata provided by Cloudera Navigator, we can even show Hive/Impala scripts and Pig scripts that are being used to process data.
• #8 Intelligent Data Lake provides capabilities that enable business users to do data preparation.
• #9 Secure@Source gives insight into sensitive data risks.
• #10 Secure@Source gives insight into sensitive data risks.
• #11 Dynamic Mappings: Build a template once and automate mapping execution for 1000s of sources with different schemas. The mapping self-adjusts dynamically to external schema changes and column characteristics, and can process flat files with changing column order (a,b,c or c,a,b) and number of columns. Re-use PowerCenter and SQL logic: Many customers have existing investments in traditional PowerCenter and/or SQL scripts. To allow re-use of these components, Informatica provides capabilities to migrate existing PowerCenter logic to run in Hadoop and to convert existing SQL code to Big Data mapping logic that can be executed at scale. Smart Optimizer: A built-in mapping optimizer that automatically tunes and re-arranges the mapping for high performance – early selection, early projection, mapping pruning, semi-join, join re-ordering, etc. – with automatic partitioning support based on statistics and other heuristics, and advanced full pushdown optimization support including data ship join.
  • #12 Intelligent streaming aims to bring the following capabilities into the Informatica Platform: Real-time data ingestion from streaming data sources Rule evaluation and event triggering on a real-time data stream Real-time Data Integration: complex transforms, lookups, joins etc. in real time
  • #15 Data Stewards are responsible for strategically managing data assets in the data lake and the enterprise ensuring high levels of data quality, integrity, availability, trustworthiness, and data security while emphasizing the business value of data. By building a catalog, classifying metadata and data definitions, maintaining technical and business rules and monitoring data quality, data stewards ensure data in the lake is consistent for use in the discovery zone and enterprise zone. As the inventory of technical and business metadata is established and data sets available, data architects must design robust scalable data lake architecture to meets the business goals of the marketing data lake.
  • #16 Before we dive into the demo, lets look at some terminology, I will be using these terms quite a bit in the demo: Data Lake A data lake is a centralized repository of large volumes of structured and unstructured data. A data lake can contain different types of data, including raw data, refined data, master data, transactional data, log file data, and machine data. In Intelligent Data Lake, the data lake is a Hadoop cluster. Data Asset A data asset is data that you work with as a unit. Data assets can include items such as a flat file, table, or view. A data asset can include data stored in or outside the data lake. Project A project is a container stores data assets and worksheets. Data Preparation The process of combining, cleansing, transforming, and structuring data from one or more data assets so that it is ready for analysis. Recipe A recipe includes the list of input sources and the steps taken to prepare data in a worksheet. Data Publication data publication is the process of making prepared data available in the data lake. When you publish prepared data, Intelligent Data Lake writes the transformed input source to a Hive table in the data lake. Other analysts can add the published data to their projects and create new data assets. Or analysts can use a third-party business intelligence or advanced analytic tool to run reports to further analyze the published data.