Advanced Analytics and Machine Learning with Data Virtualization
Alex Hoehl, Head of Business Development, APAC
Agenda
1. What are Advanced Analytics?
2. The Data Challenge
3. The Rise of Logical Data Architectures
4. Tackling the Data Pipeline Problem
5. Customer story
6. Key takeaways
7. Q&A
87% of data science projects never make it into production.
- VentureBeat AI, July 2019
AI and Machine Learning Need Data
Predicting high-risk patients: Data includes patient demographics, family history, patient vitals, lab test results, past medication history, hospital visits, and any claims data.
Predicting equipment failure: Data may include maintenance logs kept by the technicians, especially for older machines. For newer machines, it may include data coming in from the machine's different sensors, including temperature, running time, power-level durations, and error messages.
Predicting default risks: Data includes company or individual demographics, products they purchased or used, past payment history, customer support logs, and any recent adverse events.
Preventing fraudulent claims: Data includes the location where the claim originated, the time of day, claimant history, the claim amount, and even public data such as the National Fraud Database.
Predicting customer churn: Data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
The Scale of the Problem…
Confirmation of the Constraints on ML/AI…
Source: Machine learning in UK financial services, Bank of England
and Financial Conduct Authority, October 2019
Tackling the Data Pipeline Problem
Typical data science workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify useful data
▪ Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
▪ Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source:
http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
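The iterative loop in steps 2 to 6 can be sketched in Python. Everything below is an illustrative placeholder (toy data, a volume-based "score" standing in for a real model), not a real pipeline:

```python
# Minimal sketch of the iterative workflow above: steps 2-6 repeat
# until the results are good enough. All names are illustrative
# placeholders, not a real API.

def identify_and_ingest(iteration):
    # Step 2: each pass widens the data that is pulled in.
    return list(range(10 * (iteration + 1)))

def cleanse(raw):
    # Step 3: drop unusable records (odd values stand in for bad rows).
    return [x for x in raw if x % 2 == 0]

def prepare_and_execute(clean):
    # Steps 5-6: train/score a model; here a toy score based on volume.
    return min(1.0, len(clean) / 20)

def run_workflow(target_score=0.9, max_iterations=10):
    score = 0.0
    for i in range(max_iterations):
        data = identify_and_ingest(i)       # step 2
        clean = cleanse(data)               # step 3 (step 4 omitted)
        score = prepare_and_execute(clean)  # steps 5-6
        if score >= target_score:           # iterate 2-6 until insight
            return i + 1, score
    return max_iterations, score

iterations, score = run_workflow()
```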
Where does your time go?
80% of time – Finding and preparing the data
10% of time – Analysis
10% of time – Visualizing data
Where does your time go?
A large amount of time and effort goes into tasks not intrinsically related to data
science:
• Finding where the right data may be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technology (NoSQL, REST APIs, etc.)
• Transforming data into an easy-to-work-with format
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
Logical Data Integration: the Path to the Future
"Adopt the Logical Data Warehouse Architecture to Meet Your
Modern Analytical Needs", Henry Cook, Gartner, April 2018
Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018
“When designed properly, Data Virtualization can speed data integration, lower data
latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources.
Due to its many benefits, Data Virtualization is often the first step for organizations
evolving a traditional, repository-style data warehouse into a Logical Architecture”
Data scientist workflow
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)
Identify useful data
If the companyhasavirtual layer withagoodcoverage
of datasources,this taskisgreatlysimplified
• Adata virtualization tool like Denodocanofferunified
accessto all data available in thecompany
• It abstracts the technologiesunderneath,offering a
standardSQLinterface to query andmanipulate
Tofurther simplify the challenge, Denodooffers aData
Catalogto search,find andexplore yourdataassets
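Because the virtual layer speaks standard SQL, identifying and combining data reduces to ordinary queries. The sketch below uses Python's built-in SQLite as a local stand-in for the virtual layer's SQL interface (in practice a data scientist would connect to the Denodo server over JDBC/ODBC), and the table and column names are invented for illustration:

```python
import sqlite3

# SQLite stands in for the unified SQL layer so the example runs
# locally. The tables mimic two different physical sources (a CRM
# system and a billing system) exposed behind one interface.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_customers (id INTEGER, name TEXT, industry TEXT);
    CREATE TABLE billing_payments (customer_id INTEGER, amount REAL);
    INSERT INTO crm_customers VALUES (1, 'Acme', 'Logistics'), (2, 'Globex', 'Retail');
    INSERT INTO billing_payments VALUES (1, 120.0), (1, 80.0), (2, 50.0);
""")

# One SELECT joins and aggregates across what would be two distinct
# systems, with no source-specific access code.
rows = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total_paid
    FROM crm_customers c
    JOIN billing_payments p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The point is that a single standard query replaces per-source access code; the virtual layer handles the underlying technologies.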
Ingestion and data manipulation tasks
• Data virtualization offers the unique opportunity of using an easy-to-use graphical UI and standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data
• Cleansing and transformation steps can be easily accomplished in SQL
• Its modeling capabilities enable the definition of views that embed this logic to foster reusability
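A minimal sketch of embedding cleansing logic in a reusable view, again with SQLite standing in for the virtualization layer and with invented table and column names; Denodo views would be defined through its own modeling tools:

```python
import sqlite3

# SQLite stands in for the virtual layer so the example runs locally.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_readings (sensor TEXT, temp_c REAL);
    INSERT INTO raw_readings VALUES
        ('a', 21.5), ('a', NULL), ('b', -999.0), ('b', 19.0);

    -- The cleansing rules (drop NULLs and the -999 error sentinel)
    -- live in one view that every downstream consumer reuses.
    CREATE VIEW clean_readings AS
        SELECT sensor, temp_c
        FROM raw_readings
        WHERE temp_c IS NOT NULL AND temp_c <> -999.0;
""")

rows = conn.execute(
    "SELECT sensor, temp_c FROM clean_readings ORDER BY sensor, temp_c"
).fetchall()
```

Defining the cleansing once in a view, rather than in each analyst's script, is what fosters the reusability the bullet above refers to.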
Prologis launches data analytics program for cost optimization
Prologis is the largest industrial real estate company in the world, serving 5,000 customers in over 20 countries, with USD 87 billion in assets under management.
Background
• Create a single governed data access layer to produce reusable and consistent analytical assets that could be used by the rest of the business teams to run their own analytics.
• Save time for data scientists in finding, transforming and analysing data sets, without having to learn new skills, and create data models that could be refreshed on demand.
• Efficiently maintain its new data architecture with minimum downtime and configuration management.
Prologis – Data Science Workflow
Step 1: Expose Data to Data Scientists
Prologis – Data Science Workflow
Step 2: Operationalization of Model Scoring
The trained Python model is exposed for scoring as a web service running on AWS Lambda.
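A minimal sketch of such a scoring web service as an AWS Lambda handler. The "model" here is a stand-in weighted sum and the feature names are invented; a real deployment would load the trained Python model from the function's package:

```python
import json

# Stand-in model weights; a real function would load a trained model.
WEIGHTS = {"past_payment_delays": 0.5, "support_tickets": 0.25}

def score(features):
    # Weighted sum clipped to [0, 1], standing in for model.predict().
    raw = sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return max(0.0, min(1.0, raw))

def lambda_handler(event, context):
    # AWS Lambda entry point: parse the request body, score, respond.
    features = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score(features)}),
    }

# Local invocation with a fake event (no AWS needed for this sketch).
event = {"body": json.dumps({"past_payment_delays": 1.0, "support_tickets": 1.0})}
response = lambda_handler(event, None)
```

Wrapping scoring in a stateless handler like this is what makes the operationalization step scalable: the model is invoked per request without a dedicated server to maintain.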
Data virtualization benefits experienced by Prologis
• The analytics team was able to create business-focused subject areas with consistent data sets, delivering a 30% faster speed to analytics.
• Denodo made it possible for Prologis to quick-start advanced analytics projects.
• Stable and scalable operationalization of their data science project.
• Deploying the Denodo Platform was as easy as a click of a button, with centralized configuration management. This simplified Prologis's data architecture and also helped bring down the overall maintenance cost.
Data virtualization benefits for AI and machine learning projects
✓ The Denodo Platform makes all kinds of data – from a variety of data
sources – readily available to your data analysts and data scientists
✓ Can leverage Big Data technologies like Spark (as a data source, an
ingestion tool and for external processing) to efficiently work with
large data volumes
✓ Data virtualization shortens the ‘data wrangling’ phases of analytics/ML
projects
✓ Avoids needing to write ‘data prep’ scripts in Python, R, etc.
✓ Provides a modern “SQL-on-Anything” engine
✓ Extends and integrates with the capabilities of notebooks, Python, R, etc. to
improve the toolset of the data scientist
✓ New and expanded tools for data scientists and citizen analysts: “Apache
Zeppelin for Denodo” Notebook
Denodo DV can accelerate Data Science Projects during
Modelling and Operations
Q&A
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.
