DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
Advanced Analytics and Machine Learning
with Data Virtualization
Senthil Ramesh J V
Sales Engineer, Denodo
Paul Moxon
SVP Data Architectures & Chief Evangelist, Denodo
The Economist, May 2017
The world’s most valuable resource
is no longer oil, but data.
Data – Like Oil – Is Not Easy To Extract and Use
Forbes Insights – AI Adoption Survey
Source: Forbes Insights Survey – The AI Learning Curve, By The Numbers
72%
Forbes Insights – AI Adoption Survey
Source: Forbes Insights Survey – The AI Learning Curve, By The Numbers
AI and Machine Learning Need Data
Predicting high-risk patients – Data includes patient demographics, family history, patient vitals, lab test results, past medication history, visits to the hospital, and any claims data.
Predicting equipment failure – Data may include maintenance logs kept by technicians, especially for older machines. For newer machines, data comes from the machine’s different sensors, including temperature, running time, power level durations, and error messages.
Predicting default risks – Data includes company or individual demographics, products purchased or used, past payment history, customer support logs, and any recent adverse events.
Preventing fraudulent claims – Data includes the location where the claim originated, time of day, claimant history, claim amount, and even public data such as the National Fraud Database.
Predicting customer churn – Data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
But the Data is Somewhere in Here…
Confirmation of the Constraints on ML/AI…
Source: Machine learning in UK financial services, Bank of England
and Financial Conduct Authority, October 2019
The Scale of the Problem…
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify data useful for the case
• Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
• Iterate 2-6 until valuable insights are
produced
7. Visualize and share
Typical Data Science Workflow
80% of time – Finding and preparing the data
10% of time – Analysis
10% of time – Visualizing data
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data might be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
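Much of that time goes into hand-written preparation scripts. A minimal pandas illustration of the profiling and cleansing steps (the column names and values here are invented for the example, not data from the webinar):

```python
import pandas as pd

# Invented sample with typical quality problems:
# inconsistent casing, missing values, duplicate rows.
df = pd.DataFrame({
    "station": ["Central Park", "central park", None, "Times Sq"],
    "trips":   [120, 120, 85, None],
})

# Profile: fraction of missing values per column.
print(df.isna().mean())

# Cleanse: normalize text, then drop incomplete and duplicate records.
df["station"] = df["station"].str.strip().str.title()
clean = df.dropna().drop_duplicates()
print(len(clean))  # → 1 (the two "Central Park" rows collapse to one)
```

Scripts like this tend to be rewritten per source and per project, which is exactly the overhead the deck argues a virtual layer removes.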
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry
Cook, Gartner April 2018
Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical
Needs, May 2018
“When designed properly, Data Virtualization can speed data
integration, lower data latency, offer flexibility and reuse, and reduce
data sprawl across dispersed data sources.
Due to its many benefits, Data Virtualization is often the first step for
organizations evolving a traditional, repository-style data warehouse
into a Logical Architecture”
Benefits of a Virtual Data Layer
 A Virtual Layer improves decision making and shortens development cycles
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
 A Virtual Layer broadens usage of data
• Improves governance and metadata management to avoid “data swamps”
• Decouples data source technology; access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
 A Virtual Layer offers performant access
• Leverages the processing power of the existing sources, coordinated by Denodo’s optimizer
• Processes data itself for sources with no processing capabilities (e.g. files)
• Caching and ingestion engine to persist data when needed
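Because access is normalized via SQL, a script or notebook can treat the virtual layer as one relational database over a standard ODBC/JDBC connection. A minimal Python sketch (the DSN, credentials, and the bike_trips view are hypothetical placeholders, not a real deployment):

```python
# One SQL query, regardless of where the underlying data actually lives
# (warehouse tables, REST APIs, flat files, ...).
SQL = """
SELECT station, COUNT(*) AS trips
FROM bike_trips      -- a virtual view combining several sources
GROUP BY station
"""

def trips_per_station(dsn="DSN=denodo_vdp;UID=analyst;PWD=changeme"):
    """Query the virtual layer as if it were a single database."""
    import pyodbc  # any standard ODBC client works; imported lazily here
    with pyodbc.connect(dsn) as conn:
        return conn.cursor().execute(SQL).fetchall()
```

From Jupyter or Zeppelin the same query works unchanged whichever source technology sits behind the view.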
Data Scientist Workflow Steps
Identify useful
data
Modify data into
a useful format
Analyze data Execute data
science algorithms
(ML, AI, etc.)
Share with
business users
Prepare for
ML algorithm
Product Demonstration:
Advanced Analytics and Machine Learning
with Data Virtualization
Senthil Ramesh J V
Sales Engineer, Denodo
https://flic.kr/p/x8HgrF
Can we predict the usage of the NYC
bike system based on data from
previous years?
Data Sources – Citibike
There are external factors to
consider.
Which ones?
https://flic.kr/p/CYT7SS
Data Sources – NWS Weather Data
What We’re Going To Do…
1. Connect to data and have a look (Find Data)
2. Format the data (prep it) so that we can look for significant factors (Explore Data)
• e.g. bike trips on different days of week, different months of year, etc.
3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm (Prepare the Data)
4. Using Python, read the 2017 data and run it through our ML algorithm for training (Train the Model)
5. Read the 2018 data, test the algorithm (Test the Model)
6. Save the results and load them into the Denodo Platform (Save the Results)
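Steps 4 and 5 boil down to a standard train-on-one-year, test-on-the-next loop. A sketch with scikit-learn, using synthetic data as a stand-in for the Citi Bike trip counts (the features and the linear model are illustrative assumptions, not the demo’s actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_year(n_days=365):
    # Features: day of year and a seasonal temperature-like signal;
    # target: daily trips driven by temperature plus noise.
    day = np.arange(n_days)
    temp = 15 + 10 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 2, n_days)
    trips = 500 + 40 * temp + rng.normal(0, 100, n_days)
    return np.column_stack([day, temp]), trips

X_train, y_train = make_year()   # stands in for the 2017 data
X_test, y_test = make_year()     # stands in for the held-out 2018 data

model = LinearRegression().fit(X_train, y_train)   # step 4: train
r2 = model.score(X_test, y_test)                   # step 5: test
print(f"R^2 on the held-out year: {r2:.2f}")
```

Testing on a year the model never saw is what makes the accuracy estimate honest; in the demo the 2018 data plays that role.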
Demo
Prologis – Operationalizing AI/ML
$1.5 TRILLION is the economic value of goods flowing through our distribution centers each year, representing:
• 2.8% of GDP for the 19 countries where we do business
• 2.0% of the world’s GDP
Founded in 1983 | Global 100 most sustainable corporations | 768 MSF
$87B assets under management on four continents
1.0 million employees under Prologis’ roofs
Prologis – Data Science Workflow
Step 1: Expose Data to Data Scientists
Prologis – Data Science Workflow
Step 2: Operationalization of Model Scoring – a Python model-scoring web service running on AWS Lambda
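Operationalizing model scoring on AWS Lambda typically reduces to a small handler wrapped around a model loaded at cold start. A minimal sketch (the feature names and the scoring rule are hypothetical stand-ins, not Prologis’ actual model):

```python
import json

def _score(features):
    # Stand-in for a real model loaded once at cold start
    # (in practice: deserialized from a bundled file or S3 object).
    return 0.3 * features["utilization"] + 0.7 * features["demand_index"]

def handler(event, context):
    """Lambda entry point: JSON body of features in, score out,
    in the response shape API Gateway expects."""
    features = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"score": _score(features)}),
    }
```

Keeping the model outside the handler function means one load per container rather than one per request, which is what makes Lambda practical for scoring.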
Key Takeaways
 The Denodo Platform makes all kinds of data – from a variety of
data sources – readily available to your data analysts and data
scientists
 Data virtualization shortens the ‘data wrangling’ phases of
analytics/ML projects
 Avoids needing to write ‘data prep’ scripts in Python, R, etc.
 It’s easy to access and analyze the data from analytics tools such as
Zeppelin or Jupyter
 You can use the Denodo Platform to share the results of your
analytics with others
 Finally…People don’t like to ride their bikes in the snow
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY
Thank you!
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
