DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
Advanced Analytics and Machine Learning
with Data Virtualization
Senthil Ramesh J V
Sales Engineer, Denodo
Paul Moxon
SVP Data Architectures & Chief Evangelist, Denodo
The Economist, May 2017
The world’s most valuable resource
is no longer oil, but data.
Data – Like Oil – Is Not Easy To Extract and Use
Forbes Insights – AI Adoption Survey
Source: Forbes Insights Survey – The AI Learning Curve, By The Numbers
72%
Forbes Insights – AI Adoption Survey
Source: Forbes Insights Survey – The AI Learning Curve, By The Numbers
AI and Machine Learning Need Data
Predicting high-risk patients – Data includes patient demographics, family history, patient vitals, lab test results, past medication history, visits to the hospital, and any claims data.
Predicting equipment failure – Data may include maintenance logs kept by technicians, especially for older machines. For newer machines, data comes from the machine’s different sensors, including temperature, running time, power level durations, and error messages.
Predicting default risks – Data includes company or individual demographics, products purchased or used, past payment history, customer support logs, and any recent adverse events.
Preventing fraudulent claims – Data includes the location where the claim originated, time of day, claimant history, claim amount, and even public data such as the National Fraud Database.
Predicting customer churn – Data includes customer demographics, products purchased, product usage, customer calls, time since last contact, past transaction history, industry, company size, and revenue.
But the Data is Somewhere in Here…
Confirmation of the Constraints on ML/AI…
Source: Machine learning in UK financial services, Bank of England
and Financial Conduct Authority, October 2019
The Scale of the Problem…
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify data useful for the case
• Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
• Iterate 2-6 until valuable insights are
produced
7. Visualize and share
Typical Data Science Workflow
80% of time – Finding and preparing the data
10% of time – Analysis
10% of time – Visualizing data
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data might be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
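Much of that time goes into hand-written preparation scripts. A minimal pandas illustration of the profiling and cleansing steps (the column names and values here are invented for the example, not data from the webinar):

```python
import pandas as pd

# Invented sample with typical quality problems:
# inconsistent casing, missing values, duplicate rows.
df = pd.DataFrame({
    "station": ["Central Park", "central park", None, "Times Sq"],
    "trips":   [120, 120, 85, None],
})

# Profile: fraction of missing values per column.
print(df.isna().mean())

# Cleanse: normalize text, then drop incomplete and duplicate records.
df["station"] = df["station"].str.strip().str.title()
clean = df.dropna().drop_duplicates()
print(len(clean))  # → 1 (the two "Central Park" rows collapse to one)
```

Scripts like this tend to be rewritten per source and per project, which is exactly the overhead the deck argues a virtual layer removes.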
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry
Cook, Gartner April 2018
Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical
Needs, May 2018
“When designed properly, Data Virtualization can speed data
integration, lower data latency, offer flexibility and reuse, and reduce
data sprawl across dispersed data sources.
Due to its many benefits, Data Virtualization is often the first step for
organizations evolving a traditional, repository-style data warehouse
into a Logical Architecture”
Benefits of a Virtual Data Layer
 A Virtual Layer improves decision making and shortens development cycles
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
 A Virtual Layer broadens usage of data
• Improves governance and metadata management to avoid “data swamps”
• Decouples data source technology; access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
 A Virtual Layer offers performant access
• Leverages the processing power of the existing sources, coordinated by Denodo’s optimizer
• Processes data itself for sources with no processing capabilities (e.g. files)
• Caching and ingestion engine to persist data when needed
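Because access is normalized via SQL, a script or notebook can treat the virtual layer as one relational database over a standard ODBC/JDBC connection. A minimal Python sketch (the DSN, credentials, and the bike_trips view are hypothetical placeholders, not a real deployment):

```python
# One SQL query, regardless of where the underlying data actually lives
# (warehouse tables, REST APIs, flat files, ...).
SQL = """
SELECT station, COUNT(*) AS trips
FROM bike_trips      -- a virtual view combining several sources
GROUP BY station
"""

def trips_per_station(dsn="DSN=denodo_vdp;UID=analyst;PWD=changeme"):
    """Query the virtual layer as if it were a single database."""
    import pyodbc  # any standard ODBC client works; imported lazily here
    with pyodbc.connect(dsn) as conn:
        return conn.cursor().execute(SQL).fetchall()
```

From Jupyter or Zeppelin the same query works unchanged whichever source technology sits behind the view.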
Data Scientist Workflow Steps
Identify useful
data
Modify data into
a useful format
Analyze data Execute data
science algorithms
(ML, AI, etc.)
Share with
business users
Prepare for
ML algorithm
Product Demonstration:
Advanced Analytics and Machine Learning
with Data Virtualization
Senthil Ramesh J V
Sales Engineer, Denodo
https://flic.kr/p/x8HgrF
Can we predict the usage of the NYC
bike system based on data from
previous years?
Data Sources – Citibike
There are external factors to
consider.
Which ones?
https://flic.kr/p/CYT7SS
Data Sources – NWS Weather Data
What We’re Going To Do…
1. Connect to data and have a look (Find Data)
2. Format the data (prep it) so that we can look for significant factors (Explore Data)
• e.g. bike trips on different days of week, different months of year, etc.
3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm (Prepare the Data)
4. Using Python, read the 2017 data and run it through our ML algorithm for training (Train the Model)
5. Read the 2018 data, test the algorithm (Test the Model)
6. Save the results and load them into the Denodo Platform (Save the Results)
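Steps 4 and 5 boil down to a standard train-on-one-year, test-on-the-next loop. A sketch with scikit-learn, using synthetic data as a stand-in for the Citi Bike trip counts (the features and the linear model are illustrative assumptions, not the demo’s actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_year(n_days=365):
    # Features: day of year and a seasonal temperature-like signal;
    # target: daily trips driven by temperature plus noise.
    day = np.arange(n_days)
    temp = 15 + 10 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 2, n_days)
    trips = 500 + 40 * temp + rng.normal(0, 100, n_days)
    return np.column_stack([day, temp]), trips

X_train, y_train = make_year()   # stands in for the 2017 data
X_test, y_test = make_year()     # stands in for the held-out 2018 data

model = LinearRegression().fit(X_train, y_train)   # step 4: train
r2 = model.score(X_test, y_test)                   # step 5: test
print(f"R^2 on the held-out year: {r2:.2f}")
```

Testing on a year the model never saw is what makes the accuracy estimate honest; in the demo the 2018 data plays that role.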
Demo
Prologis – Operationalizing AI/ML
$1.5 TRILLION is the economic value of goods flowing through our distribution centers each year, representing:
• 2.8% of GDP for the 19 countries where we do business
• 2.0% of the world’s GDP
Founded in 1983 | Global 100 most sustainable corporations | 768 MSF
$87B assets under management on four continents
1.0 million employees under Prologis’ roofs
Prologis – Data Science Workflow
Step 1: Expose Data to Data Scientists
Prologis – Data Science Workflow
Step 2: Operationalization of Model Scoring – a Python model-scoring web service running on AWS Lambda
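Operationalizing model scoring on AWS Lambda typically reduces to a small handler wrapped around a model loaded at cold start. A minimal sketch (the feature names and the scoring rule are hypothetical stand-ins, not Prologis’ actual model):

```python
import json

def _score(features):
    # Stand-in for a real model loaded once at cold start
    # (in practice: deserialized from a bundled file or S3 object).
    return 0.3 * features["utilization"] + 0.7 * features["demand_index"]

def handler(event, context):
    """Lambda entry point: JSON body of features in, score out,
    in the response shape API Gateway expects."""
    features = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"score": _score(features)}),
    }
```

Keeping the model outside the handler function means one load per container rather than one per request, which is what makes Lambda practical for scoring.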
Key Takeaways
 The Denodo Platform makes all kinds of data – from a variety of
data sources – readily available to your data analysts and data
scientists
 Data virtualization shortens the ‘data wrangling’ phases of
analytics/ML projects
 Avoids needing to write ‘data prep’ scripts in Python, R, etc.
 It’s easy to access and analyze the data from analytics tools such as
Zeppelin or Jupyter
 You can use the Denodo Platform to share the results of your
analytics with others
 Finally…People don’t like to ride their bikes in the snow
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY
Thank you!
© Copyright Denodo Technologies. All rights reserved.
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
