Watch here: https://bit.ly/3cZGCxr
For machine learning and data science projects to succeed, data scientists need access to all of the enterprise's data, across its myriad data models. However, getting all of that data integrated into a central repository has been a challenge, and often 80% of project time is spent on these integration tasks. A virtual layer can speed up some of the most tedious tasks, such as data exploration and analysis, while integrating well with the data science ecosystem: there is no need to change tools or learn new languages. A data virtualization platform lets data scientists offload data integration work and focus on advanced analytics.
In this session, you will learn how data virtualization:
- Provides access to all of the enterprise data, in real time and without replication
- Enables data scientists to create and share multiple logical models using simple drag and drop
- Provides a catalog of all business definitions, lineage, and relationships
2. How Data Virtualization adds value to your data science stack
Chris Day
Director, APAC Sales Engineering, Denodo
Sushant Kumar
Product Marketing Manager, Denodo
3. Agenda
1. The data science stack
2. The data science workflow
3. Logical data lake architecture
4. Data virtualization features for data scientists
5. Demo
6. Q&A
7. Next Steps
4. How Data Virtualization adds value to your data science stack
Sushant Kumar
Product Marketing Manager, Denodo
5. The Tools of Data Science
When thinking about data science, most minds immediately go to languages like Python and R, or to tools like Spark and TensorFlow. In reality, a myriad of projects currently serve the needs of data scientists.
6. The Data Scientist Workflow
A typical workflow for a data scientist (sketched in code after this list) is:
1. Gather the requirements for the business problem
2. Identify useful data
▪ Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
▪ Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
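As a rough illustration of steps 2 to 7, here is a minimal Python sketch; the input file, column names, and model choice are hypothetical placeholders, not part of the original workflow.

```python
# Minimal sketch of the workflow above (steps 2-7).
# The file "customers.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 2. Identify useful data and ingest it
df = pd.read_csv("customers.csv")

# 3. Cleanse data into a useful format
df = df.dropna(subset=["age", "monthly_spend", "churned"])

# 4. Analyze data
print(df.describe())

# 5. Prepare input for the algorithm
X = df[["age", "monthly_spend"]]
y = df["churned"]

# 6. Execute a data science algorithm (here, a simple classifier)
model = LogisticRegression().fit(X, y)

# 7. Visualize and share (a summary metric, in the simplest case)
print("Training accuracy:", model.score(X, y))
```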
7. Where does your time go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data may be
• Getting access to the data
• Bureaucracy
• Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
9. Data Scientist Flow
[Flow diagram: Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)]
10. Identify useful data
If the company has a virtual layer with good coverage of data sources, this task is greatly simplified:
• A data virtualization tool like Denodo can offer unified access to all data available in the company
• It abstracts the technologies underneath, offering a standard SQL interface to query and manipulate data (see the sketch below)
To further simplify the challenge, Denodo offers a Data Catalog to search, find and explore your data assets
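As a minimal sketch of what that SQL interface looks like from a data scientist's desk: the ODBC DSN name, the credentials, and the view name customer_360 below are assumptions for illustration; connection details depend on your Denodo deployment.

```python
# Sketch: querying a Denodo virtual view with plain SQL over ODBC.
# The DSN, credentials and the view name "customer_360" are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")
cursor = conn.cursor()

# Standard SQL, regardless of where the underlying data actually lives
cursor.execute(
    "SELECT customer_id, region, total_spend FROM customer_360 LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
conn.close()
```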
11. Search & Explore: Metadata
Search the catalog and refine your results using descriptions, tags and business categories
13. Document your models
Rich HTML descriptions, editable directly from the catalog
Extended metadata support to enrich the catalog with custom fields and details
14. Data Scientist Flow
[Flow diagram repeated: Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)]
15. Ingestion and Data Manipulation tasks
• Typically, data scientists get data from a variety of places, through various formats and protocols: from relational databases to REST web services or NoSQL engines
• Data is often exported into CSV files or loaded into Spark
• Later, that data is manipulated in scripts (e.g. in Python with Pandas)
• However, data virtualization offers the unique opportunity of using standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data (see the sketch below)
• Cleansing and transformation steps can be easily accomplished in SQL
• Its modeling capabilities enable the definition of views that embed this logic to foster reusability
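As an illustration of that pattern, here is a hedged sketch: the join and aggregation are expressed in SQL and executed by the virtual layer, and only the result lands in Pandas. The DSN and the view names sales and web_visits are hypothetical.

```python
# Sketch: push a join + aggregation into the virtual layer, pull only
# the result into Pandas. View names ("sales", "web_visits") and the
# DSN are hypothetical placeholders.
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

query = """
    SELECT s.customer_id,
           SUM(s.amount)     AS total_sales,
           COUNT(w.visit_id) AS visit_count
    FROM sales s
    JOIN web_visits w ON s.customer_id = w.customer_id
    GROUP BY s.customer_id
"""
# The heavy lifting (join, aggregation) happens in the virtual layer,
# not in this script
df = pd.read_sql(query, conn)
conn.close()
```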
18. Denodo and Spark: data science with large volumes
Spark as a source: Spark, as well as many other Hadoop systems (Hive, Presto, Impala, HBase, etc.), can be used by Denodo as a data source to read data
• Denodo will push down execution to those systems, translating SQL into their corresponding dialects
Spark as the processing engine: in cases where Denodo needs to post-process data, for example in multi-source queries, Denodo can automatically lift and shift the execution to Spark's engine
Spark as the data target: Denodo can automatically save the data from any execution to a target Spark cluster when your processing needs (e.g. SparkML) require local data
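One common way to combine the two from the Spark side is a plain JDBC read of a Denodo view, sketched below for a SparkML-style workload. Note this is the generic Spark JDBC pattern rather than the automated mechanisms described above, and the URL format, driver class, and view name are assumptions to verify against your Denodo JDBC driver documentation.

```python
# Sketch: load a Denodo view into Spark over JDBC for large-volume work.
# URL format, driver class and view name are assumptions; verify them
# against your Denodo JDBC driver documentation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denodo-to-spark").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:vdb://denodo-host:9999/analytics")  # assumed URL
      .option("driver", "com.denodo.vdp.jdbc.Driver")          # assumed class
      .option("dbtable", "customer_360")                       # hypothetical view
      .option("user", "data_scientist")
      .option("password", "secret")
      .load())

# From here, standard Spark / SparkML processing on cluster-local data
df.groupBy("region").count().show()
```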
20. Key Takeaways
✓ Denodo can play a key role in the data science ecosystem, reducing data exploration and analysis timeframes
✓ Extends and integrates with the capabilities of notebooks, Python, R, etc. to improve the data scientist's toolset
✓ Provides a modern "SQL-on-Anything" engine
✓ Can leverage Big Data technologies like Spark (as a data source, an ingestion tool and for external processing) to work efficiently with large data volumes
✓ Helps productionize data science