Watch the full webinar on demand here: https://goo.gl/5VyGns
Denodo Platform offers one of the most sought-after data fabric capabilities through data discovery, preparation, curation, and integration across the broadest range of data sources. As data volume and variety grow exponentially, Denodo Platform 7.0 will offer in-memory massive parallel processing (MPP) capability for the most advanced query optimization in the market.
Attend this session to learn:
• How Denodo Platform 7.0’s native built-in integration with MPP systems will provide query acceleration and MPP caching
• How to successfully approach highly complex big data scenarios, leveraging inexpensive MPP solutions
• How data-driven insights can be generated in real time with Denodo Platform once the MPP capability is in place
Agenda:
• Challenges with traditional architectures
• Denodo Platform MPP capabilities and applications
• Product demonstration
• Q&A
In Memory Parallel Processing for Big Data Scenarios
1. DATA VIRTUALIZATION PACKED LUNCH
WEBINAR SERIES
Sessions Covering Key Data Integration Challenges
Solved with Data Virtualization
2. In Memory Parallel Processing for Big
Data Scenarios
Pablo Alvarez-Yanez
Principal Solution Architect, Denodo
Paul Moxon
VP Data Architectures & Chief Evangelist, Denodo
3. Agenda
1. Modern Data Architecture
2. Denodo Platform – Big Data Integrations
3. Big Data Performance
4. Putting This All Together
5. Q&A
6. Next Steps
10. Hadoop as a Data Source
Denodo offers native connectors for all the major SQL-on-Hadoop engines:
• Hive
• Impala
• SparkSQL
• Presto
In addition, Denodo also offers connectivity for HBase and direct HDFS access to different file formats.
11. Hadoop as a Cache
Denodo uses an external RDBMS of your choice to persist copies of result sets and improve execution times.
• Since the data is persisted in an RDBMS, Denodo can push down relational operations, such as joins with other tables, to the database used for the cache
SQL-on-Hadoop systems can also be used as Denodo's cache.
The cache load process is based on direct loads to HDFS:
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
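A minimal sketch of steps 2 and 3 of this load process, assuming hypothetical helper names and simulating the HDFS upload (the real Denodo loader writes Snappy-compressed Parquet files; here each chunk is just counted):

```python
# Sketch of the chunked cache-load flow: split the result set into
# chunk-sized files, then upload them to HDFS in parallel. The helper
# names and paths are illustrative, not Denodo APIs.
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 3  # rows per file; the real chunk size would be far larger

def split_into_chunks(rows, chunk_size=CHUNK_SIZE):
    """Step 2: break the result set into chunks, one file per chunk."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def upload_to_hdfs(chunk_id, chunk):
    """Step 3 (simulated): upload one file; return its HDFS path and size."""
    return f"/cache/target_table/part-{chunk_id:05d}.parquet", len(chunk)

def load_cache(rows):
    chunks = split_into_chunks(rows)
    # Uploads run in parallel, as in step 3 of the load process.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda ic: upload_to_hdfs(*ic), enumerate(chunks)))

result_set = [("row", i) for i in range(10)]
uploads = load_cache(result_set)
print(len(uploads))   # 10 rows / 3 per chunk -> 4 files
print(uploads[0][0])  # /cache/target_table/part-00000.parquet
```

Uploading many smaller files in parallel, rather than one large file serially, is what keeps the cache load fast on HDFS.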
12. Hadoop as a Processing Engine
Denodo's optimizer provides native integration with MPP systems to provide one extra key capability: query acceleration.
Denodo can move processing to the MPP on demand during the execution of a query:
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when processing buffers don't fit into Denodo's memory (swapped data)
14. Combining Denodo's Optimizer with a Hadoop MPP
Denodo provides the most advanced optimizer in the market, with techniques focused on data virtualization scenarios with large data volumes.
In addition to traditional cost-based optimizations (CBO), Denodo's optimizer applies innovative optimization strategies designed specifically for virtualized scenarios, beyond traditional RDBMS optimizations.
Combined with its tight integration with SQL-on-Hadoop MPP databases, this creates a very powerful combination.
15. Example Scenario
Trend of sales by zip code over the previous years.
Scenario:
• Current data (last 12 months) in the EDW
• Historical data offloaded to a Hadoop cluster for cheaper storage
• Customer master data is used often, so it is cached in the Hadoop cluster
Very large data volumes:
• Sales tables have hundreds of millions of rows
(Query tree: Current Sales, 100 million rows, UNION Historical Sales, 300 million rows; the result is joined with Customer, 2 million rows, cached; then grouped by zip code.)
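The logical query for this scenario can be sketched in plain Python with toy data (this illustrates the query semantics, not the actual Denodo execution plan): union the two sales sets, join to the cached customer data, then aggregate sales by zip code.

```python
# Toy version of the example query: UNION of current and historical
# sales, JOIN with the cached customer table, GROUP BY zip code.
from collections import defaultdict

current_sales = [(1, 100.0), (2, 50.0)]          # (customer_id, amount)
historical_sales = [(1, 30.0), (3, 20.0)]
customer = {1: "07001", 2: "07002", 3: "07001"}  # customer_id -> zip (cached)

sales = current_sales + historical_sales         # UNION of both sales sets
totals = defaultdict(float)
for customer_id, amount in sales:                # JOIN + GROUP BY zip
    totals[customer[customer_id]] += amount

print(dict(totals))  # {'07001': 150.0, '07002': 50.0}
```

At the real data volumes, the interesting question is where each of these steps runs, which is what the next slide compares.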
16. Example: What are the options?
1. Option A: Simple Federation in the Virtual Layer
• Moves hundreds of millions of rows for processing in the virtual layer
2. Option B: Data Shipping
• Moves "Current Sales" to Hadoop and processes the content in the cluster
• Moves 100 million rows
3. Option C: Partial Aggregation Pushdown (Denodo 6)
• Modifies the execution tree to split the aggregation into two steps:
1. First by customer ID for the join (pushed down to the source)
2. Second by zip code for the final results (in the virtual layer)
• Significantly reduces network traffic, but the processing of a large amount of data in the virtual layer (the aggregation by zip code) becomes the bottleneck
4. Option D: Denodo's MPP Integration (Denodo 7)
(Diagram: execution trees for the simple federation and data shipping options, showing the join, the group by customer ID, and the group by zip code.)
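Option C's two-step split can be sketched with toy data: pre-aggregating by customer ID at the source before the join yields the same zip-code totals as aggregating the raw rows, while shipping far fewer rows over the network.

```python
# Sketch of partial aggregation pushdown (Option C). Toy data; in the
# real scenario the first step runs inside the source database.
from collections import defaultdict

sales = [(1, 100.0), (1, 30.0), (2, 50.0), (3, 20.0)]  # (customer_id, amount)
customer_zip = {1: "07001", 2: "07002", 3: "07001"}

# Step 1 (pushed down to the source): GROUP BY customer ID.
per_customer = defaultdict(float)
for customer_id, amount in sales:
    per_customer[customer_id] += amount  # one row per customer travels

# Step 2 (in the virtual layer): join to customer and GROUP BY zip code.
per_zip = defaultdict(float)
for customer_id, subtotal in per_customer.items():
    per_zip[customer_zip[customer_id]] += subtotal

print(dict(per_zip))      # {'07001': 150.0, '07002': 50.0}
print(len(per_customer))  # 3 rows shipped instead of 4 raw sales rows
```

At scale this shrinks the transfer from hundreds of millions of sales rows to roughly one row per customer; step 2 then becomes the bottleneck, which is exactly what Option D offloads to the MPP.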
17. Example: Denodo's Integration with the Hadoop Ecosystem
System     Execution Time   Optimization Techniques
Others     ~10 min          Simple federation
No MPP     43 sec           Aggregation push-down
With MPP   11 sec           Aggregation push-down + MPP integration (Impala, 8 nodes)
1. Partial aggregation push-down: maximizes source processing and dramatically reduces network traffic
2. Integrated with the cost-based optimizer: based on data volume estimations and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer: Denodo automatically generates and uploads Parquet files
4. Integration with local and pre-cached data: the engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing in inexpensive Hadoop-based solutions
(Diagram: Current Sales, 100 M rows, and Historical Sales, 300 M rows, are aggregated by customer ID into 2 M rows of sales by customer, then joined with Customer, 2 M rows, cached, and grouped by zip code.)
20. Putting all the pieces together
These three techniques (Hadoop as a data source, as a cache, and as a processing engine) can be combined to successfully approach complex scenarios with big data volumes in an efficient way:
• Surfaces all the company data without the need to replicate all of it to the Hadoop lake
• Improves governance and metadata management to avoid "data swamps"
• Allows for on-demand combination of real-time data (from the original sources) with historical data (in the cluster)
• Leverages the processing power of the existing cluster, controlled by Denodo's optimizer
21. Architecture
To benefit from this architecture, Denodo servers should run on edge nodes of the Hadoop cluster.
This will ensure:
• Faster uploads to HDFS
• Faster data retrieval from the MPP
• Better compatibility with the Hadoop configuration and library versions
(Diagram: a Denodo cluster, multiple nodes behind a load balancer for HA, running on Hadoop edge nodes, next to the Hadoop cluster's processing and storage nodes on the same subnet.)
23. Next steps
Download Denodo Express:
www.denodoexpress.com
Access Denodo Platform in the Cloud!
A 30-day FREE trial is available!
Denodo for Azure:
www.denodo.com/TrialAzure/PackedLunch
Denodo for AWS: www.denodo.com/TrialAWS/PackedLunch
24. Next session
Self-Service Information Consumption
Using Data Catalog
Thursday, March 15th, 2018 at 11:00am (PST)
Paul Moxon
VP Data Architectures and Chief Evangelist