Syncsort and the comScore Experience Report

High Performance ETL in a #BigData #Hadoop context
Steven Haddad – Senior Software Architect
Stéphane Heckel – Partner Manager
Hadoop User Group - September 12th 2012

Syncsort – Solving Big Data Breakpoints for 40 years
Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and
unparalleled integration technology
Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals
Expertise & Specialism
• Leading provider of high-performance
data integration solutions
• Data Integration Acceleration and Cost
Optimization
• Delivering Cost Reduction Initiatives
whilst delivering superior performance
• Typical TCO reduction of 50% - 75%
• Customer ROI within 12 months
• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS

A Fully Integrated Architecture for High-performance ETL
User Interface: Task Editor │ Job Editor │ SDK
Template-driven Design
Shared File-based Metadata Repository
• Data Lineage
• Metadata Interchange
• Global Search
• Impact Analysis
DMExpress Server Engine
• Small Footprint ETL Engine
• Self-tuning Optimizer
• Native, Direct I/O Access
• High Performance Transformations
• High Performance Functions
• Automatic Continuous Optimization
High Performance Connectivity
• Mainframe
• Files / XML
• Appliances
• Hadoop
• Cloud
• Real Time
Install in Minutes. Deploy in Weeks. Never Tune Again.

Syncsort’s Hadoop value proposition
Syncsort Value proposition on Hadoop

Syncsort Goes Beyond Basic Connectivity to Enhance
Hadoop and Facilitate Wider Adoption
• HDFS connectivity: Ability to move data in and out of
the Hadoop file system
• Enhanced usability: Ability to create jobs using the DMExpress
graphical user interface and run them in the Hadoop MapReduce
framework
• Contribute to the Open Source Community: Enhance the
Hadoop sort framework for everyone. Make it more
modular, flexible, and extensible
• Accelerate Hadoop: Address existing drawbacks in Hadoop
native sort by providing a simple, self-tuning alternative to
increase overall MapReduce performance and facilitate
ongoing development and maintenance
Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and
processing capabilities to optimize Hadoop environments
Extract → Preprocess & Compress → Load
Sources: RDBMS, Appliances, Cloud, XML, Mainframe, Files
Preprocessing: Sort, Aggregate, Join, Compress, Partition
Target: HDFS (Data Nodes)
[Chart: Elapsed Processing Time. Load time (min), HDFS Put vs. DMExpress]
Connect to virtually any
source
Pre-process data to cleanse,
validate, & partition for better
and faster Hadoop processing
and significant storage savings
Load data up to 6x faster!
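
The pre-processing step described above (cleanse, sort, partition, compress) can be illustrated with a small, generic Python sketch. This is not DMExpress code and says nothing about how DMExpress works internally. The function name `preprocess`, the tab-separated record layout, and the `part-NNNNN.gz` file naming are assumptions made for the example.

```python
import gzip
import zlib
from pathlib import Path


def preprocess(lines, out_dir, num_partitions=4):
    """Cleanse, sort, partition and gzip-compress text records.

    Illustrative only: assumes one record per line, with the key in the
    first tab-separated field (a hypothetical layout).
    """
    # Cleanse: strip surrounding whitespace and drop empty records.
    records = [ln.strip() for ln in lines if ln.strip()]
    # Sort once, so every partition is written in sorted order.
    records.sort()
    # Partition on a stable hash of the key, so equal keys share a file.
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        key = rec.split("\t", 1)[0]
        buckets[zlib.crc32(key.encode("utf-8")) % num_partitions].append(rec)
    # Compress: write each partition as its own gzip file.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, bucket in enumerate(buckets):
        path = out_dir / f"part-{i:05d}.gz"
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for rec in bucket:
                fh.write(rec + "\n")
        paths.append(path)
    return paths
```

Each output file is independent, which is what makes a later parallel load possible; compressing before the load is also where the storage savings mentioned on the slide would come from.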

DMExpress – HDFS Connectivity
Load HDFS
– Partition the output for parallel loading
– Makes full use of network bandwidth with
reduced elapsed time
– Hadoop/DMExpress can process wildcard
input files from HDFS
Extract HDFS
– DMExpress can read wildcard inputs in
parallel
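
As a rough illustration of the partitioned, parallel load idea above, the sketch below expands a wildcard into partition files and pushes them to HDFS concurrently with the standard `hadoop fs -put` shell command. It is a generic stand-in, not the DMExpress mechanism. The helper names `put_command` and `load_partitions`, the worker count, and the injectable `runner` parameter are assumptions for the example.

```python
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor


def put_command(local_path, hdfs_dir):
    """Build the Hadoop shell command that copies one local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def load_partitions(pattern, hdfs_dir, max_workers=4, runner=subprocess.run):
    """Load every file matching a wildcard into HDFS, several at a time.

    `runner` is injectable so the function can be exercised without a
    cluster; by default it shells out with subprocess.run.
    """
    files = sorted(glob.glob(pattern))
    commands = [put_command(path, hdfs_dir) for path in files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any failure from a worker.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

A call such as `load_partitions("/staging/part-*.gz", "/data/in", max_workers=20)` (hypothetical paths) would start up to 20 uploads at once, one per partition file.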
Distributions supported
– Cloudera CDH3u3
– Hortonworks Data Platform 1.0.7
– Greenplum HD 1.1

DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 10GB to 100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
3x-6x Faster!
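
To put the test parameters in perspective, the short calculation below works out how large each partition is and how many HDFS blocks it would span. It assumes the input is split evenly across the 20 partitions, that 1 GB = 1024 MB, and that each partition is stored as its own file; none of these is stated on the slide. The function name `blocks_per_partition` is made up for the example.

```python
import math


def blocks_per_partition(total_gb, partitions=20, block_mb=256):
    """Return (partition size in MB, HDFS blocks per partition).

    Assumes equal-size partitions and binary units (1 GB = 1024 MB).
    """
    partition_mb = total_gb * 1024 / partitions
    return partition_mb, math.ceil(partition_mb / block_mb)


print(blocks_per_partition(10))   # (512.0, 2)
print(blocks_per_partition(100))  # (5120.0, 20)
```

Under these assumptions, the smallest test (10 GB) gives each partition two blocks, and the largest (100 GB) gives each partition twenty.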

DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 100 GB to 2,100 GB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
6x Faster!

Enabling Storage Savings and
Accelerating Performance with DMExpress
DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and
partitioning
• Achieve higher rate of parallelism
• Realize up to 75 TB of data storage savings a month
Volume: 32B records / day
Processing steps: Load files → Cleanse, sort, compress, partition → Load to HDFS → Post-processing & analysis
[Diagram: DMExpress, Hadoop nodes, HDFS]

Michael Brown, Chief Scientist, comScore

DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache
Hadoop (JIRA MAPREDUCE-2454)
– Allow external sort to be plugged in
– Improve developer productivity
• Develop MapReduce jobs via DMExpress GUI
– Aggregations, cleansing/filtering, reformatting,
etc.
– Seamlessly accelerate MapReduce performance
• Replace Map output sorter
• Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
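
The idea of a pluggable sort, as described above, can be sketched conceptually in a few lines of Python. This is not the Hadoop API and not the MAPREDUCE-2454 patch; it is a toy model in which the map-output sorter and the reduce-input sorter are parameters that can be swapped without touching the job logic. All names here (`builtin_sort`, `external_sort`, `run_job`) are hypothetical.

```python
from itertools import groupby
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, int]
Sorter = Callable[[Iterable[Record]], List[Record]]


def builtin_sort(records: Iterable[Record]) -> List[Record]:
    """Stand-in for the framework's native sort (orders records by key)."""
    return sorted(records, key=lambda kv: kv[0])


def external_sort(records: Iterable[Record]) -> List[Record]:
    """Hypothetical stand-in for a plugged-in sorter; same ordering contract."""
    out = list(records)
    out.sort(key=lambda kv: kv[0])
    return out


def run_job(map_outputs: List[List[Record]],
            map_sorter: Sorter = builtin_sort,
            reduce_sorter: Sorter = builtin_sort) -> Dict[str, int]:
    """Toy MapReduce job that sums values per key with pluggable sorters."""
    # Map side: each map task sorts its own output.
    sorted_runs = [map_sorter(run) for run in map_outputs]
    # Reduce side: combine all runs and sort them before grouping by key.
    merged = reduce_sorter(rec for run in sorted_runs for rec in run)
    return {key: sum(value for _, value in group)
            for key, group in groupby(merged, key=lambda kv: kv[0])}
```

Because both sorters honor the same contract (records ordered by key), `run_job(data)` and `run_job(data, external_sort, external_sort)` return the same result; only the sort implementation changes, which is the sense in which the acceleration is meant to be seamless.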

Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development:
✗ Lots of manual coding: MapReduce, Pig, Java
✗ Limited skills supply
✗ Steep learning curve
DMExpress Hadoop Edition:
✓ No coding required
✓ Leverages the same skills most IT
organizations already have
✓ New resources can be trained in just 3 days

Native MapReduce DMExpress Execution
DMExpress Hadoop does not
generate code (e.g., Java, Pig,
Python)
DMExpress Hadoop runs natively
on each data node in the cluster
– DMExpress is installed on each
data node
– Same benefits as High-performance
ETL
Issues with code generation
– Requires re-compilation with every
change
– May still require MR skills
– Ongoing issues with efficiency of
generated code
[Diagram: DMX running on each node of the Hadoop Cluster]

DMExpress Hadoop Edition Provides Significant
Performance Improvements
TPC-H Benchmark
– Filter & Aggregation
– GZIP compression
– File size from 100 GB to 2.4 TB
– Hadoop distribution: CDH3U2
– Red Hat EL 5.8
– 6 disks/node
– Read: 870 MB/s, Write: 660 MB/s
– Memory: 94 GB
[Chart: TPC-H - Aggregation. Elapsed time (sec) vs. file size (GB) for Java, Pig, and DMExpress]
Almost 2x faster than Java; over 2x faster than Pig
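
For readers unfamiliar with the workload type, a filter-and-aggregation job has roughly the shape sketched below, written as a mapper and reducer in plain Python. This is not the benchmark query or any of the benchmarked implementations. The field names (`flag`, `quantity`, `price`) and the filter predicate are invented for illustration.

```python
from collections import defaultdict


def mapper(row):
    """Filter step: emit (key, value) only for rows that pass the predicate."""
    if row["quantity"] > 0:
        yield row["flag"], row["price"] * row["quantity"]


def reducer(key, values):
    """Aggregation step: one total per key."""
    return key, sum(values)


def run(rows):
    """Group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

In a real cluster the grouping step in `run` is where the framework's sort and shuffle happen, which is why sort performance matters for this class of job.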

Conclusion
Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
– DMExpress partitioning allows taking advantage of
full network bandwidth
– High performance parallel load from HDFS to GP
DB
Integration with diverse set of sources
– Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
– Aggregation, sort, filter, copy, reformatting, join,
merge
Seamless high performance sort
