High Performance ETL in a #BigData #Hadoop context
Steven Haddad – Senior Software Architect
Stéphane Heckel – Partner Manager
Hadoop User Group - September 12th 2012
Syncsort – Solving Big Data Breakpoints for 40 years
Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and
unparalleled integration technology
Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals
Expertise & Specialism
• Leading provider of high-performance
data integration solutions
• Data Integration Acceleration and Cost
Optimization
• Delivering Cost Reduction Initiatives
whilst delivering superior performance
• Typical TCO reduction of 50% - 75%
• Customer ROI within 12 months
• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS
A Fully Integrated Architecture for High-performance ETL
[Architecture diagram]
• User Interface: Task Editor │ Job Editor, SDK
• Shared File-based Metadata Repository: Data Lineage, Metadata Interchange, Global Search, Impact Analysis
• Small Footprint ETL Engine, Self-tuning Optimizer, Native Direct I/O Access
• High Performance Connectivity: Mainframe, Files / XML, Appliances, Hadoop, Cloud, Real Time
• DMExpress Server Engine: Template-driven Design, High Performance Transformations, High Performance Functions, Automatic Continuous Optimization
Install in Minutes. Deploy in Weeks. Never Tune Again.
Syncsort’s Hadoop value proposition
Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
• HDFS connectivity: Ability to move data in and out of the Hadoop file system
• Enhanced usability: Ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
• Contribute to the Open Source Community: Enhance the Hadoop sort framework for everyone. Make it more modular, flexible, and extensible
• Accelerate Hadoop: Address existing drawbacks in Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
Syncsort Confidential and Proprietary - do not copy or distribute
Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and
processing capabilities to optimize Hadoop environments
[Diagram: Extract → Preprocess & Compress → Load. Sources: RDBMS, Appliances, Cloud, XML, Mainframe, Files. Operations: Sort, Aggregate, Join, Compress, Partition. Target: HDFS data nodes]
[Chart: Elapsed Processing Time. Load Time (min), HDFS Put vs. DMExpress]
Connect to virtually any source
Pre-process data to cleanse, validate, and partition for better and faster Hadoop processing and significant storage savings
Load data up to 6x faster!
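The pre-processing step described above (cleanse, validate, partition, compress before loading) can be sketched in a few lines of Python. This is only an illustration of the general idea, not how DMExpress works internally: the comma-separated record layout, the validation rule, and the function names `partition_records` and `preprocess` are all assumptions made for the example.

```python
import gzip
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def partition_records(records: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Assign each record to a partition by hashing its key (the first CSV field)."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for record in records:
        key = record.split(",", 1)[0]
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        partition_id = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[partition_id].append(record)
    return partitions


def preprocess(records: Iterable[str], num_partitions: int) -> Dict[int, bytes]:
    """Cleanse, validate, partition, sort, and gzip records.

    Returns one compressed blob per non-empty partition.
    """
    cleansed = (r.strip() for r in records)
    # Hypothetical validation rule: keep non-empty records that contain a delimiter.
    valid = (r for r in cleansed if r and "," in r)
    blobs: Dict[int, bytes] = {}
    for partition_id, rows in partition_records(valid, num_partitions).items():
        payload = "\n".join(sorted(rows)) + "\n"
        blobs[partition_id] = gzip.compress(payload.encode("utf-8"))
    return blobs
```

Each compressed partition could then be handed to a separate loader, which is what lets the load step run in parallel.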
DMExpress – HDFS Connectivity
[Diagram: Input, DMExpress, HDFS]
Load HDFS
– Partition the output for parallel loading
– Makes full use of network bandwidth with reduced elapsed time
– Hadoop/DMExpress can process wildcard input files from HDFS
Extract HDFS
– DMExpress can read wildcard inputs in parallel
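As a rough illustration of reading wildcard inputs in parallel, the sketch below expands a glob pattern and processes each matching file on its own worker thread. It uses the local file system as a stand-in for HDFS, and the names `count_lines` and `read_wildcard_parallel` are hypothetical; it does not reflect the DMExpress implementation.

```python
import glob
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def count_lines(path: str) -> int:
    """Read one input file and return its line count."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def read_wildcard_parallel(pattern: str, max_workers: int = 4) -> Dict[str, int]:
    """Expand a wildcard pattern and read every matching file concurrently."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so counts line up with paths.
        counts = list(pool.map(count_lines, paths))
    return dict(zip(paths, counts))
```

A call such as `read_wildcard_parallel("/data/part-*.txt")` (an example path) would return one line count per matching file.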
Distributions supported
– Cloudera CDH3u3
– Hortonworks Data Platform 1.0.7
– Greenplum HD 1.1
DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 10GB to 100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
3x-6x faster!
DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 100GB to 2100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
6x faster!
Enabling Storage Savings and Accelerating Performance with DMExpress
DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month
[Diagram: 32B records/day. Load files → DMExpress (cleanse, sort, compress, partition) → Load to HDFS → Hadoop nodes (post-processing & analysis)]
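One plausible reason pre-sorting contributes to storage savings is that sorted records place similar values next to each other, which general-purpose compressors exploit. The sketch below compares gzip output sizes for the same synthetic records in random and sorted order. The data, field layout, and function names are invented for illustration and say nothing about comScore's actual data or the savings quoted on the slide.

```python
import gzip
import random
from typing import List, Tuple


def gzip_size(lines: List[str]) -> int:
    """Return the gzip-compressed size in bytes of the newline-joined lines."""
    return len(gzip.compress("\n".join(lines).encode("utf-8")))


def compare_sorted_vs_unsorted(num_records: int = 20000, seed: int = 0) -> Tuple[int, int]:
    """Build synthetic 'site,country,value' records and compress them both ways.

    Returns (unsorted_size, sorted_size) in bytes.
    """
    rng = random.Random(seed)
    sites = [f"site{i:03d}.example" for i in range(50)]  # hypothetical site names
    countries = ["US", "FR", "DE", "JP", "BR"]
    lines = [
        f"{rng.choice(sites)},{rng.choice(countries)},{rng.randint(0, 999)}"
        for _ in range(num_records)
    ]
    return gzip_size(lines), gzip_size(sorted(lines))
```

Calling `compare_sorted_vs_unsorted()` returns both sizes so the effect of ordering can be inspected directly; the exact ratio depends heavily on the data.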
Michael Brown, Chief Scientist, comScore
DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
– Allow external sort to be plugged in
– Improve developer productivity
• Develop MapReduce jobs via DMExpress GUI
– Aggregations, cleansing/filtering, reformatting, etc.
– Seamlessly accelerate MapReduce performance
• Replace Map output sorter
• Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
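The idea of a pluggable sort can be illustrated with a toy word-count pipeline in which the map-side sorter and the reduce-side merger are passed in as parameters. This is a conceptual sketch only: it is not Hadoop's API, and the names `map_task`, `reduce_task`, `builtin_sorter`, and `builtin_merger` are hypothetical.

```python
import heapq
from typing import Callable, Dict, Iterable, List, Tuple

KV = Tuple[str, int]
Sorter = Callable[[List[KV]], List[KV]]
Merger = Callable[[List[List[KV]]], Iterable[KV]]


def builtin_sorter(pairs: List[KV]) -> List[KV]:
    """Default map output sorter: order key/value pairs by key."""
    return sorted(pairs, key=lambda kv: kv[0])


def builtin_merger(runs: List[List[KV]]) -> Iterable[KV]:
    """Default reduce input sorter: merge already-sorted runs by key."""
    return heapq.merge(*runs, key=lambda kv: kv[0])


def map_task(lines: Iterable[str], sorter: Sorter = builtin_sorter) -> List[KV]:
    """Emit (word, 1) pairs and sort them with whichever sorter is plugged in."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    return sorter(pairs)


def reduce_task(runs: List[List[KV]], merger: Merger = builtin_merger) -> Dict[str, int]:
    """Merge sorted map outputs with the plugged-in merger and sum counts per key."""
    totals: Dict[str, int] = {}
    for key, value in merger(runs):
        totals[key] = totals.get(key, 0) + value
    return totals
```

Because the job logic only depends on the `Sorter` and `Merger` signatures, an external implementation can be swapped in without changing `map_task` or `reduce_task`.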
Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development:
• Lots of manual coding: MapReduce, Pig, Java
• Limited skills supply
• Steep learning curve
DMExpress Hadoop Edition:
• No coding required
• Leverages the same skills most IT organizations already have
• New resources can be trained in just 3 days
Native MapReduce DMExpress Execution
DMExpress Hadoop does not generate code (e.g., Java, Pig, Python)
DMExpress Hadoop runs natively on each data node in the cluster
– DMExpress is installed on each data node
– Same benefits as High-performance ETL
Issues with code generation
– Requires re-compilation with every change
– May still require MR skills
– Ongoing issues with efficiency of generated code
[Diagram: Hadoop cluster with DMX running on each node]
DMExpress Hadoop Edition Provides Significant Performance Improvements
[Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB); series: Java, Pig, DMExpress]
TPC-H Benchmark
– Filter & Aggregation
– GZIP compression
– Uncompressed input file size
from 100GB to 2.4TB
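For readers unfamiliar with this kind of workload, a filter-and-aggregate job has roughly the shape sketched below. The slides do not say which TPC-H query was used, so the row layout (a subset of `lineitem`-style columns), the date filter, and the function name `filter_and_aggregate` are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (returnflag, quantity, extendedprice, shipdate as YYYY-MM-DD)
Row = Tuple[str, int, float, str]


def filter_and_aggregate(rows: Iterable[Row], cutoff_date: str) -> Dict[str, Tuple[int, float]]:
    """Keep rows shipped on or before cutoff_date, then sum quantity and price per flag."""
    totals = defaultdict(lambda: [0, 0.0])
    for flag, quantity, price, shipdate in rows:
        # ISO-formatted dates compare correctly as plain strings.
        if shipdate <= cutoff_date:
            totals[flag][0] += quantity
            totals[flag][1] += price
    return {flag: (qty, amount) for flag, (qty, amount) in totals.items()}
```

In a MapReduce setting the filter would typically run in the map phase and the per-key sums in the reduce phase.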
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH3U2
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Read: 870 MB/s, Write: 660 MB/s
– Memory: 94 GB
Almost 2x faster than Java; over 2x faster than Pig
Conclusion
Syncsort Value proposition on Hadoop
DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
– DMExpress partitioning allows taking advantage of full network bandwidth
– High performance parallel load from HDFS to GP DB
Integration with diverse set of sources
– Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
– Aggregation, sort, filter, copy, reformatting, join, merge
Seamless high performance sort
Thank you

Syncsort and lessons learned at comScore
