Hug syncsort etl hadoop big data

High Performance ETL in a #BigData #Hadoop context


Published in: Technology
  • TY>> I removed the CDH3 U2, U2 part as MapR benchmark doesn’t show against which Cloudera version they were running.
  • "We can store twice as much data on the cluster, and we also use it to improve performance. One big problem it solved was the ability to chunk and split the large files we have into files that fit perfectly into the chunks on Hadoop. This enables us to have a higher rate of parallelism on compressed files while reducing our costs for disk on the cluster." --Mike Brown, CTO, comScore
  • With the sort accelerator, DMExpress performs only the sort, and the DMExpress tasks are generated automatically. Value: better performance, lower resource footprint. With the second approach, one can develop tasks for any processing, e.g. sort or aggregation. Value: better performance, lower resource footprint, and development via the UI (less coding in Java).
  • How would you do a join? Since a file is distributed in HDFS, the file can be loaded in parallel into an MPP EDW.
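
The sort-accelerator note above describes replacing only the sort step while the rest of the job stays unchanged. A minimal Python sketch of that idea follows. It is not Hadoop's or DMExpress's API: the names `default_sorter`, `run_merge_sorter`, and `map_stage` are hypothetical and exist only for illustration, and the run-and-merge sorter is a generic stand-in for an external sort engine.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, int]
Sorter = Callable[[List[Record]], List[Record]]


def default_sorter(records: List[Record]) -> List[Record]:
    """Baseline: sort the whole map output in memory by key."""
    return sorted(records, key=lambda kv: kv[0])


def run_merge_sorter(records: List[Record], run_size: int = 4) -> List[Record]:
    """Stand-in for a plugged-in sorter: sort fixed-size runs, then k-way merge."""
    runs = [
        sorted(records[i:i + run_size], key=lambda kv: kv[0])
        for i in range(0, len(records), run_size)
    ]
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))


def map_stage(lines: Iterable[str], sorter: Sorter = default_sorter) -> List[Record]:
    """Emit (word, 1) pairs, then hand them to whichever sorter is plugged in."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    return sorter(pairs)


lines = ["b a c", "a b a"]
baseline = map_stage(lines)                          # built-in sort
swapped = map_stage(lines, sorter=run_merge_sorter)  # plugged-in sort
```

The map logic never changes; only the `sorter` argument does, which is the property the note attributes to the accelerator approach.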
  • Transcript

    1. Hadoop User Group, September 12th 2012. High Performance ETL in a #BigData #Hadoop context. Steven Haddad, Senior Software Architect. Stéphane Heckel, Partner Manager, @hsteph, sheckel@syncsort.com, +33624654132
    2. Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
        • HDFS connectivity: ability to move data in and out of the Hadoop file system
        • Enhanced usability: ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
        • Contribute to the open source community: enhance the Hadoop sort framework for everyone; make it more modular, flexible, extensible
        • Accelerate Hadoop: address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
        Syncsort Confidential and Proprietary - do not copy or distribute
    3. A Fully Integrated Architecture for High-performance ETL. Install in Minutes. Deploy in Weeks. Never Tune Again.
        [Architecture diagram. Labels recoverable from the slide:]
        • User interface: Task Editor, Job Editor, SDK, Template-driven Design
        • Shared File-based Metadata Repository: Impact Analysis, Data Lineage, Global Search, Metadata Interchange
        • DMExpress Server Engine: High Performance Transformations, High Performance Functions, Small Footprint ETL Engine
        • Optimizer: Automatic Continuous Optimization, Self-tuning
        • High Performance Connectivity (Native, Direct I/O Access): Files / XML, Appliances, Real Time, Cloud, Hadoop, Mainframe
    4. Optimizing Hadoop Deployments. DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments.
        • Extract: connect to virtually any source (Cloud, Appliances, Files, XML, RDBMS, Mainframe)
        • Preprocess & Compress (Sort, Aggregate, Join, Compress, Partition): pre-process data to cleanse, validate, & partition for faster, more efficient Hadoop processing and significant storage savings
        • Load (HDFS data nodes): load data up to 6x faster!
        [Chart: Total Elapsed Time (hrs), Hadoop Put vs. DMExpress: 7.7 hrs vs. 1.1 hrs]
    5. DMExpress Provides Out-of-the-box HDFS Connectivity
        • Load HDFS
          – Partition the output for parallel loading
          – Makes full use of network bandwidth with reduced elapsed time
          – Hadoop/DMExpress can process wildcard input files from HDFS
        • Extract HDFS
          – DMExpress can read wildcard inputs in parallel
        • Distributions supported
          – Cloudera CDH4, CDH3u3
          – Hortonworks Data Platform 1.0.7
          – Greenplum HD 1.1
    6. DMExpress Accelerates Loading HDFS: 3x-6x Faster!
        • HDFS Load using DMExpress
          – 20 partitions
          – Uncompressed input file size from 10 GB to 100 GB
        • Cluster Specifications
          – Size: 10+1+1 nodes
          – Hadoop distribution: CDH4
          – HDFS block size: 256 MB
        • Hardware Specifications (Per Node)
          – Red Hat EL 5.8
          – Intel Xeon X5670 x 2
          – 6 disks/node
          – Write: 650 MB/s
          – Memory: 94 GB
    7. DMExpress Accelerates Loading HDFS: 6x Faster!
        • HDFS Load using DMExpress
          – 20 partitions
          – Uncompressed input file size from 100 GB to 2100 GB
        • Cluster Specifications
          – Size: 10+1+1 nodes
          – Hadoop distribution: CDH4
          – HDFS block size: 256 MB
        • Hardware Specifications (Per Node)
          – Red Hat EL 5.8
          – Intel Xeon X5670 x 2
          – 6 disks/node
          – Write: 650 MB/s
          – Memory: 94 GB
    8. Enabling Storage Savings and Accelerating Performance with DMExpress
        DMExpress is enabling comScore to:
        • Load data faster into HDFS
        • Store twice as much data on the cluster
        • Improve overall performance by pre-sorting, cleansing and partitioning
        • Achieve higher rate of parallelism
        • Realize up to 75 TB of data storage savings a month
        [Diagram: 32B records / day. Load files, then DMExpress (cleanse, sort, compress, partition, load to HDFS), then Hadoop nodes on HDFS (post-processing & analysis)]
    9. Michael Brown, Chief Scientist, comScore
    10. DMExpress Hadoop Integration
        • Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
          – Allow external sort to be plugged in
          – Improve developer productivity: develop MapReduce jobs via the DMExpress GUI (aggregations, cleansing/filtering, reformatting, etc.)
          – Seamlessly accelerate MapReduce performance: replace the Map output sorter and the Reduce input sorter
        https://issues.apache.org/jira/browse/MAPREDUCE-2454
    11. Accelerate Development & Remove Barriers to Adoption. Use DMExpress to Accelerate Development and Optimize MapReduce Jobs.
        • MapReduce Development:
          ✗ Lots of manual coding: MapReduce, Pig, Java
          ✗ Limited skills supply
          ✗ Heavy learning curve
        • DMExpress Hadoop Edition:
          ✓ No coding required
          ✓ Leverages the same skills most IT organizations already have
          ✓ New resources can be trained in just 3 days
    12. Native MapReduce DMExpress Execution
        • DMExpress Hadoop is not generating code (i.e., Java, Pig, Python)
        • DMExpress Hadoop runs native on each data node on the cluster
          – DMExpress is installed on each data node
          – Same benefits as High-performance ETL
        • Issues with code generation
          – Requires re-compilation with every change
          – May still require MR skills
          – Ongoing issues with efficiency of generated code
        [Diagram: Hadoop Cluster with DMX running on each data node]
    13. DMExpress Hadoop Edition Provides Significant Performance Improvements: almost 2x faster than Java; over 2x faster than Pig
        • TPC-H Benchmark
          – Filter & Aggregation
          – GZIP compression
          – Uncompressed input file size from 100 GB to 2.4 TB
        • Cluster Specifications
          – Size: 10+1+1 nodes
          – Hadoop distribution: CDH3U2
          – HDFS block size: 256 MB
        • Hardware Specifications (Per Node)
          – Red Hat EL 5.8
          – Intel Xeon X5670 x 2
          – 6 disks/node
          – Read: 870 MB/s, Write: 660 MB/s
          – Memory: 94 GB
        [Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB) for Java, Pig, and DMExpress]
    14. DMExpress Hadoop Edition Benefits
        • High performance HDFS load and extract
          – DMExpress partitioning allows taking advantage of full network bandwidth
          – High performance parallel load from HDFS to GP DB
        • Integration with diverse set of sources
          – Files, DBMS, mainframe
        • Ease of development (GUI vs. Java/Pig)
        • High performance ETL operations (MapReduce)
          – Aggregation, sort, filter, copy, reformatting, join, merge
        • Seamless high performance sort
    15. Syncsort – Solving Big Data Breakpoints for 40 years
        • Company Track Record
          – Global Software Company
          – 40+ Years of Performance Innovation
          – 25+ Patents related to unique and unparalleled integration technology
        • Large Established Customer Base
          – 16,000+ deployments
          – 68 Countries
          – Across all verticals
        • Expertise & Specialism
          – Leading provider of high-performance data integration solutions
          – Data Integration Acceleration and Cost Optimization
          – Delivering Cost Reduction Initiatives whilst delivering superior performance
          – Typical TCO reduction of 50% - 75%
          – Customer ROI within 12 months
        • Verticals: Data Services, Finance, Insurance & Healthcare, Travel & Transport, Retail, Telecommunications
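
The pre-load partitioning described in slides 4 to 8 (splitting large inputs into files that fit HDFS blocks, so that each file can be processed independently) can be sketched in Python as below. This is an illustrative greedy packing under stated assumptions, not DMExpress's actual method: `pack_into_chunks` is a hypothetical helper, record sizes are assumed to be known byte counts after compression, and the 256 MB block size is taken from the cluster specifications in the slides.

```python
from typing import Iterable, List


def pack_into_chunks(record_sizes: Iterable[int], block_size: int) -> List[List[int]]:
    """Greedily group consecutive records so that each chunk fits in one block."""
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0
    for size in record_sizes:
        if size > block_size:
            raise ValueError("a single record is larger than the block size")
        # Start a new chunk when the next record would overflow the block.
        if current and used + size > block_size:
            chunks.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        chunks.append(current)
    return chunks


MB = 1024 * 1024
BLOCK_SIZE = 256 * MB        # HDFS block size quoted in the slides
record_sizes = [100 * MB] * 5  # assumed sizes, for illustration only
chunks = pack_into_chunks(record_sizes, BLOCK_SIZE)
```

Because no chunk exceeds the block size, each output file occupies at most one block, which is one way to keep a non-splittable compressed file readable by a single task.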
