Syncsort and the comScore Experience Report

Usage Rights

CC Attribution-NonCommercial-NoDerivs License

Syncsort and the comScore Experience Report: Presentation Transcript

  • Hadoop User Group, September 12th, 2012
    High-Performance ETL in a #BigData #Hadoop Context
    Steven Haddad – Senior Software Architect
    Stéphane Heckel – Partner Manager
  • Syncsort – Solving Big Data Breakpoints for 40 Years
    Company Track Record
    – Global software company
    – 40+ years of performance innovation
    – 25+ patents related to unique and unparalleled integration technology
    Large Established Customer Base
    – 16,000+ deployments
    – 68 countries
    – Across all verticals
    Expertise & Specialism
    – Leading provider of high-performance data integration solutions
    – Data integration acceleration and cost optimization
    – Delivering cost reduction initiatives whilst delivering superior performance
    – Typical TCO reduction of 50% to 75%
    – Customer ROI within 12 months
    Verticals: Data Services, Finance, Insurance & Healthcare, Travel & Transport, Retail, Telecommunications
  • A Fully Integrated Architecture for High-Performance ETL
    Install in Minutes. Deploy in Weeks. Never Tune Again.
    [Architecture diagram; components shown:]
    – User interface: Task Editor │ Job Editor, SDK, template-driven design
    – Shared file-based metadata repository: impact analysis, data lineage, global search, metadata interchange
    – DMExpress Server Engine: high-performance ETL engine, small footprint, high-performance transformations, functions
    – Automatic continuous optimization: self-tuning optimizer
    – High-performance connectivity: native, direct I/O access to files / XML, appliances, real time, cloud, Hadoop, mainframe
  • Syncsort's Hadoop Value Proposition
  • Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
    – HDFS connectivity: ability to move data in and out of the Hadoop file system
    – Enhanced usability: ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
    – Contribute to the open source community: enhance the Hadoop sort framework for everyone; make it more modular, flexible, and extensible
    – Accelerate Hadoop: address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
    Syncsort Confidential and Proprietary - do not copy or distribute
  • Optimizing Hadoop Deployments
    DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments.
    – Extract: connect to virtually any source (files, XML, RDBMS, mainframe, cloud, appliances)
    – Preprocess & compress: sort, aggregate, join, compress, partition. Pre-process data to cleanse, validate, and partition for better and faster Hadoop processing and significant storage savings
    – Load: load data into the HDFS data nodes up to 6x faster
    [Chart: Elapsed Processing Time; y-axis: Load Time (min); series: HDFS Put, DMExpress]
  • DMExpress – HDFS Connectivity
    Load HDFS
    – Partition the output for parallel loading
    – Makes full use of network bandwidth with reduced elapsed time
    – Hadoop/DMExpress can process wildcard input files from HDFS
    Extract HDFS
    – DMExpress can read wildcard inputs in parallel
    Distributions supported
    – Cloudera CDH3u3
    – Hortonworks Data Platform 1.0.7
    – Greenplum HD 1.1
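The "partition the output for parallel loading" idea on this slide can be illustrated with a minimal Python sketch. This is not how DMExpress itself works internally (the slides do not say); it only assumes that records are comma-separated with the key in the first field, that partition files are named `part-NNNNN`, and that each partition would be pushed with the standard `hdfs dfs -put` command. The function names and directory paths are hypothetical.

```python
import zlib
from typing import Iterable, List


def partition_records(records: Iterable[str], num_partitions: int) -> List[List[str]]:
    """Assign each record to a partition using a stable hash of its key.

    Assumption: the key is the text before the first comma.
    """
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for record in records:
        key = record.split(",", 1)[0]
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


def put_commands(num_partitions: int, local_dir: str, hdfs_dir: str) -> List[List[str]]:
    """Build one `hdfs dfs -put` command per partition file.

    The commands are only constructed here, not executed; a loader could
    run them concurrently so that several streams share the network.
    """
    return [
        ["hdfs", "dfs", "-put",
         f"{local_dir}/part-{i:05d}", f"{hdfs_dir}/part-{i:05d}"]
        for i in range(num_partitions)
    ]


if __name__ == "__main__":
    sample = [f"user{i % 50},{i}" for i in range(1000)]
    parts = partition_records(sample, 20)
    print([len(p) for p in parts])
    print(put_commands(2, "/tmp/out", "/data/in"))
```

Hashing on the key keeps all records for one key in the same partition, which is what lets downstream processing treat each partition independently.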
  • DMExpress Accelerates Loading HDFS (10 GB to 100 GB)
    HDFS load using DMExpress: 3x to 6x faster
    HDFS Load
    – 20 partitions
    – Uncompressed input file size from 10 GB to 100 GB
    Cluster Specifications
    – Size: 10+1+1 nodes
    – Hadoop distribution: CDH4
    – HDFS block size: 256 MB
    Hardware Specifications (Per Node)
    – Red Hat EL 5.8
    – 2 × Intel Xeon X5670
    – 6 disks/node
    – Write: 650 MB/s
    – Memory: 94 GB
  • DMExpress Accelerates Loading HDFS (100 GB to 2100 GB)
    HDFS load using DMExpress: 6x faster
    HDFS Load
    – 20 partitions
    – Uncompressed input file size from 100 GB to 2100 GB
    Cluster Specifications
    – Size: 10+1+1 nodes
    – Hadoop distribution: CDH4
    – HDFS block size: 256 MB
    Hardware Specifications (Per Node)
    – Red Hat EL 5.8
    – 2 × Intel Xeon X5670
    – 6 disks/node
    – Write: 650 MB/s
    – Memory: 94 GB
  • Enabling Storage Savings and Accelerating Performance with DMExpress
    DMExpress is enabling comScore to:
    – Load data faster into HDFS
    – Store twice as much data on the cluster
    – Improve overall performance by pre-sorting, cleansing, and partitioning
    – Achieve a higher rate of parallelism
    – Realize up to 75 TB of data storage savings a month
    [Diagram: 32B records/day. Load files → DMExpress (cleanse, sort, compress, partition; load to HDFS) → Hadoop nodes (post-processing & analysis)]
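One plausible reason pre-sorting goes hand in hand with storage savings is that sorted data tends to place similar records next to each other, which general-purpose compressors exploit. The sketch below illustrates that effect on synthetic data only; the record layout, site names, and sizes are invented for the demonstration and say nothing about comScore's actual data or about the compression DMExpress applies.

```python
import gzip
import random
from typing import List


def make_records(count: int, seed: int = 0) -> List[str]:
    """Generate synthetic 'id,site' lines (hypothetical layout)."""
    rng = random.Random(seed)
    sites = ["news.example", "shop.example", "video.example", "mail.example"]
    return [f"{rng.randrange(1000):04d},{rng.choice(sites)}\n" for _ in range(count)]


def gzip_size(lines: List[str]) -> int:
    """Return the size in bytes of the gzip-compressed concatenation of lines."""
    return len(gzip.compress("".join(lines).encode("utf-8")))


records = make_records(20000)
unsorted_size = gzip_size(records)          # records in arrival order
sorted_size = gzip_size(sorted(records))    # same records, sorted first

print(f"gzip size, arrival order: {unsorted_size} bytes")
print(f"gzip size, sorted:        {sorted_size} bytes")
```

Both inputs contain exactly the same records; only their order differs, so any difference in compressed size comes from the sort.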
  • Michael Brown, Chief Scientist, comScore
  • DMExpress Hadoop Integration
    Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
    – Allow external sort to be plugged in
    – Improve developer productivity
      • Develop MapReduce jobs via the DMExpress GUI: aggregations, cleansing/filtering, reformatting, etc.
    – Seamlessly accelerate MapReduce performance
      • Replace Map output sorter
      • Replace Reduce input sorter
    https://issues.apache.org/jira/browse/MAPREDUCE-2454
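The two plug-in points named on this slide (the map output sorter and the reduce input sorter) can be pictured with a toy, single-process model. The sketch below is not the Hadoop API and not the MAPREDUCE-2454 patch; `run_job`, `default_sorter`, and `external_sorter` are hypothetical names used only to show a job whose sort steps are passed in as parameters.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Iterable, List, Tuple

KV = Tuple[str, int]
Sorter = Callable[[List[KV]], List[KV]]


def default_sorter(pairs: List[KV]) -> List[KV]:
    """Stand-in for the framework's built-in sort: order pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def external_sorter(pairs: List[KV]) -> List[KV]:
    """Stand-in for a plugged-in sort; any implementation that returns
    the pairs ordered by key can be swapped in here."""
    result = list(pairs)
    result.sort(key=itemgetter(0))
    return result


def run_job(lines: Iterable[str],
            mapper: Callable[[str], List[KV]],
            reducer: Callable[[str, List[int]], KV],
            map_sorter: Sorter = default_sorter,
            reduce_sorter: Sorter = default_sorter) -> List[KV]:
    """Toy map -> sort -> reduce pipeline with two replaceable sort steps."""
    map_output = [kv for line in lines for kv in mapper(line)]
    map_output = map_sorter(map_output)        # plug-in point 1: map output sorter
    reduce_input = reduce_sorter(map_output)   # plug-in point 2: reduce input sorter
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(reduce_input, key=itemgetter(0))]


def word_mapper(line: str) -> List[KV]:
    return [(word, 1) for word in line.split()]


def sum_reducer(key: str, values: List[int]) -> KV:
    return (key, sum(values))


if __name__ == "__main__":
    print(run_job(["b a", "a"], word_mapper, sum_reducer,
                  map_sorter=external_sorter, reduce_sorter=external_sorter))
```

The mapper and reducer are unchanged when the sorter is swapped, which is the sense in which a pluggable sort can accelerate a job "seamlessly".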
  • Accelerate Development & Remove Barriers to Adoption
    Use DMExpress to accelerate development and optimize MapReduce jobs.
    MapReduce Development:
    ✗ Lots of manual coding: MapReduce, Pig, Java
    ✗ Limited skills supply
    ✗ Heavy learning curve
    DMExpress Hadoop Edition:
    ✓ No coding required
    ✓ Leverages the same skills most IT organizations already have
    ✓ New resources can be trained in just 3 days
  • Native MapReduce DMExpress Execution
    DMExpress Hadoop does not generate code (e.g., Java, Pig, Python)
    DMExpress Hadoop runs natively on each data node of the cluster
    – DMExpress is installed on each data node
    – Same benefits as high-performance ETL
    Issues with code generation
    – Requires re-compilation with every change
    – May still require MapReduce skills
    – Ongoing issues with the efficiency of generated code
    [Diagram: Hadoop cluster with a DMX instance on each data node]
  • DMExpress Hadoop Edition Provides Significant Performance Improvements
    Result: almost 2x faster than Java; over 2x faster than Pig
    TPC-H Benchmark
    – Filter & aggregation
    – GZIP compression
    – Uncompressed input file size from 100 GB to 2.4 TB
    Cluster Specifications
    – Size: 10+1+1 nodes
    – Hadoop distribution: CDH3U2
    – HDFS block size: 256 MB
    Hardware Specifications (Per Node)
    – Red Hat EL 5.8
    – 2 × Intel Xeon X5670
    – 6 disks/node
    – Read: 870 MB/s, Write: 660 MB/s
    – Memory: 94 GB
    [Chart: TPC-H Aggregation; x-axis: File Size (GB); y-axis: Elapsed Time (sec); series: Java, Pig, DMExpress]
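For readers unfamiliar with the workload type, a "filter & aggregation" job keeps only the rows that satisfy a predicate and then sums a measure per group. The sketch below shows that shape in plain Python. The slide does not say which TPC-H query or columns were used, so the three-field row layout (flag, quantity, ship date), the cutoff date, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Row = Tuple[str, int, date]  # (return flag, quantity, ship date); hypothetical layout


def filter_and_aggregate(rows: Iterable[Row], cutoff: date) -> Dict[str, int]:
    """Keep rows shipped on or before `cutoff`, then sum quantity per flag."""
    totals: Dict[str, int] = defaultdict(int)
    for flag, quantity, ship_date in rows:
        if ship_date <= cutoff:          # filter step
            totals[flag] += quantity     # aggregation step
    return dict(totals)


sample_rows = [
    ("A", 17, date(1998, 8, 1)),
    ("N", 36, date(1998, 9, 15)),
    ("A", 8, date(1998, 7, 20)),
    ("R", 28, date(1998, 12, 1)),   # after the cutoff, so filtered out
]

print(filter_and_aggregate(sample_rows, date(1998, 9, 30)))
```

In a MapReduce setting the filter would typically run in the map phase and the per-group sum in the reduce phase, with the sort between them grouping rows by key.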
  • Conclusion: Syncsort Value Proposition on Hadoop
  • DMExpress Hadoop Edition Benefits
    High-performance HDFS load and extract
    – DMExpress partitioning allows taking advantage of full network bandwidth
    – High-performance parallel load from HDFS to GP DB
    Integration with a diverse set of sources
    – Files, DBMS, mainframe
    Ease of development (GUI vs. Java/Pig)
    High-performance ETL operations (MapReduce)
    – Aggregation, sort, filter, copy, reformatting, join, merge
    Seamless high-performance sort
  • Thank you