Why Hadoop is important to Syncsort
An overview of how Syncsort have improved Hadoop in the open source world, and what they provide on top of the standard distributions.


  • The ability to process and analyze mainframe data with Hadoop can open up a wealth of opportunities by delivering deeper analytics at lower cost. Unfortunately, there are no native Hadoop ETL capabilities for mainframe data. Simply ingesting mainframe data involves a lot of manual effort and coding, plus a combination of mainframe and Hadoop skills that is nearly impossible to find. The Use Case Accelerators for Mainframe Connectivity and Translation combine decades of mainframe expertise with state-of-the-art Hadoop capabilities to provide a painless and seamless way to leverage mainframe data. Read files directly from the mainframe, then parse and transform the data (packed decimal, COMP, EBCDIC/ASCII, multi-format records, and more) without installing any software on the mainframe and without writing any code. SAMPLE EBCDIC data!
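
To make the formats named above concrete, here is a minimal Python sketch of what "translating" mainframe data involves at the byte level. It is an illustration only, not Syncsort's implementation: the function names are hypothetical, the EBCDIC code page is assumed to be 037 (US/Canada), and the packed-decimal layout assumed is the common COMP-3 one (two 4-bit digits per byte, with the sign in the low nibble of the last byte).

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes) -> str:
    """Decode EBCDIC text. Code page 037 is an assumption; real files vary."""
    return raw.decode("cp037")


def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3 style) field.

    Every nibble is one decimal digit, except the low nibble of the
    last byte, which carries the sign (0xD or 0xB means negative).
    `scale` is the number of implied decimal places.
    """
    if not raw:
        raise ValueError("empty packed-decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # last nibble is the sign, not a digit
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid digit nibble in packed-decimal field")
    value = int("".join(str(digit) for digit in nibbles))
    if sign in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


if __name__ == "__main__":
    # Hypothetical sample record: a 5-byte EBCDIC name and a 3-byte amount.
    record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x12, 0x34, 0x5C])
    print(decode_ebcdic(record[:5]), unpack_packed_decimal(record[5:], scale=2))
```

Real mainframe records add further complications (copybook layouts, binary COMP fields, multi-format records) that this sketch does not attempt to cover.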

Why Hadoop is important to Syncsort Presentation Transcript

  • 1. Why was it so important to us to open the MapReduce framework? 12/11/2013. Syncsort Confidential and Proprietary - do not copy or distribute
  • 2. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 3. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 4. Syncsort: Integrating Big Data… Smarter!
    For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data! Our customers are achieving the impossible, every day!
      • 50% of all mainframes run Syncsort
      • 1,500 mainframe customers: most used & trusted 3rd party mainframe software
      • Speed leader for ETL & Sort
      • A history of innovation
      • 25+ issued & pending patents
      • Large global customer base
      • 15,000+ deployments in 68 countries
      • First-to-market, fully integrated approach to Hadoop ETL
    Key Partners (logos in the original slide)
  • 5. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 6. Smart Contributions to Improve Hadoop: Augmenting Critical Batch Processing Capabilities
    | JIRA | Description                              |
    |------|------------------------------------------|
    | 4807 | Allow MapOutputBuffer to be pluggable    |
    | 4808 | Allow Reduce-side merge to be pluggable  |
    | 4809 | Make classes required for 2454 public    |
    | 4812 | Create reduce input merger plug-in       |
    | 4842 | Shuffle race can hang reducer            |
    | 2461 | HDFS file name globbing in libhdfs       |
    | 4482 | Backport of 2454 to MapReduce 1 & 1.2    |
    Plugin shipping on CDH 4.2 and later
  • 7. Opening the MapReduce Framework
    Pipeline (diagram): Mapper -> Output Sorter -> Shuffle -> Input Sorter -> Reducer
      • Output Sorter and Input Sorter: here and here to replace MapReduce native sort
      • Mapper: here to perform functional logic on our engine
      • Reducer: here to perform functional logic on our engine
  • 8. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 9. Syncsort: Just integrating data… faster
      • A simple DI engine, easy to deploy, operate, and administer
      • ETL-like development GUI
      • Auto-tuning
      • Best patented algorithms: Sort, Join, Aggregate, Copy, Merge
      • Fast, fast, faster than any other
      • The more data the better
  • 10. From Data to Big Data (chart)
      • Dimensions, each marked $$$: Variety; Velocity (Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time); Volume
      • Eras: Mainframe, PC, Internet Revolution, Mobile & Social Media Revolution
      • Timeline: 60s, 70s, 80s, 90s, 2000s, 2010s, Next?
  • 11. Smart Architecture: Hadoop Integration… for Real (No Code Generation. No Compiling. No Bolts. No Nuts!)
      • Runs natively within MapReduce
      • Small footprint installs on every node
      • Open source contributions extend capabilities of MapReduce
    Unleash Hadoop’s Potential:
      • Pluggable sort
      • Expanded use cases (i.e. “No sort” option)
      • Vertical scalability
      • Design flexibility (MapMapReduceReduce)
    Diagram labels: Hadoop Cluster; No need to worry about this…
  • 12. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 13. Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe. Because Mainframe Is Big Data Too!
    Connect
      • Read files directly from mainframe
      • No software required on mainframe
      • Already installed on 50% of mainframes
    Translate
      • Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
      • No coding required
    Load & Process
      • Load directly to HDFS
      • Offload batch data processing
      • Find more insights
  • 14. Syncsort DMX-h + Cloudera Manager (diagram)
    Diagram labels: Cloudera Manager; CDH Cluster + ISV software; Support; Integration; Monitoring; Management; Installation; API; Syncsort DMX-h; CDH Nodes
    Caption: DMX-h on every CDH node
  • 15. Agenda: Who are we? What did we do? Why did we do that? With whom did we do it? For which results?
  • 16. Test cases
    Sort Acceleration: Terasort
      • Run terasort with DMX-h and without DMX-h in various configurations to compare performance.
    ETL: Use DMX-h to perform several different ETL jobs and compare against equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
      • File Change Data Capture (CDC)
      • Web Log Aggregation
  • 17. File CDC: DMX-h, Pig, Java. Slide labels: 149 Lines of Code; 70 Lines of Code
  • 18. Web Log Aggregation: DMX-h, Pig, Java. Slide labels: 94 Lines of Code; 48 Lines of Code
  • 19. Cluster Configuration: DMX-h Ran on 763 Nodes!
    Cluster Specs:
      • 763 node cluster
          • 1 node: job tracker
          • 1 node: name node
          • 1 node: secondary name node
          • 760 data and task nodes
    Hadoop cluster configuration changes (from defaults):
      • 128 MB HDFS block size (file.blocksize)
      • 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
      • Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)
    Cluster Node Specs:
      • 12 cores: dual Intel Westmere (hexcore) CPUs, 2.93 GHz, 12 MB cache
      • 48 GB DDR3 RDIMM memory
      • 12 x 2 TB 3.5” Seagate drives, 7200 rpm
      • Disk 0 + Disk 1 are RAID1 (mirrored) for OS
          • 100 MB/sec write
          • 115 MB/sec read
      • 10 single disk JBOD
      • Mellanox ConnectX®-3 VPI NIC (supported data rates 40GbE; 10GbE)
      • RHEL 6.1 64-bit
      • Java 1.6 (jdk.x86_64-2000:1.6.0_29fcs)
  • 20. Sort Acceleration - Terasort
    Use case: TERASORT. ETL or Sort Acceleration: Sort Acceleration. Native/Alternative: Native.
    | Data size (GB) | Native elapsed time | DMX-h elapsed time | Elapsed time improvement | Native memory (GB) | DMX-h physical memory (GB) | Memory improvement | Native CPU time | DMX-h CPU time | CPU improvement | Native MB/sec/node | DMX-h MB/sec/node |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 512 | 0:01:47 | 0:01:45 | 2% | 12,863 | 12,873 | 0% | 114,297 | 62,491 | 45% | 6.5 | 6.6 |
    | 1,024 | 0:02:29 | 0:01:11 | 52% | 14,512 | 14,522 | 0% | 194,896 | 98,972 | 49% | 9.3 | 19.4 |
    | 1,536 | 0:04:02 | 0:01:23 | 66% | 14,684 | 14,694 | 0% | 287,055 | 143,759 | 50% | 8.6 | 25.0 |
    | 4,096 | 0:03:31 | 0:02:29 | 29% | 31,520 | 31,549 | 0% | 927,379 | 380,442 | 59% | 26.2 | 37.0 |
    | 10,242 | 0:08:51 | 0:05:14 | 41% | 47,935 | 47,951 | 0% | 2,835,927 | 1,460,101 | 49% | 26.4 | 44.6 |
    | 20,484 | 0:14:55 | 0:12:28 | 16% | 106,153 | 105,239 | 1% | 6,112,296 | 3,696,727 | 40% | 31.0 | 37.4 |
    | 102,400 | 1:12:12 | 0:51:59 | 28% | 387,262 | 387,211 | 0% | 30,436,624 | 16,589,332 | 45% | 32.3 | 44.9 |
  • 21. File CDC
    Use case: FileCDC. ETL or Sort Acceleration: ETL. Native/Alternative: Pig.
    | Data size (GB) | Pig elapsed time | DMX-h elapsed time | Elapsed time improvement | Pig memory (GB) | DMX-h physical memory (GB) | Memory improvement | Pig CPU time | DMX-h CPU time | CPU improvement | Pig MB/sec/node | DMX-h MB/sec/node |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 148 | 0:05:31 | 0:01:33 | 72% | 79,876 | 79,559 | 0% | 79,876 | 79,559 | 0% | 0.6 | 2.2 |
    | 450 | 0:05:11 | 0:01:58 | 62% | 243,834 | 182,869 | 25% | 243,834 | 182,869 | 25% | 1.9 | 5.3 |
    | 1,515 | 0:07:49 | 0:03:44 | 52% | 845,263 | 557,226 | 34% | 845,263 | 557,226 | 34% | 4.4 | 9.4 |
  • 22. Web Log Aggregation
    Use case: WebLogAggregation (Split Size & fixes). Native/Alternative: Pig.
    | Data size (GB) | Pig elapsed time | DMX-h elapsed time | Elapsed time improvement | Pig memory (GB) | DMX-h physical memory (GB) | Memory improvement | Pig CPU time | DMX-h CPU time | CPU improvement | Pig MB/sec/node | DMX-h MB/sec/node |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | 2,067 | 0:01:12 | 0:00:58 | 19% | 13,499 | 7,813 | 42% | 145,972 | 56,496 | 61% | 40.1 | 49.8 |
    | 4,135 | 0:01:42 | 0:01:23 | 19% | 18,003 | 15,579 | 13% | 300,627 | 152,390 | 49% | 56.1 | 69.6 |
    | 10,240 | 0:05:16 | 0:02:04 | 61% | 40,773 | 39,091 | 4% | 807,473 | 335,537 | 58% | 45.3 | 115.4 |
    | 20,480 | 0:07:54 | 0:06:58 | 12% | 78,654 | 78,128 | 1% | 1,339,453 | 568,107 | 58% | 60.4 | 68.4 |
  • 23. Test Drive DMX-h: Bridge the Gap Between Big Iron & Big Data!
      • Self-contained image
      • Use case accelerators for mainframe, Hadoop and more!
      • Running on CDH
    A Smarter Approach… and Quite Possibly The Only Approach!
    www.syncsort.com/try
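
The derived columns in the benchmark tables on slides 20 to 22 (the improvement percentages and MB/sec/node) can be recomputed from the raw columns. The Python sketch below shows one way to do it. The formulas are not stated in the slides; they are assumptions inferred from the published numbers: improvement is taken as (baseline - DMX-h) / baseline, throughput divides by the 760 data and task nodes from slide 19 rather than all 763, and 1 GB is taken as 1024 MB. The function names are hypothetical.

```python
def parse_hms(text: str) -> int:
    """Convert an elapsed time such as '0:02:29' into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(baseline: float, candidate: float) -> float:
    """Percentage reduction of `candidate` relative to `baseline` (assumed formula)."""
    return 100.0 * (baseline - candidate) / baseline


def mb_per_sec_per_node(data_gb: float, elapsed_s: float, nodes: int = 760) -> float:
    """Per-node throughput, assuming 1 GB = 1024 MB and 760 worker nodes."""
    return data_gb * 1024 / elapsed_s / nodes


if __name__ == "__main__":
    # Terasort row for 1,024 GB from slide 20: native 0:02:29, DMX-h 0:01:11.
    native_s = parse_hms("0:02:29")
    dmxh_s = parse_hms("0:01:11")
    print(f"elapsed improvement: {improvement_pct(native_s, dmxh_s):.0f}%")
    print(f"native MB/sec/node:  {mb_per_sec_per_node(1024, native_s):.1f}")
    print(f"DMX-h MB/sec/node:   {mb_per_sec_per_node(1024, dmxh_s):.1f}")
```

For the 1,024 GB Terasort row these assumptions reproduce the slide's 52%, 9.3 and 19.4, which suggests (but does not prove) that the deck used the same definitions.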