
Solving Performance Problems on Hadoop

My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.

Moving analytic workloads into production: specific technical challenges and best practices for engineering SQL-in-Hadoop solutions. The talk highlights the next-generation engineering approaches behind the "secret sauce" we have implemented in the Actian VectorH database.

Published in: Data & Analytics

  1. Solving performance problems on Hadoop: moving analytic workloads into production. Tyler Mitchell, Sr. Software Engineer, Actian Center of Excellence
  2. Topics: how we got (stuck) here; performance best practices; sample business cases; benchmarking results
  3. Actian’s Lineage: Ingres (1970s); Versant (1988); ParAccel (2006); Pervasive (1982); Vectorwise (2003)
  4. Actian at a Glance: 10,000+ customers; 400+ employees; 8 countries, 7 US cities; HQ Palo Alto; 3 businesses (Data Management, Data Integration, Big Data Analytics); sectors: Banking, Insurance, Telecom and Media
  5. How We Got (Stuck) Here
  6. Accidental Hadoop Tourist – Brief History (diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics; Query & Analyze; Solutions; Problem Solved)
  7. Accidental Hadoop Tourist – Brief History (diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics; Solutions; ??????)
  8. Accidental Hadoop Tourist – Brief History (diagram labels: Data; Business; Data Capture; Data Management & Integration; Analytics; ???; Solutions; ???)
  9. Modern, best-in-class analytic database technology provides: measurable business impact (monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business); the ability to make data-driven business decisions using a massively scalable platform; a decisive reduction in the cost of high-performance analytics at scale; performance that can meet all SLAs; full leverage of existing SQL skills while deploying a modern analytic infrastructure. (Diagram labels: Grow Revenue; Reduce Cost; Mitigate Risk; Create New Business; Business; Solution; Architecture; Challenges)
  10. Wide Range of Use Cases: Financial Services (advanced credit risk analytics across billions of data points); Internet-Scale Application (predictive analytics across hundreds of millions of customers); Media (data science and discovery across trillions of IoT events); Dept of Defense (cyber-security: network intrusion models every second); Credit Card Processing (fraud detection every millisecond)
  11. Performance Best Practices
  12. 3 Essential Big Data Concepts: (0) Take nothing for granted; (1) Partitioning vs data skew; (2) Data types matter; (3) Maximize memory / minimize bottlenecks; (4) Take nothing for granted
  13. 6 Game Changing Database Innovations
  14. 6 Game Changing Database Innovations: (1) Use the CPU! – Vector Processing; (2) Minimize bottlenecks – Exploiting Chip Cache; (3) Got columnar?; (4) Smarter compression; (5) Smarter indexing; (6) Multi-core matters
  15. Actian VectorH Innovations
  16. Big Data Business Use Cases
  17. Customer 360: Understanding Experience, Driving Revenue (Telecom). Challenge: a vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, web server, telephone and network usage. Queries took minutes or hours, and sometimes never returned at all. Critical business analysis on a consolidated customer 360 data lake was grinding to a halt. The ability to gain deeper market insights, visualization, and the desired data management and operational optimization was at risk.
  18. Customer 360: Initial Architecture. Development System: 300+ node cluster; Hive access; SQL-based BI / Data Science; data pre-processed because performance was unacceptable; snapshot views taking days to return
  19. Customer 360: Technical Improvements. Production Prototype: 30-node cluster (10% of the Hive cluster); Actian Vector on Hadoop solution; SQL-based BI / Data Science; no materialized view building required; join on demand faster than aggregate tables in Hive; reduced storage requirements; 91 TB (two years of data), 1,100 columns when joined
  20. Customer 360: Understanding Experience, Driving Revenue. Results: Customer 360 across prior data silos; leveraged for customer retention strategies; predict and take proactive, tailored responses; enables next-gen data-driven troubleshooting, impact analysis and root cause analysis. Impact: accelerated operations intelligence; improved customer experience; reduced customer churn
  21. Financial Risk: Upgrading Legacy to Meet SLA. Challenge: the legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk. In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
  22. Financial Risk: Upgrading Legacy to Meet SLA. Legacy System: single-server architecture, MS SSAS, Oracle, ~30 applications; pre-processing of desired measures exploding data volumes; cube and analysis engines being maxed out as they exceed the 1.5 TB range; unable to scale to the desired range of > 200 GB/day of new data; Impala attempt failed; highly invested in apps built on Analysis Services
  23. Financial Risk: Upgrading Legacy to Meet SLA. New Possibilities: clustered solution on Hadoop, 5 and 10 nodes; no pre-processing cubes, SSAS partly kept; tested solutions 1 TB -> 20 TB at a time; produced interactive queries across large datasets; focused query results in 2 s or less; processing all data in the database in 6 s to 80 s; 2x nodes ~ 200% speed improvement
  24. Financial Risk: Upgrading Legacy to Meet SLA. Results: increased data analyzed by 100X (2–200B rows / 1–20 TB); report run in 28 seconds vs. 3 hours. Use of application for: intra-day reporting (surveillance); end-of-day reporting (compliance); overnight float investment options; annual CCAR analysis. (Chart labels: Actual; Goal)
  25. Delivering the Results With Better Engineering
  26. Technical Benchmarks
  27. Technical Benchmarks – Single Machine
  28. Technical Benchmarks – Single Machine
  29. Technical Benchmarks: VectorH - SQL on Hadoop. TPC-H SF1000 (*): VectorH vs other platforms, faster by how much? Tuned platforms, identical hardware (**). (*) Not an official TPC result. (**) 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0
  30. Actian VectorH Delivers More Efficient File Format: better compression & functionality. Vector advantages: skip blocks via MinMax indexes; sophisticated query processing; efficient block format, esp. 64-bit int
  31. Summary: Conscientious data handling & next-gen engineering take SQL in Hadoop to new levels. All Hadoop users can move from development into production while delivering compelling business results.
  32. Delivering the Results With Better Engineering: VectorH v5 – Spark integration, external table support, and more
  33. SIGMOD 2016 Paper
  34. Thank you! tyler.mitchell@actian.com - @1tylermitchell. Blogs at Actian.com - MakeDataUseful.com. Visit us in booth 503
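
Slide 30 lists "skip blocks via MinMax indexes" as a Vector advantage. The sketch below illustrates the general idea only; it is not VectorH code. It assumes each storage block records the minimum and maximum value of a column, so a scan can skip any block whose range cannot satisfy the predicate. The names `Block`, `build_blocks`, and `scan_greater_than`, and the block size of 10, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A hypothetical storage block: column values plus a min/max summary."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of blocks skipped via min/max."""
    matches: List[int] = []
    skipped = 0
    for block in blocks:
        # If even the largest value in the block fails the predicate,
        # no row in the block can match, so the block is never read.
        if block.max_value <= threshold:
            skipped += 1
            continue
        matches.extend(v for v in block.values if v > threshold)
    return matches, skipped


# Illustrative data: a sorted column holding 1..100, stored in blocks of 10.
column = list(range(1, 101))
blocks = build_blocks(column, block_size=10)
matches, skipped = scan_greater_than(blocks, threshold=85)
print(len(matches), skipped)  # prints "15 8"
```

Skipping of this kind pays off most when data is sorted or clustered on the filtered column. On randomly ordered data, nearly every block's range spans the threshold and few blocks can be skipped.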
