
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East talk by Myles Collins


Legacy enterprise data warehouse (EDW) architectures, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with the challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for a range of use cases, including IoT predictive maintenance.

Published in: Data & Analytics


  1. HPE Vertica and Sparkitecture Overview, Myles Collins
  2. Vertica Analytics Platform. It’s a database, but: • Fast - Boost performance by 500% or more • Scalable - Handles huge workloads at high speeds • Standard - No need to learn new languages or add complexity (ANSI SQL, ACID) • Cost - Significantly lower cost than legacy platforms
  3. Apache Kafka + Spark + HPE Vertica for both Batch and Streaming Analytics (diagram labels: Data Generation: OLTP/ODS, Logs (Apps, Web, Devices), User tracking, Operational Metrics; Distributed Messaging System: Raw Data Topics (JSON, AVRO), Processed Data Topics; ETL; Stream processing; SQL on Hadoop: Hive, ORC, Parquet; HPE Vertica Analytics Platform; Analytics/Reporting)
  4. Oil and Gas Customer
  5. Foundations of Vertica: • Columnar storage - Speeds query time by reading only necessary data • Compression - Lowers costly I/O to boost overall performance • MPP scale-out - Provides high scalability on clusters with no name node or other single point of failure • Distributed query - Any node can initiate the queries and use other nodes for work. No single point of failure • Projections - Combine high availability with special optimizations for query performance (diagram labels: Memory, CPU, Disk)
  6. Advanced In-database Analytics Allows for: – Standard functionality that performs at scale SQL ‘99 Allows for: – Sessionization – Conversion analysis – Fraud detection – Fast Aggregates (LAP) SQL Extensions Allows for: – Machine learning – Custom data mining – Specialized parsers SDKs – Pattern matching – Event series joins – Time series – Event-based windows – Aggregate – Analytical – Window functions – Graph – Monte Carlo – Statistical – Geospatial Analytics – Java – C++ – R Connection – ODBC/JDBC – Hive – Hadoop – Flex zone Allows for: – Statistical modeling – Cluster analysis – Predictive analytics In-database Analytics – Regression testing – K-means – Statistical modeling – Classification algorithms – PageRank – Text mining – Geospatial
  7. Data Transformation Messaging BI & Visualization ETL R Java Python ODBC, JDBC, OLEDB User-Defined Loads Geospatial Event Series Time Series Text Analytics Pattern Matching Regression User-Defined Functions External tables to analyze in place Integrated with Open Source Innovation Through an Ecosystem-Friendly Architecture SQL Real-Time Machine Learning User-Defined Storage Security C++
  8. Embracing an open source architecture. Spark: • Vertica performs optimized load from Spark • Load data from Vertica to Spark. Hadoop: • Read native formats like ORC and Parquet • Any Hadoop • Run ON the Hadoop cluster or ON Vertica cluster. Kafka: • Share data between applications that support Kafka • Data streaming into Vertica
  9. Vertica Enterprise: Unique Value to expand the data warehouse. Hadoop Data Lake, customer information in Hadoop: CREATE TABLE customer_visits (customer_id bigint, visit_num int) PARTITIONED BY (page_view_dt date) STORED AS ORC; Vertica Big Data Warehouse, customer information in Data Warehouse: SELECT customers.customer_id FROM orders RIGHT OUTER JOIN customers ON orders.customer_id = customers.customer_id GROUP BY customers.customer_id HAVING COUNT(orders.customer_id) = 0; Vertica Engine: Querying data that sits BOTH in the data warehouse and Hadoop is our unique value. Most solutions require that you move the data. ROS § Leveraging Web Logs to gain customer insight § Sensor and IoT data for pre-emptive service § Marketing Programs Tracking § Tracking impact of application updates § Many more uses
  10. We did the leg work: 3 TB of test data; HPE ProLiant DL380 Gen9; five nodes; ROS, Parquet and ORC formats; 99 TPC-DS queries
  11. Other technologies can’t run all the queries. Passing and failing TPC-DS queries (pass/fail out of 99): Vertica Enterprise 99/0; Impala 80/19; Hive on Tez 59/40; Hive on Spark 29/70
  12. Boosting Spark with Vertica • Vertica Enterprise completes the benchmark in 4% of the time of Hive on Spark • Hive on Spark could not complete 70 of the 99 queries at all. Those queries were not compared. Chart: seconds to complete TPC-DS benchmark queries (among universally runnable queries), axis from 0 to 45,000: Hive on Spark about 11 hours; Vertica Enterprise about 24 minutes
  13. HPE Vertica: Architected to embrace ecosystem innovation (diagram labels: BI/visualization, Data transformation, Platform, Advanced analytics, Cloud)
  14. Community Edition - Free download, 1 TB, 3 nodes - Learn More About – and Try! - HPE Vertica
  15. Thank you
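
The "Foundations of Vertica" slide credits columnar storage with reading only the necessary data and compression with lowering I/O. The following is a minimal Python sketch of those two ideas, not Vertica's implementation: the helper names (`to_columns`, `rle_encode`, `rle_decode`) and the sample rows are hypothetical, and run-length encoding is used only as a simple stand-in, since the slides do not say which compression scheme is applied.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def rle_encode(values):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sample data, sorted by region so that equal values sit in runs.
ROWS = [
    {"region": "east", "customer_id": 1, "amount": 10},
    {"region": "east", "customer_id": 2, "amount": 20},
    {"region": "east", "customer_id": 3, "amount": 5},
    {"region": "west", "customer_id": 4, "amount": 40},
    {"region": "west", "customer_id": 5, "amount": 25},
]

if __name__ == "__main__":
    columns = to_columns(ROWS)
    # An aggregate over one column touches only that column's values.
    print(sum(columns["amount"]))
    # A sorted, low-cardinality column collapses into a few runs.
    print(rle_encode(columns["region"]))
```

In a row layout, summing `amount` would mean scanning every field of every row; in the column layout only the `amount` list is read, and the sorted `region` column shrinks from five values to two runs.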
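
The query on the "Unique Value" slide finds customers that have no orders by outer-joining `orders` to `customers` and keeping the groups whose order count is zero. Below is a small sketch of the same pattern against an in-memory SQLite database, which is an assumption made purely so the example is runnable; it is not Vertica. The join is written as an equivalent `LEFT OUTER JOIN` from `customers` because older SQLite releases lack `RIGHT OUTER JOIN`, and the table contents and function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few hypothetical customers and orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,), (4,)])
    # Customers 1 and 3 have orders; customers 2 and 4 have none.
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 3)])
    return conn


def customers_without_orders(conn):
    """Return the ids of customers with no matching row in orders."""
    query = """
        SELECT customers.customer_id
        FROM customers
        LEFT OUTER JOIN orders
            ON orders.customer_id = customers.customer_id
        GROUP BY customers.customer_id
        HAVING COUNT(orders.customer_id) = 0
        ORDER BY customers.customer_id
    """
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    print(customers_without_orders(build_demo_db()))
```

`COUNT(orders.customer_id)` counts only non-NULL values, so a customer whose outer-join row carries NULLs on the `orders` side gets a count of zero and passes the `HAVING` filter.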
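
The benchmark figures on the two TPC-DS slides can be cross-checked with a little arithmetic. This sketch uses only the numbers shown on the slides (99 queries, the pass counts per engine, "about 11 hours" and "about 24 minutes"); since the two durations are rounded on the slide, the computed ratio is approximate, and the function names are illustrative.

```python
TOTAL_QUERIES = 99

# Passing query counts as read from the pass/fail chart.
PASSED = {
    "Vertica Enterprise": 99,
    "Impala": 80,
    "Hive on Tez": 59,
    "Hive on Spark": 29,
}

# Rounded durations from the runtime chart, converted to seconds.
HIVE_ON_SPARK_SECONDS = 11 * 3600  # "about 11 hours"
VERTICA_SECONDS = 24 * 60          # "about 24 minutes"


def failed(engine):
    """Number of the 99 TPC-DS queries the engine could not complete."""
    return TOTAL_QUERIES - PASSED[engine]


def pass_rate(engine):
    """Fraction of the 99 TPC-DS queries the engine completed."""
    return PASSED[engine] / TOTAL_QUERIES


def time_ratio():
    """Vertica runtime as a fraction of the Hive on Spark runtime."""
    return VERTICA_SECONDS / HIVE_ON_SPARK_SECONDS


if __name__ == "__main__":
    for engine in PASSED:
        print(f"{engine}: {PASSED[engine]} passed, {failed(engine)} failed "
              f"({pass_rate(engine):.0%})")
    print(f"Vertica / Hive on Spark runtime: {time_ratio():.1%}")
```

With the rounded durations the ratio comes out a little under 4%, consistent with the slide's "4% of the time" claim, and the failure counts derived from the pass counts match the 19, 40 and 70 shown on the chart.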