
The Fast Path to Building Operational Applications with Spark


Nikita Shamgunov at Spark Summit East 2017



  1. 1. The Fast Path to Building Operational Applications with Spark. Nikita Shamgunov, CTO and Co-founder of MemSQL. Spark Summit East | Boston | 9 February 2017
  2. 2. About Me Nikita Shamgunov Co-founder and Chief Technology Officer, MemSQL
  3. 3. An Insider’s View at Facebook ▪ Every piece of technology is scalable ▪ Analyzing data from hundreds of thousands of machines ▪ Delivering immense value in real time • Real-time code deployment • Detecting anomalies • A/B testing results ▪ Fundamentally making the business faster by providing data at your fingertips
  4. 4. Imagine scaling a database on industry standard hardware. Need 2x the performance? Add 2x the nodes.
  5. 5. Today in My Talk ▪ About MemSQL ▪ Using the MemSQL Spark Connector ▪ Use Cases and Case Studies ▪ Entity Resolution
  6. 6. What is MemSQL?
  7. 7. MemSQL - Hybrid Cloud Data Warehouse ▪ Scalable and elastic • Petabyte scale • High concurrency • System of record ▪ Real-time • Operational ▪ Compatible • ETL • Business intelligence • Kafka • Spark ▪ Deployment • Managed service in the cloud • On-premises ▪ Community Edition • Unlimited scale • Limited high availability and security features
  8. 8. Product or Services Scores for Operational Data Warehouse (Critical Capabilities for Data Warehouse and Data Management Solutions for Analytics, Gartner, July 2016)
  9. 9. Keeping Pace: on-demand economy, real-time data, predictive analytics
  10. 10. Understanding MemSQL and Spark
  11. 11. Easy Deployment of Real-Time Data Pipelines ▪ Kafka (or Amazon Kinesis): high-throughput distributed messaging system; publish and subscribe to Kafka “topics” ▪ Spark: in-memory execution engine; high-level operators for procedural and programmatic analytics ▪ MemSQL: hybrid cloud data warehouse; full transactions and complete durability
  12. 12. Use Spark and Operational Databases Together ▪ Interface: Spark is programmatic, operational databases are declarative ▪ Execution environment: Spark provides a job scheduler, operational databases provide a SQL engine and query optimizer ▪ Persistent storage: Spark uses another system, operational databases have it built in (see the sketch below)
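To make the programmatic-versus-declarative contrast concrete, here is a minimal Spark sketch showing the same aggregation written both ways; the events path and column names are made up for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("programmatic-vs-declarative").getOrCreate()

    // Hypothetical events dataset with user_id and amount columns.
    val events = spark.read.parquet("/data/events")

    // Programmatic: compose the plan step by step with the DataFrame API.
    val programmatic = events
      .filter(col("amount") > 100)
      .groupBy(col("user_id"))
      .agg(sum("amount").as("total"))

    // Declarative: hand the whole statement to the SQL engine and let the
    // optimizer choose the execution plan.
    events.createOrReplaceTempView("events")
    val declarative = spark.sql(
      "SELECT user_id, SUM(amount) AS total FROM events WHERE amount > 100 GROUP BY user_id")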
  13. 13. MemSQL Spark 2 Connector
  14. 14. MemSQL Spark Connector Architecture: Spark RDDs on the Spark cluster map to MemSQL table(s) on the MemSQL cluster, with cluster-wide parallelization and bi-directional data movement.
  15. 15. MemSQL and Spark Use Cases ▪ Operationalize models built in Spark ▪ Stream and event processing ▪ Extend MemSQL analytics ▪ Live dashboards and automated reports
  16. 16. Operationalize Models Built in Spark (diagram): enterprise data flows into Spark for model creation, models and result sets are persisted in the MemSQL cluster, and applications consume the results.
  17. 17. Stream and Event Processing (diagram): real-time streaming data from the enterprise is transformed and landed in the MemSQL cluster in a persistent, queryable format for consumption. A sketch of this pattern follows.
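A minimal sketch of that pattern, assuming a Kafka topic named events and a MemSQL table events_clean; since MemSQL speaks the MySQL wire protocol, plain JDBC is used for the write, and all names and credentials here are illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("stream-to-memsql").getOrCreate()

    // Read raw events from Kafka (needs the spark-sql-kafka package on the classpath).
    val raw = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "events")
      .load()

    // Transform: Kafka values arrive as bytes, so cast and pull out fields.
    val parsed = raw
      .selectExpr("CAST(value AS STRING) AS json")
      .select(
        get_json_object(col("json"), "$.user_id").as("user_id"),
        get_json_object(col("json"), "$.amount").cast("double").as("amount"))

    // Persist into MemSQL in a queryable format over the MySQL protocol.
    parsed.write
      .format("jdbc")
      .option("url", "jdbc:mysql://memsql-master:3306/analytics")
      .option("dbtable", "events_clean")
      .option("user", "root")
      .option("password", "")
      .mode("append")
      .save()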
  18. 18. Extend MemSQL Analytics (diagram): applications and data streams feed the MemSQL cluster; a real-time replica cluster gives interactive analytics and ML access to live production data.
  19. 19. Live Dashboards and Automated Reports (diagram): live dashboards and custom reporting run SQL transactions and analytics against the MemSQL cluster for access to live production data.
  20. 20. MemSQL Spark Connector via Spark Packages The memsql-spark-connector is now available via Spark Packages: http://spark-packages.org/ https://spark-packages.org/package/memsql/memsql-spark-connector You can use it with any Spark command: > $SPARK_HOME/bin/spark-shell --packages com.memsql:memsql-connector_2.11:2.0.1 Also available on Maven http://search.maven.org/#artifactdetails%7Ccom.memsql%7Cmemsql-connector_2.11%7C2.0.1%7Cjar And the Github repository https://github.com/memsql/memsql-spark-connector
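Once the package is on the classpath, a read and a write through the connector look roughly like the sketch below. The spark.memsql.* configuration keys, the com.memsql.spark.connector data source name, and the “database.table” save path follow my reading of the connector’s README for the 2.0.x line; the database and table names are made up, so verify the details against the GitHub repository.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Connection settings go into the Spark config (key names per the connector docs).
    val spark = SparkSession.builder()
      .appName("memsql-connector-example")
      .config("spark.memsql.host", "memsql-master")
      .config("spark.memsql.port", "3306")
      .config("spark.memsql.user", "root")
      .config("spark.memsql.password", "")
      .config("spark.memsql.defaultDatabase", "analytics")
      .getOrCreate()

    // Read: push a SQL query down to MemSQL and get the result back as a DataFrame.
    val totals = spark.read
      .format("com.memsql.spark.connector")
      .options(Map("query" -> "SELECT user_id, SUM(amount) AS total FROM events_clean GROUP BY user_id"))
      .load()

    // Write: save the DataFrame back into a MemSQL table ("database.table" target).
    totals.write
      .format("com.memsql.spark.connector")
      .mode(SaveMode.Append)
      .save("analytics.user_totals")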
  21. 21. Customer Spark Case Studies
  22. 22. Reducing delay in “freshness of data” from two hours to 10 minutes. https://www.enterprisetech.com/2016/12/09/managing-30b-bid-requests/
  23. 23. Technical benefits ▪ 10x faster data refresh, from hours to minutes ▪ Run ad-hoc queries on log-level data within seconds (Diagram: the Manage real-time architecture, with real-time inputs feeding real-time analytics.)
  24. 24. Goldman Sachs at Kafka Summit, April 2016: Real-Time Analytics Visualized with Kafka + Spark + MemSQL + ZoomData. http://www.confluent.io/kafka-summit-2016-users-real-time-analytics-visualized-with-kafka
  25. 25. Entity Resolution at Scale
  26. 26. Problem Statement Employees have many opportunities to take advantage of their insider knowledge and position of trust within a company. This includes: ▪ Preferential treatment for family or friends ▪ Fraud committed under someone else’s name In many cases, the people an insider proxies their activities through are those in close proximity to them. MemSQL can quickly process the massive volume of calculations needed to identify these relationships and iterate on new algorithms.
  27. 27. Problem Size: a target group of 100,000 compared against a population of 50 million = 5 trillion comparisons. Distributed, in-memory, massively parallel processing parallelizes the filters, projections, and entity resolution, cutting the search space from 5 trillion to 50 million comparisons.
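The arithmetic behind those figures, spelled out (the 100,000x reduction factor is implied by the slide’s numbers):

    // 100,000 targets compared against a population of 50 million:
    val naiveComparisons = 100000L * 50000000L              // 5,000,000,000,000 = 5 trillion
    // Pushing filters (geospatial proximity, equal last names) into the database
    // prunes the candidate pairs before any expensive comparison runs:
    val filteredComparisons = 50000000L                     // 50 million pairs left to score
    val reduction = naiveComparisons / filteredComparisons  // 100,000x fewer comparisons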
  28. 28. Examples for Demo ▪ Example 1: MemSQL applies a geospatial filter (within 50 meters); Duke (Spark) runs Levenshtein, SoundEx, and Metaphone comparisons on email and name; the results rank probabilities, relationships, and similar entities. ▪ Example 2: MemSQL applies an index filter (last names are equal); Duke (Spark) runs the same comparisons; the results are ranked the same way.
  29. 29. Scalability: a cluster of 8 x c4.8xlarge instances (288 cores) finishes in about 3 minutes, and runtime scales linearly with the number of cores. Want speed? Add cores!
  30. 30. Cluster Size: 8 machines, c4.8xlarge (36 cores, 60 GB RAM each) • 2 leaf nodes per machine, each with 9 partitions • that gives ~2 cores per partition across the cluster: one core runs at 100% CPU during the computation, and the other handles Spark + Duke + misc. The partition arithmetic is spelled out below.
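Making that partition math explicit, using only the numbers on the slide:

    val machines = 8
    val leavesPerMachine = 2
    val partitionsPerLeaf = 9
    val coresPerMachine = 36                                               // c4.8xlarge

    val totalPartitions = machines * leavesPerMachine * partitionsPerLeaf  // 144 partitions
    val totalCores = machines * coresPerMachine                            // 288 cores
    val coresPerPartition = totalCores / totalPartitions                   // ~2 cores per partition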
  31. 31. Conclusion ▪ Speed in covering a massive search space • In-memory (on commodity hardware) • Parallelization ▪ Scales linearly ▪ Huge value in running all of this natively in MemSQL
  32. 32. Entity Resolution: How does MemSQL do it? ▪ Push down the in-memory proximity filter to each of the leaves ▪ Leverage indexes ▪ Stream results in parallel to Duke (a sketch of the pushdown follows)
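A sketch of what that pushdown could look like from Spark, reusing the SparkSession from the connector example above. The people table, its columns, and the query shape are hypothetical; GEOGRAPHY_WITHIN_DISTANCE is MemSQL’s geospatial predicate, and sending the whole query through the connector’s query option lets each leaf filter its own partitions before results stream back in parallel.

    // Candidate pairs within 50 meters of each other; only these survive to be scored.
    val proximityQuery = """
      SELECT a.id AS left_id, b.id AS right_id,
             a.first_name, a.last_name, a.email,
             b.first_name AS b_first_name, b.last_name AS b_last_name, b.email AS b_email
      FROM people a
      JOIN people b
        ON a.id < b.id
       AND GEOGRAPHY_WITHIN_DISTANCE(a.location, b.location, 50)
    """

    val candidatePairs = spark.read
      .format("com.memsql.spark.connector")
      .options(Map("query" -> proximityQuery))
      .load()

    // candidatePairs now holds the pre-filtered pairs handed to Duke for scoring,
    // instead of all 5 trillion combinations.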
  33. 33. Duke Entity Resolution ▪ Uses Metaphone, SoundEx, and Levenshtein algorithms to compare first name, last name, and email ▪ Duke supports many more comparators and makes it easy to create new ones ▪ With a training dataset, Duke can use a genetic algorithm to optimize comparator weights ▪ https://github.com/larsga/Duke (an illustrative comparator sketch follows)
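To give a feel for what one of those comparators computes, here is a plain Scala Levenshtein distance turned into a 0-to-1 similarity score. This is an illustration of the idea only, not Duke’s implementation; Duke ships its own comparator classes and combines their scores into match probabilities.

    // Classic dynamic-programming edit distance: the number of single-character
    // insertions, deletions, or substitutions needed to turn s into t.
    def levenshtein(s: String, t: String): Int = {
      val dist = Array.tabulate(s.length + 1, t.length + 1) { (i, j) =>
        if (i == 0) j else if (j == 0) i else 0
      }
      for (i <- 1 to s.length; j <- 1 to t.length) {
        val cost = if (s(i - 1) == t(j - 1)) 0 else 1
        dist(i)(j) = math.min(
          math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
          dist(i - 1)(j - 1) + cost)
      }
      dist(s.length)(t.length)
    }

    // Normalize to a similarity in [0, 1], the kind of per-field score an
    // entity-resolution comparator contributes to an overall match probability.
    def similarity(s: String, t: String): Double =
      if (s.isEmpty && t.isEmpty) 1.0
      else 1.0 - levenshtein(s, t).toDouble / math.max(s.length, t.length)

    similarity("nikita", "nikkita")   // ~0.86: likely the same first name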
  34. 34. Demo
  35. 35. www.memsql.com Thank You
