
Modeling the Smart and Connected City of the Future with Kafka and Spark

Eric Frenkiel, CEO & Co-Founder, MemSQL at Strata + Hadoop World Singapore

Published in: Data & Analytics

  1. 1. Modeling the Smart and Connected City of the Future with Kafka and Spark. Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel). Make Data Work, December 1-3, 2015, Singapore.
  2. 2. MemSQL at a Glance: • Enterprise focused • Our mission: make every company a real-time enterprise • Real-time database for transactions and analytics • Founded in 2011, based in San Francisco • Founders are former Facebook and SQL Server database engineers • $50 million in funding to date
  3. 3. What does a Smart City Look Like?
  4. 4. Our Conception
  5. 5. Our Reality
  6. 6. 3.9 billion people live in cities today
  7. 7. By 2050, we’ll add another 2.5 billion people
  8. 8. We need to create sustainable cities
  9. 9. We need to use technology to help us
  10. 10. We don’t live in Tomorrowland
  11. 11. We live here
  12. 12. The good news: the technology of today can build smart cities.
  13. 13. A Smart City Should Have… • City-wide WiFi • City app to report issues • Open-data initiatives to share data with the public • Most importantly, an adaptive IT department
  14. 14. Let’s learn how.
  15. 15. A Model Application: MemCity. Capturing data from 1.4 million households. Total AWS hardware cost: $2.35 per hour.
  16. 16. MemCity Reach: 1.4 million households (approximately the size of Chicago)
  17. 17. Capturing data from 8 devices in each home, every minute. #MemCity
  18. 18. 186,667 transactions per second: Kafka > Spark > MemSQL. #MemCity
  19. 19. 1.4 Million Households; 8 Devices per Household; 186K Events per Second
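
The throughput figure on slides 17 to 19 follows from the household count, the device count, and the one-minute reporting interval. A minimal sketch of that arithmetic is below; the constant and function names are illustrative, not taken from the deck.

```python
# Figures quoted on the MemCity slides.
HOUSEHOLDS = 1_400_000
DEVICES_PER_HOUSEHOLD = 8
REPORT_INTERVAL_SECONDS = 60  # each device reports once per minute


def events_per_second(households: int, devices: int, interval_seconds: int) -> float:
    """Average event rate when every device reports once per interval."""
    return households * devices / interval_seconds


rate = events_per_second(HOUSEHOLDS, DEVICES_PER_HOUSEHOLD, REPORT_INTERVAL_SECONDS)
print(round(rate))  # 186667
```

This is an average rate; it assumes device reports are spread evenly across each minute.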
  20. 20. The “Real-Time Trinity”
  21. 21. Designing the Ideal Real-Time Pipeline: Message Queue > Transformation > Speed/Serving Layer. End-to-end data pipeline in under one second.
  22. 22. Kafka: • A high-throughput distributed messaging system • Publish and subscribe to Kafka “topics” • Centralized data transport for the organization
  23. 23. Spark: • In-memory execution engine • High-level operators for procedural and programmatic analytics • Faster than MapReduce
  24. 24. MemSQL: • In-memory, distributed database • Full transactions and complete durability • Enable real-time, performant applications
  25. 25. Subscribing to Kafka: the event (2015-07-06T16:43:40.33Z, 329280, 23, 60) is serialized to bytes (0111001010101111101111100000001010 111100001110101100000010010010111…) and published to a Kafka topic. Event added to message queue.
  26. 26. Enrich and Transform the Data: Spark polling Kafka for new messages. Deserialization: 0111001010101111101111100000001010 111100001110101100000010010010111… becomes (2015-07-06T16:43:40.33Z, 329280, 23, 60). Enrichment: the tuple becomes (2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60).
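
The deserialization and enrichment steps on slide 26 can be sketched in plain Python (not Spark). The deck does not specify the wire format or where the zip code and device type come from, so the comma-separated UTF-8 payload and the two lookup tables below are assumptions made only for illustration.

```python
from typing import NamedTuple


class RawReading(NamedTuple):
    time: str
    house_id: int
    device_id: int
    watts: int


class EnrichedReading(NamedTuple):
    time: str
    house_id: int
    zip_code: str
    device_id: int
    device_type: str
    watts: int


# Hypothetical reference data; the deck does not say how enrichment is sourced.
HOUSE_ZIP = {329280: "94110"}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(payload: bytes) -> RawReading:
    """Decode an assumed comma-separated UTF-8 payload into a typed tuple."""
    time, house_id, device_id, watts = payload.decode("utf-8").split(",")
    return RawReading(time, int(house_id), int(device_id), int(watts))


def enrich(reading: RawReading) -> EnrichedReading:
    """Add the zip code and device type looked up from the reference tables."""
    return EnrichedReading(
        reading.time,
        reading.house_id,
        HOUSE_ZIP[reading.house_id],
        reading.device_id,
        DEVICE_TYPE[reading.device_id],
        reading.watts,
    )


message = b"2015-07-06T16:43:40.33Z,329280,23,60"
enriched = enrich(deserialize(message))
print(enriched)
```

In a Spark job the same two functions would typically be applied per record with a map over the incoming stream.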
  27. 27. Persist and Prepare for Production: RDD.saveToMemSQL(); INSERT INTO memcity_table ... Columns: time | house_id | zip | device_id | device_type | watts. Sample row: 2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | ‘kitchen_appliance’ | 60.
  28. 28. Go to Production: compress development timelines. SELECT ... FROM memcity_table ...
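
The persist-then-query flow on slides 27 and 28 can be tried locally with a stand-in database. The sketch below uses Python's built-in sqlite3 in place of MemSQL and does not reproduce RDD.saveToMemSQL(); the column types and the aggregate query are assumptions, since the slides elide both statements.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for MemSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE memcity_table (
        time        TEXT,
        house_id    INTEGER,
        zip         TEXT,
        device_id   INTEGER,
        device_type TEXT,
        watts       INTEGER
    )
    """
)

# The single sample row shown on slide 27.
rows = [("2015-07-06T16:43:40.33Z", 329280, "94110", 23, "kitchen_appliance", 60)]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Hypothetical production query: total power draw per zip code and device type.
result = conn.execute(
    "SELECT zip, device_type, SUM(watts) FROM memcity_table GROUP BY zip, device_type"
).fetchall()
print(result)
conn.close()
```

Because the table is plain SQL, an application can query it while the pipeline keeps inserting, which is the point the "Go to Production" slide makes.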
  29. 29. We can use in-memory technology to build interactive applications for cities.
  30. 30. So We Can Optimize… • Urban planning • Efficient power consumption • Efficient transportation • Sustainable energy practices
  31. 31. Creating real-time pipelines should be push-button easy.
  32. 32. MemSQL Streamliner for IoT Applications: • One-click deployment of integrated Apache Spark • Put Spark in the fast lane: GUI pipeline setup, multiple data pipelines, real-time transformation • Eliminates batch ETL • Open source on GitHub
  33. 33. Simple Deployment Process. Diagram: Application.
  34. 34. Step 1: Deploy MemSQL (In-Memory | Distributed | Relational). Diagram: Cluster, Application.
  35. 35. Step 2: Deploy Spark. Diagram: Cluster, Application.
  36. 36. Kafka Connects to Each Node. Diagram: Cluster, Application.
  37. 37. Streamliner Architecture: first of many integrated Apache Spark solutions. Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, Streamliner, Future Solution, Future Machine Learning Solution.
  38. 38. Streamliner ETL Detail: Extract, Transform, Load. Diagram labels: Other Real-Time Data Sources, Application, Apache Spark, Streamliner, Future Solution, Future Machine Learning Solution, Custom, Future Extractor, JSON, Custom, Future Transformer.
  39. 39. Streamliner
  40. 40. Extract
  41. 41. Transform
  42. 42. Load
  43. 43. Extending Analytics with Lambda Architecture. Real-Time Analytics: streaming analytic applications, not Excel reports. • Financial Services • Adtech • eCommerce • IoT • Consumer Internet • Energy • Federal. Lambda Architecture diagram labels: Msg Queue, New Real-Time Processing, Existing Batch Processing.
  44. 44. In-Memory Databases Rise Up: • Multi-TB on commodity hardware • Store the “state of the model” • Easily build applications • Avoid direct disk at all costs
  45. 45. Comprehensive Architecture: Transactions
  46. 46. Comprehensive Architecture: Real Time (Speed/Streaming Layer, Fast Updates, Rowstore); Transactions
  47. 47. Comprehensive Architecture: Real Time (Speed/Streaming Layer, Fast Updates, Rowstore); Analytics; Transactions
  48. 48. Comprehensive Architecture: Real Time (Speed/Streaming Layer, Fast Updates, Rowstore); Historical (Batch Layer, Fast Appends, Columnstore); Analytics; Transactions
  49. 49. Comprehensive Architecture: Real Time (Speed/Streaming Layer, Fast Updates, Rowstore); Historical (Batch Layer, Fast Appends, Columnstore); Analytics; Transactions. Execution engine that spans the data spectrum.
  50. 50. Comprehensive Architecture: Real Time (Speed/Streaming Layer, Fast Updates, Rowstore); Historical (Batch Layer, Fast Appends, Columnstore); Analytics; Transactions
  51. 51. Simplified Lambda Architectures with MemSQL (Layer: Traditional Lambda vs. MemSQL Lambda). Batch: Hadoop vs. MemSQL Column Store. Speed: Storm, Spark vs. Kafka > Spark > MemSQL. Serving: Cassandra, HBase vs. MemSQL.
  52. 52. Lambda Applies to Real-Time Data Pipelines. Diagram labels: Message Queue, Batch Inputs, Database, Transformation, Application.
  53. 53. Kafka, Spark, and MemSQL Make it Simple. Diagram labels: Batch Inputs, Application.
  54. 54. Massive Ingest and Concurrent Analytics (real-time analytics): • Instant accuracy to the latest repin • Build real-time analytic applications • 1 GB/sec totaling 72 TB/day
  55. 55. Using Real-Time for Personalization: • Reach overlap and ad optimization • Over 60,000 queries per second • Millisecond response times. Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Star Schema, MicroStrategy.
  56. 56. Real-time Analytics: 300k events/sec. Reduced latency from 30 minutes to sub-second.
  57. 57. Sample Pipeline: Analyzing Twitter Data in Real Time. Diagram labels: Public API “Garden Hose”, Python, Apache Spark, Spark Streamliner (Extract, Transform, Load), Application.
  58. 58. Install MemSQL and Apache Spark in < 1 min with MemSQL Ops and Streamliner
  59. 59. Run Kafka in Docker Container and Create a New Topic: TWITTER
  60. 60. Fill Out Extract, Transform and Load Details to Set Up Pipeline
  61. 61. Use Python Script to Load Tweets into Kafka Topic and Get Data Flowing
  62. 62. Connect to MemSQL Database and Run SQL Queries Instantly
  63. 63. Run Online Alter Table to Optimize Query Performance
  64. 64. Streamliner: Dynamic Resource Management. Without Streamliner: Pipeline 1 and Pipeline 2 each have their own driver (P1 only, P2 only), and each Spark worker runs executors dedicated to one pipeline (P1 only or P2 only). With Streamliner: one Streamliner driver serves all pipelines, and each Spark worker runs executors that can serve P1 or P2.
  65. 65. Building Real-Time Data Pipelines and Predictive Applications
  66. 66. Adding Real-Time Scoring to Predictive Applications: Scoring Real-Time Data with Predictive Models. Diagram labels: Streamliner Input, User Jar, SAS Generated PMML, Industrial Equipment Sensor Data, S1, S2, S3, P1, P2, P3, Sensor 1, Predictive Model 1.
  68. 68. Get your free copy: memsql.com/oreilly
