Real-Time Analytics with MemSQL and Spark

Learn how Pinterest measures real-time user engagement in this technical demonstration that leverages Spark to enrich streaming data with geolocation.

Published in: Technology

  1. 1. Real-Time Analytics with MemSQL and Spark. Neil Dahlke, Engineer. November 4, 2016
  2. 2. About Me: Neil Dahlke
      • Engineer
      • MemSQL: real-time database for transactions / analytics
      • Formerly Globus: high-performance data transfer for research scientists
      • Past talks: Real-time, Geospatial, Maps
      • Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
  3. 3. WHAT WE ARE SEEING: A WORLD OF CONNECTED MACHINES AND PEOPLE
  4. 4. WHAT WE ARE SEEING: Sensors. Applications. Machines. And us. Generating more data every single day. By 2020, over 20 billion connected things will be in use across a range of industries.
  5. 5. REAL-TIME INPUTS: Sensors, Logs, Events, Streaming, Inserts, Upserts. LIVE OUTPUTS: Queries, Dashboards, Business Intelligence, Applications, Predictive Analytics
  6. 6. WHAT DO REAL-TIME BUSINESSES NEED? FAST DATA INGEST: the volume of data that can be ingested into the database
  7. 7. WHAT DO REAL-TIME BUSINESSES NEED? LOW-LATENCY QUERIES: the time it takes to execute queries and receive results
  8. 8. WHAT DO REAL-TIME BUSINESSES NEED? HIGH CONCURRENCY: the ability to scale simultaneous operations
  9. 9. WHAT DO REAL-TIME BUSINESSES NEED?
      • FAST DATA INGEST: the volume of data that can be ingested into the database
      • LOW-LATENCY QUERIES: the time it takes to execute queries and receive results
      • HIGH CONCURRENCY: the ability to scale simultaneous operations
  10. 10. REAL-TIME INPUTS: Sensors, Logs, Events, Streaming, Inserts, Upserts. LIVE OUTPUTS: Queries, Dashboards, Business Intelligence, Applications, Predictive Analytics
  11. 11. A massively scalable database and ingest solution allowed for massive growth, real-time analytic applications and faster, targeted.
  12. 12. Before
      • Kafka: the component we kept
      • S3: persisted all logs to cold storage for eventual analysis
      • Hadoop: nightly map-reduce jobs
      • Redshift: took a full day to load data from the previous day; overlapping load times caused a data crisis
  13. 13. Why was this bad for their business operations?
      • No real-time access to analytics
      • No SQL interface for analysts and data scientists
      • Massive nightly Hadoop batch jobs (late data)
      • Unfiltered and incomplete data (silos)
      • Expensive
  14. 14. Why was this bad for their data operations?
      • Too slow
      • Not scalable
      • No deduplication (i.e., not exactly-once)
      • Low concurrency
      Requirements: FAST DATA INGEST, LOW-LATENCY QUERIES, HIGH CONCURRENCY
  15. 15. How It Works Now
  16. 16. After
  17. 17. THE PINTEREST REAL-TIME ARCHITECTURE: REAL-TIME ANALYTICS. TECHNICAL BENEFITS
      • Instant accuracy to the latest repin
      • 1 GB/sec totaling 72 TB/day
  18. 18. RESULTS: Accelerated ingest time by 200,000x; 1 GB/sec totaling 72 TB/day
  19. 19. Visualizing The Data
  20. 20.
  21. 21. Visualizing the Data
      • Demo built using Mapbox, WebSockets, and the Tornado web server
      • When an image is repinned, the circles on the globe expand, showing higher-volume areas
      • Reads data directly from MemSQL
  22. 22. DEMO
  23. 23. Questions?
  24. 24. More Info
      • http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/
      • https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/
      • https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming
      • http://blog.memsql.com/pinterest-apache-spark-use-case/
      • https://engineering.pinterest.com/blog/real-time-analytics-pinterest
  25. 25. Resources
      • https://github.com/memsql/memsql-spark-connector
      • http://docs.memsql.com/docs/streamliner-administration
      • http://docs.memsql.com/docs/pipelines-overview
      • https://github.com/memsql/memsql-docker-quickstart
  26. 26. Thank You
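
The slides describe enriching streamed repin events with geolocation and name missing deduplication ("not exactly-once") as a weakness of the earlier pipeline, but they include no code. The following is a minimal sketch of those two ideas in plain Python, not the Pinterest or MemSQL implementation. Everything in it is hypothetical: the `RepinEvent` fields, the `GEO_BY_PREFIX` lookup table, and the in-memory dict standing in for a database table keyed on a unique event ID.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical lookup table: IP prefix -> (latitude, longitude).
# A real pipeline would consult a geolocation database instead.
GEO_BY_PREFIX = {
    "203.0.113.": (37.77, -122.42),
    "198.51.100.": (51.51, -0.13),
}


@dataclass(frozen=True)
class RepinEvent:
    event_id: str  # hypothetical unique key used for deduplication
    ip: str        # hypothetical source address of the repin


def enrich(event):
    """Return coordinates for the event's IP, or None if it is unknown."""
    for prefix, coords in GEO_BY_PREFIX.items():
        if event.ip.startswith(prefix):
            return coords
    return None


def ingest(events, table):
    """Upsert events keyed by event_id, so replayed batches add no duplicates."""
    for event in events:
        table[event.event_id] = (event.ip, enrich(event))


def repins_by_location(table):
    """Count stored repins per coordinate pair, skipping unlocated events."""
    return Counter(coords for _, coords in table.values() if coords is not None)


if __name__ == "__main__":
    table = {}
    batch = [
        RepinEvent("a", "203.0.113.5"),
        RepinEvent("b", "198.51.100.9"),
        RepinEvent("c", "192.0.2.1"),  # no matching prefix, stays unlocated
    ]
    ingest(batch, table)
    ingest(batch, table)  # simulate the same batch being delivered twice
    print(len(table))
    print(repins_by_location(table))
```

Keying the write on a unique event ID makes the ingest idempotent: a redelivered batch overwrites the same rows rather than adding new ones, which is one common way to get effectively exactly-once results from an at-least-once stream.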
