Real-Time Analytics with MemSQL and Spark

Learn how Pinterest measures real-time user engagement in this technical demonstration that leverages Spark to enrich streaming data with geolocation.

Published in: Technology

  • - Distributed in-memory database
    - Built for real-time analytics and transactions
    - Familiar SQL interface
    - Spark integration out of the box
    - Native Kafka ingestion

    What did they want to do?

    - Highly scalable infrastructure that collects, stores, and processes user engagement data in real time
    - Higher-performance event logging
    - Reliable log transport and storage
    - Ability to query real-time data
  • A user clicks Pin or repin.

    The event is pushed to Apache Kafka.

    Storm, Spark, and other custom-built log readers process these events in real time.

    A log persistence service called Secor reliably writes these events to Amazon S3 (zero data loss, overcoming S3's weak eventual consistency model).

    A self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing.

    The in-house tools Singer (logger) and Secor (replicator) asynchronously replicate local logs from app servers to a centralized S3 location, using Kafka for transport.

    Kafka was great for throughput, but the team needed a way to derive value from the data, e.g. by running SQL against these datasets in real time.

    A few days later this data would hit Redshift and become queryable.



  • - It took several days for analytics to reach the data science team (too late for A/B testing and advertising)

    - No SQL interface

    - 5.5 M rows/second for one topic, 1.7 M rows/second for another, with the lowest-throughput topic at 132k rows/second

    - Data needs to be filtered as well as enriched

    - At-least-once semantics
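
The last two requirements above (filter and enrich, at-least-once delivery) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the event fields (`event_id`, `action`, `ip`), the `GEO_BY_PREFIX` lookup table, and the in-memory `store` dictionary that stands in for a database table keyed on event ID. It is not Pinterest's or MemSQL's implementation; it only shows why keying writes on a unique event ID makes a redelivered event harmless.

```python
from typing import Dict, Iterable, Optional

# Hypothetical IP-prefix -> city table; a real pipeline would query a geo-IP database.
GEO_BY_PREFIX: Dict[str, str] = {
    "203.0.113.": "Sydney",
    "198.51.100.": "Chicago",
}


def enrich(event: dict) -> Optional[dict]:
    """Drop anything that is not a repin; attach a city to what remains."""
    if event.get("action") != "repin":
        return None  # filtered out
    ip = event.get("ip", "")
    for prefix, city in GEO_BY_PREFIX.items():
        if ip.startswith(prefix):
            return {**event, "city": city}
    return {**event, "city": "unknown"}


def ingest(events: Iterable[dict], store: Dict[str, dict]) -> None:
    """Write enriched events keyed by event_id.

    Under at-least-once delivery the same event may arrive twice.
    Writing by key overwrites the earlier copy instead of adding a row,
    so duplicates do not inflate counts.
    """
    for event in events:
        enriched = enrich(event)
        if enriched is not None:
            store[enriched["event_id"]] = enriched


stream = [
    {"event_id": "e1", "action": "repin", "ip": "203.0.113.7"},
    {"event_id": "e2", "action": "view", "ip": "198.51.100.9"},
    {"event_id": "e1", "action": "repin", "ip": "203.0.113.7"},  # redelivered
    {"event_id": "e3", "action": "repin", "ip": "192.0.2.1"},
]
store: Dict[str, dict] = {}
ingest(stream, store)
print(sorted(store))        # event IDs that survived filtering and deduplication
print(store["e1"]["city"])  # city attached by the enrichment step
```

In a SQL database the same idea is usually expressed as an upsert on a primary key rather than a dictionary assignment.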
  • Goes both ways
  • Easily repeatable success
    Days to seconds
    Now has a source of record for sharing relevant user engagement data and metrics with their data analysts and with key brands
    Pinterest and their partners can get a better understanding of user behavior and provide more value to the Pinner community
    Cheaper
    The ability to identify (and react to) developing trends as they happen
    Provides insight into how users are engaging with Pins across the globe in real time
    Helps Pinterest become a better recommendation engine
    SQL interface for engineering and data science teams
    Fast ad-hoc execution of SQL queries on real-time events as they arrive
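
The ad-hoc SQL point above can be sketched as follows. This is an illustration under stated assumptions, not the demo's code: Python's built-in `sqlite3` module stands in for MemSQL, and the `repins` table, its columns, and the sample rows are all hypothetical. The point is only that an aggregate query reflects new rows as soon as they are written.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real-time store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repins (event_id TEXT PRIMARY KEY, city TEXT, ts INTEGER)"
)

# Hypothetical enriched events: (event_id, city, timestamp).
rows = [("e1", "Sydney", 100), ("e2", "Chicago", 101), ("e3", "Sydney", 102)]
# INSERT OR REPLACE makes a redelivered event overwrite its earlier copy.
conn.executemany("INSERT OR REPLACE INTO repins VALUES (?, ?, ?)", rows)


def repins_by_city(connection: sqlite3.Connection, since_ts: int) -> list:
    """Ad-hoc aggregate: repin count per city since a given timestamp."""
    cursor = connection.execute(
        "SELECT city, COUNT(*) AS n FROM repins "
        "WHERE ts >= ? GROUP BY city ORDER BY n DESC, city",
        (since_ts,),
    )
    return cursor.fetchall()


print(repins_by_city(conn, 0))  # counts before the new events arrive

# Two more events arrive; the next query sees them immediately.
conn.execute("INSERT OR REPLACE INTO repins VALUES ('e4', 'Chicago', 103)")
conn.execute("INSERT OR REPLACE INTO repins VALUES ('e5', 'Chicago', 104)")
print(repins_by_city(conn, 0))  # counts after
```

The same `SELECT ... GROUP BY` shape is the kind of query an analyst or a dashboard tool would issue against the live table.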
  • Pull up Ops
    Pull up a terminal and create the database
    Deploy Spark
    Create a Streamliner pipeline
    Create a Pipelines pipeline
    Expose the UI
    Ad-Hoc queries, Tableau, and custom reporting
  • Real-Time Analytics with MemSQL and Spark

    1. Real-Time Analytics with MemSQL and Spark. Neil Dahlke, Engineer. November 4, 2016
    2. About Me: Neil Dahlke. Engineer, MemSQL (real-time database for transactions/analytics). Formerly Globus (high-performance data transfer for research scientists). Past talks: Real-time, Geospatial, Maps. Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
    3. WHAT WE ARE SEEING: A WORLD OF CONNECTED MACHINES AND PEOPLE
    4. WHAT WE ARE SEEING: Sensors. Applications. Machines. And us. Generating more data every single day. By 2020, over 20 billion connected things will be in use across a range of industries.
    5. REAL-TIME INPUTS LIVE OUTPUTS Sensors Logs Events Streaming Inserts Upserts Queries Dashboards Business Intelligence Applications Predict Analytics
    6. WHAT DO REAL TIME BUSINESSES NEED? FAST DATA INGEST: The volume of data that can be ingested into the database
    7. WHAT DO REAL TIME BUSINESSES NEED? LOW LATENCY QUERIES: The time it takes to execute queries and receive results
    8. WHAT DO REAL TIME BUSINESSES NEED? HIGH CONCURRENCY: The ability to scale simultaneous operations
    9. WHAT DO REAL TIME BUSINESSES NEED? FAST DATA INGEST: The volume of data that can be ingested into the database. LOW LATENCY QUERIES: The time it takes to execute queries and receive results. HIGH CONCURRENCY: The ability to scale simultaneous operations
    10. REAL-TIME INPUTS LIVE OUTPUTS Sensors Logs Events Streaming Inserts Upserts Queries Dashboards Business Intelligence Applications Predict Analytics
    11. A massively scalable database and ingest solution allowed for massive growth, real-time analytic applications and faster, targeted. +
    12. Before: Kafka (component we kept). S3 (persisted all logs to cold storage for eventual analysis). Hadoop (nightly map-reduce jobs). Redshift (took a full day to load data from the previous day; reaching overlap of times caused data crisis).
    13. Why was this bad for their business operations? No real-time access to analytics. No SQL interface for analysts and data scientists. Massive nightly Hadoop batch jobs (late data). Unfiltered and incomplete data (silos). Expensive.
    14. Why was this bad for their data operations? Too slow. Not scalable. No deduplication (aka not exactly-once). Low concurrency. FAST DATA INGEST, LOW LATENCY QUERIES, HIGH CONCURRENCY
    15. How It Works Now
    16. After
    17. THE PINTEREST REAL-TIME ARCHITECTURE: REAL-TIME ANALYTICS. TECHNICAL BENEFITS: Instant accuracy to the latest re-pin. 1 GB/sec totaling 72 TB/day
    18. RESULTS: Accelerated ingest time by 200,000x. 1 GB/sec totaling 72 TB/day
    19. Visualizing The Data
    21. Visualizing the Data: Demo built using Mapbox, WebSockets, and the Tornado web server. When an image is repinned, the circles on the globe expand, showing higher-volume areas. Reads data from MemSQL directly.
    22. DEMO
    23. Questions?
    24. More Info: http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/ https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/ https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming http://blog.memsql.com/pinterest-apache-spark-use-case/ https://engineering.pinterest.com/blog/real-time-analytics-pinterest
    25. Resources: https://github.com/memsql/memsql-spark-connector http://docs.memsql.com/docs/streamliner-administration http://docs.memsql.com/docs/pipelines-overview https://github.com/memsql/memsql-docker-quickstart
    26. Thank You