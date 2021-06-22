
Add Horsepower to AI/ML streaming Pipeline - Pulsar Summit NA 2021

The more time data science teams spend on model training, the less business value they add, because a model creates no value until it is deployed in production. Traditional HDD-based systems are not suitable for training, which is very I/O intensive because of the complex transformations involved in data preparation. Moreover, training is not a one-time process. Trends and patterns in the data keep changing rapidly, so models need to be retrained to address drift and to keep improving performance in production. Data scientists often experiment with thousands of models, and speeding up the process has significant business implications.

In this talk, we will cover how you can accelerate an AI/ML pipeline by speeding up data loads using the Aerospike database, which leverages its hybrid memory architecture to achieve sub-millisecond reads and writes. In a hybrid memory architecture, the index is stored in memory (not persisted), and data is stored on persistent storage (SSD) and read directly from the disk. Disk I/O is not required to access the index. For time-sensitive and high-throughput use cases such as fraud detection, you need a transactional database at the edge that can handle high-velocity ingestion and support millions of IOPS. The events are then streamed downstream to your AI/ML platform for training or to your inference server for predictions. We will share the reference architecture of a highly performant AI/ML training and inference pipeline consisting of Apache Pulsar, Apache Spark 3.0, the Aerospike database, and its Spark and Pulsar connectors. This architecture can be extended to other use cases that demand low latency and high throughput without blowing your budget.
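
The hybrid memory idea described above (index in memory, record data on persistent storage) can be illustrated with a toy sketch. This is not Aerospike's implementation or API: the class `ToyHybridStore`, its methods, and the file name are hypothetical names chosen for illustration, and an ordinary file stands in for the SSD.

```python
import os
import tempfile


class ToyHybridStore:
    """Toy key-value store: the index lives in RAM, record bytes live in a file.

    Hypothetical illustration only; it is not how Aerospike is implemented.
    """

    def __init__(self, path):
        # key -> (offset, length). Held only in memory, never written to disk.
        self._index = {}
        self._file = open(path, "w+b")

    def put(self, key, value):
        # Append the record to the data file and remember where it landed.
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(value)
        self._file.flush()
        self._index[key] = (offset, len(value))

    def get(self, key):
        # The index lookup is a pure in-memory operation (no disk I/O).
        offset, length = self._index[key]
        # A single positioned read fetches the record from storage.
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    store = ToyHybridStore(os.path.join(tmp_dir, "records.dat"))
    store.put("user:1", b'{"score": 0.97}')
    store.put("user:2", b'{"score": 0.12}')
    print(store.get("user:2"))
    store.close()
```

The point of the sketch is the access pattern: locating a record costs no disk I/O, so each read needs at most one storage access.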

  1. Pulsar Virtual Summit North America 2021. Kiran Matty, Director of Product Management, Aerospike
  2. whoami: Director of Product for Ecosystem @ Aerospike; domain experience spans Big Data Infrastructure and Data Security @ Visa, Hortonworks, and Cisco; interests include large-scale distributed systems and AI/ML; Lego builder in spare time
  3. Training can take forever… Training time: minutes to hours; 1 - 4 days; 1 - 4 weeks; > 1 month. Source: Google I/O 2018
  4. AI/ML needs Hybrid Storage: traditional HDD-based systems are not suitable for training; models need to be retrained to address data/model drift. Source: Micron
  5. AI/ML needs memory-like access at petabyte scale with lower TCO
  6. Why do other databases fall short?
  7. Aerospike drives data-driven decisioning use cases: High Frequency Trading; IIoT / Predictive Maintenance; Fraud Detection; Personalization / Customer 360°; AdTech Real-Time Bidding
  8. A Blueprint for AI/ML. Diagram labels: Cloud / On-Prem; Container Platform; Storage; Compute; Notebook & ML Packages; Python Client; Connect for Spark; Connect for Pulsar
  9. Why Pulsar? Durability; Scalability; Geo-Replication; Multi-Tenancy; Unified Messaging Model
  10. Mapping Aerospike <> Pulsar data models (Aerospike / RDBMS / Pulsar): Namespace / Database / Topic; Set (optional) / Table / Topic; Record / Row / Record; Bin / Column / Fields (based on schema); Key / Key / Key. Mapping is via YAML files.
  11. Aerospike Connect for Pulsar. Diagram labels: IoT/edge devices; Publisher; Subscriber; Pub/Sub API; Reader and Batch API; Pulsar IO/Connectors (Prebuilt Connectors, Custom Connectors); Stream Processor; Applications; Microservices or Event-Driven Architecture; Aerospike Source Connector; Aerospike Sink Connector* (*not GA'd); Schema Registry; Change Notifications. Sample change notification: {"metadata":{"namespace":"device","set":"streaming_write_set","digest":"SH0QwiJxdW5Wkf/hAVJGn7Sw37U=","msg":"write","gen":38,"lut":0,"exp":0},"three":37089,"two":"two_89","one":37089}
  12. Speeding up Training Pipeline (Conceptual View). Diagram labels: Aerospike Database (System of Record); Third Party Data; Connect for Spark; AI/ML Platform (Data Preparation, Exploratory Data Analysis, Model Training, Parameter Tuning, Model Validation); Data Scientist; Model Serving; ML Application; HTTP
  13. Real-time Inference (Conceptual View). Diagram labels: Edge Systems across Datacenters; Aerospike Database (Edge Location 1 through Edge Location n); XDR; Aerospike Database (Core System); Streaming Source; Connect for Pulsar; Pulsar; Pulsar Spark Connector; Connect for Spark; Data Preparation; Model Serving; Predictions; ML Application; Application Specialist; HTTP; API
  14. Case study (global ad tech company): Massive Parallelism w/ Aerospike and Spark. Massive parallelization: 80% reduction in Spark job execution time; reduced training time; increased frequency of retraining. Operational reliability at extreme scale: 13B objects; 150 TB unique data, multiple times a day. Increased ROI: only 33 Aerospike servers; increased utilization of Spark cluster (300 nodes and 7,500 cores). "We were using custom code before which led to data quality issues and a complex data infrastructure. With Aerospike, we are processing Spark jobs that used to take 12 hours now in just 2.4." Senior Director, Data Science and Engineering, top global ad tech company
  15. The Aerospike Difference for AI/ML: 1. Reduce Training Time; 2. Maximize ROI; 3. Increase Frequency of Re-Training. Supporting points: execute Spark jobs faster with massive parallelism; conduct in-place data exploration; create low latency and high throughput streaming pipeline; eliminate compliance headaches by removing the need to copy data into multiple systems; Aerospike data platform connects readily to Spark and Pulsar. "Aerospike is second to none for ingesting and persisting millions of events per second… (Aerospike) allows me to do near-instantaneous machine learning on the data as it lands." Theresa Melvin, Chief Architect of AI-Driven Big Data Solutions, HPE
  16. Thank you. We are hiring for our India and US offices. https://aerospike.com/solutions/use-cases/ai-ml/
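
As a rough illustration of the change notification shown on slide 11, the sketch below parses that sample message with the Python standard library and separates the record metadata from the remaining fields. Two assumptions are made here: the key that appears split in the transcript is read as `gen`, and every top-level key other than `metadata` is treated as a bin, which the sample suggests but the slides do not state. The helper name `split_notification` is hypothetical and not part of any Aerospike or Pulsar API.

```python
import base64
import json

# Sample change notification copied from slide 11 ("gen" is an assumed repair
# of a key that was split across lines in the transcript).
SAMPLE_NOTIFICATION = (
    '{"metadata":{"namespace":"device","set":"streaming_write_set",'
    '"digest":"SH0QwiJxdW5Wkf/hAVJGn7Sw37U=","msg":"write",'
    '"gen":38,"lut":0,"exp":0},'
    '"three":37089,"two":"two_89","one":37089}'
)


def split_notification(raw):
    """Return (metadata, bins) for one JSON change notification.

    Assumption: all top-level keys except "metadata" are bins.
    """
    message = json.loads(raw)
    metadata = message.get("metadata", {})
    bins = {name: value for name, value in message.items() if name != "metadata"}
    return metadata, bins


metadata, bins = split_notification(SAMPLE_NOTIFICATION)
# The digest in the sample is base64 text; decode it to raw bytes.
digest = base64.b64decode(metadata["digest"])

print(metadata["namespace"], metadata["set"], metadata["msg"])
print(len(digest), "byte digest")
print(bins)
```

A subscriber on the Pulsar side could apply the same split before handing the bins to a feature pipeline; how namespaces and sets map to topics is configured in the connector's YAML files, which this sketch does not model.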

