
Lambda Processing for Near Real Time Search Indexing at WalmartLabs: Spark Summit East talk by Snehal Nagmote


At WalmartLabs, millions of product updates and new products are ingested every day. In our quest to provide a seamless shopping experience for our customers, we developed a near-real-time indexing data pipeline. The pipeline is a key component in keeping the dynamically changing product catalog up to date, along with features such as store and online availability, offers, etc.

Our indexing component, which is based on the Spark Streaming receiver approach, consumes events from multiple Kafka topics such as Product Change, Store Availability, and Offer Change, and merges the transformed product attributes with historical signals computed by the relevance data pipeline and stored in Cassandra. This data is further processed by another streaming component, which partitions documents into a Kafka topic per shard so they can be indexed into Apache Solr for product search. Deployment of the pipeline is automated end to end.
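The per-shard routing step can be sketched in a few lines. This is an illustrative Python sketch, not the WalmartLabs code: the topic naming scheme, the hash choice, and the shard count are all assumptions for the example.

```python
# Route a document to the Kafka topic of the Solr shard that owns it.
# Assumption: shards are chosen by a stable hash of the product id,
# and each shard has its own Kafka topic (names are made up here).
import hashlib

NUM_SHARDS = 8  # hypothetical Solr shard count

def shard_topic(product_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Pick the per-shard Kafka topic for a product id."""
    # Stable hash so the same product always lands on the same shard.
    digest = hashlib.md5(product_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_shards
    return f"solr-shard-{shard}"
```

Because the hash is stable, all updates for one product flow through one topic and are indexed by one shard, preserving per-product ordering downstream.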

Published in: Data & Analytics


  1. Lambda Processing for Near Real Time Search Indexing - Snehal Nagmote, @WalmartLabs
  2. WalmartLabs Use Case • Why Lambda Processing • NRT Architecture Overview • Implementation • Monitoring • Spark Application Tuning • Lessons Learnt
  3. Use Case: Product Search Indexing - Suppliers/Merchants/Sellers -> Item Setup -> Product Categorization, Shipping Logistics, Offers, Price Adjustments -> Ecommerce Search
  4. Use Case: Near Real Time Indexing • Improve customer experience • Update product information: index new products, product attribute changes, product offer (online availability) events • 86 million product change events/day • 1 product -> 5000 stores • Store availability change events ~ 20K events/sec
  5. Motivation for Spark • Offline/full indexing - integration with the Spark batch job • Maintain the same code base/logic to ease debugging • Potentially leverage the same technology stack for batch and streaming
  6. Challenges • Merge real time data with historic signals updated at different frequencies • Update to the latest value of an attribute across multiple pipeline updates • Dynamic configuration updates in the streaming component • Managing start/stop of Spark Streaming components
  7. Product Attributes - Real time streaming attributes (60+): Availability, Offers (lowest price), Product title, Product reviews, Product description, ... • Batch computed attributes (20+): Item score, Facets
  8. Lambda Architecture Processing Overview • Historic data computed by the batch pipeline is stored in Cassandra • Automatic management of the latest version of data fields • Merge real time data with historic signals to compute the complete dataset
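The "latest version wins" merge of batch signals and streaming updates can be sketched as follows. This is an illustrative Python sketch only: the field names, the (value, version) encoding, and the versioning scheme are assumptions, not the WalmartLabs implementation.

```python
# Lambda merge sketch: start from batch-computed historic signals
# (as they would be read from Cassandra) and fold in real-time events,
# keeping the highest-versioned value per field.
def lambda_merge(historic: dict, realtime_events: list) -> dict:
    """Merge per-field (value, version) pairs; the latest version wins."""
    doc = dict(historic)  # begin with the batch-computed signals
    for event in realtime_events:
        for field, (value, version) in event.items():
            _, current_version = doc.get(field, (None, -1))
            if version > current_version:
                doc[field] = (value, version)
    return doc

historic = {"item_score": (0.87, 100), "title": ("Widget", 100)}
events = [
    {"title": ("Widget Pro", 105)},
    {"offer_price": (9.99, 103)},
    {"title": ("Stale Title", 101)},  # older than version 105, ignored
]
merged = lambda_merge(historic, events)
```

Comparing versions rather than arrival order is what makes the merge safe when multiple pipelines update the same attribute at different frequencies.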
  9. 9. Lambda Merge
  10. 10. Indexing Data Pipeline
  11. Implementation • Reprocessing? • Event ordering? • Synchronization of configuration updates? • Start/stop of the streaming component? • Orchestration with full index updates?
  12. Streaming Component Interaction • Spark Streaming receiver approach • Processing multiple Kafka streams • Store offsets in ZooKeeper • Kafka partitions by ID
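The idea behind storing offsets in ZooKeeper can be sketched in plain Python: commit an offset only after the batch has been processed, so a restart replays from the last committed offset (at-least-once delivery). A dict stands in for ZooKeeper here, and the topic name and indexing step are made up for the example.

```python
# Offset-tracking sketch: process first, commit second.
zk = {}  # stand-in for ZooKeeper: (topic, partition) -> next offset to read

def index_document(msg):
    """Placeholder for the Solr indexing step."""
    pass

def process_batch(topic, partition, messages, start_offset):
    for msg in messages:
        index_document(msg)  # do the work first...
    # ...then commit, so a crash before this line causes a replay,
    # never a silent loss of events.
    zk[(topic, partition)] = start_offset + len(messages)

process_batch("product-change", 0, ["m1", "m2", "m3"], start_offset=42)
```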
  13. Monitoring • Extended the Spark Metrics API • Register custom accumulators/gauges for key metrics • Kafka consumer lag tracked with custom scripts • Grafana dashboard for visualization
  14. Tuning • Scheduling delay = 0 • Partition RDDs effectively - in multiples of the number of Spark workers • Prefer coalesce over repartition • spark.streaming.backpressure.enabled • spark.shuffle.consolidateFiles
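The two configuration flags named above would typically be set in spark-defaults.conf or passed via --conf; a minimal fragment with just those settings (the values shown are simply "on", not tuned values from the talk):

```
spark.streaming.backpressure.enabled   true
spark.shuffle.consolidateFiles         true
```

Backpressure lets Spark Streaming throttle the receive rate when batches fall behind, which keeps the scheduling delay near the zero target on the slide.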
  15. Lessons: Querying Cassandra • Worst - filter on the Spark side: sc.cassandraTable().filter(partitionKey in keys) • Bad - filter on the C* side in a single operation: sc.cassandraTable().where(keys in productIds), similar to an IN query clause (e.g. SELECT * FROM my_keyspace.users WHERE id IN (1,2,3,4)) • Best - filter on the C* side in a distributed and concurrent fashion: kafkaRDD.joinWithCassandraTable()
  16. A little more about the IN clause: it fans out into multiple requests through a single coordinator, and the IN clause fails if any one of them fails
  17. Lessons • Spark locality wait - avoid the ANY locality level (spark.locality.wait = 3s) • Connection keep-alive - spark.cassandra.connection.keep_alive_ms • Cache RDDs!
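A minimal sketch of how the two settings named above might appear in spark-defaults.conf; the keep-alive value is an assumption for illustration, not a number from the talk:

```
# Time to wait for a better locality level before degrading; keeping a
# wait (the 3s default) helps avoid falling back to the ANY level
spark.locality.wait                        3s

# Keep Cassandra connections open between micro-batches (value illustrative)
spark.cassandra.connection.keep_alive_ms   10000
```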
  18. Thank You! Questions? - Snehal Nagmote, @WalmartLabs