This document introduces a self-service, metadata-driven data loading platform developed by Walmart to simplify and optimize onboarding and running data applications. Its key components are a centralized metadata store, connectors that integrate various data sources and targets, an orchestrator that builds optimized execution plans, a schedule optimizer that prioritizes jobs, and telemetry dashboards for monitoring. The platform aims to dramatically increase developer productivity, provide a low-code experience, and intelligently manage resources and job scheduling across applications.
3. Agenda
• Personalization @Walmart
• Challenges
• Solution Approaches
• High Level System Architecture
• Metadata Design and Connectors
• Orchestrator
• Schedule Optimizer
• Telemetry
4. Personalization @Walmart
• Our customers are becoming increasingly omnichannel
• ~220M customers & members visit ~10,500 stores & clubs under 46 banners in 24 countries & eCommerce websites in a week
• Billions of product impressions are served every week, generating petabytes of events
• We, the FE team, run thousands of data applications to generate the features that power personalized recommendations to our customers
[Chart: customer omnichannel adoption — Walmart General Merchandise, +Walmart Grocery, Store Pickup & Delivery, +Walmart Stores]
5. Personalization | Data Landscape
[Diagram: layered data landscape — User Experience & Access Control; Security, Logging, Alerting, Telemetry; personas (Data Engineers, Data Scientists, Data Analysts); Data Apps | Data Loader Platform; Multi-DC and Public Cloud; Streaming | In-Memory | NoSQL | Analytical]
6. • Data application onboarding requires a lot of manual hand coding; developers need time to develop, integrate, and test code to solve the underlying complexities
• Building a functionality-rich application requires integrating various big data technologies and a wide array of data sources, sinks, and data processors
• Deployments are isolated, making it difficult to control resource allocation/usage and perform retrospection
• Competing high- and low-priority applications introduce latency at the serving layers
Challenges
7. Challenges | New App Onboarding | Cumbersome & Fragile
For each of Applications 1..N, every step is hand-built and repeated:
Integrate Source System → Integrate Target System → Develop Processor → Implement Schedule → Enable Telemetry → Allocate Resource → Test and Deploy
8. Data Loader Simplifies the Onboarding
With the Data Loader Platform, each of Applications 1..N only needs to:
Configure (Source System, Target System, Processor, Schedule, Telemetry, Resource) → Test and Deploy
The platform supplies the building blocks: Parsers, Connectors, Processors, Schedulers, Execution Plan, Dashboard
9. • A centralized, metadata-driven data loading platform with plug-and-play onboarding capability
• An abstraction layer for building workflow orchestration, which simplifies complex service integrations and shortens time to deployment
• A compelling UI that dramatically increases developer productivity by providing ready-to-use connectors for configuring the business logic
• An intelligent system that provides optimized recommendations based on previous runs
• A smart run schedule pool that enqueues and dequeues run instances based on priority
Solution Approach
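To make the metadata-driven approach concrete, here is a hypothetical example of the kind of application config a developer would submit instead of writing code. All field names and values are illustrative assumptions, not the platform's actual schema.

```python
import json

# Hypothetical onboarding metadata for one data application.
APP_CONFIG = json.loads("""
{
  "app_name": "item_feature_loader",
  "priority": "HIGH",
  "source":    {"type": "kafka",     "topic": "product.impressions", "format": "avro"},
  "processor": {"type": "sql",       "query": "SELECT item_id, COUNT(*) AS views FROM events GROUP BY item_id"},
  "target":    {"type": "cassandra", "keyspace": "features", "table": "item_views"},
  "schedule":  {"cron": "*/15 * * * *"},
  "resources": {"executors": 8, "executor_memory_gb": 4}
}
""")

def validate(config):
    """Check that the onboarding config names every required section."""
    required = {"app_name", "priority", "source", "processor", "target", "schedule"}
    missing = required - config.keys()
    if missing:
        raise ValueError(f"missing config sections: {sorted(missing)}")
    return True
```

With a config like this, the platform can wire the source, processor, target, and schedule together without any application-specific code.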
12. • The platform is equipped to parse and handle common data formats such as JSON, Avro, Parquet, and CSV
• Users can pick from existing connectors supporting different source and target systems such as Kafka, Cassandra, and BigQuery (BQ)
• The metadata store holds system- and application-specific resource configuration to optimize resource allocation
• An abstraction layer bundled with custom UDFs gives users the flexibility to query systems like Kafka and Cassandra with SQL
Connectors
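A plug-and-play connector model typically maps the "type" field of a source or target config onto a registered factory. The sketch below shows one minimal way to do that; the class and function names are assumptions for illustration, not the platform's real interfaces.

```python
from typing import Callable, Dict

class ConnectorRegistry:
    """Maps a system type ("kafka", "cassandra", ...) to a connector factory."""

    def __init__(self):
        self._factories: Dict[str, Callable] = {}

    def register(self, system_type: str):
        """Decorator that registers a factory under a system type."""
        def decorator(factory: Callable):
            self._factories[system_type] = factory
            return factory
        return decorator

    def create(self, config: dict):
        """Look up the factory for config["type"] and build the connector."""
        factory = self._factories[config["type"]]
        return factory(config)

registry = ConnectorRegistry()

@registry.register("kafka")
def kafka_connector(config):
    # A real connector would open a Kafka consumer here; we return a
    # descriptive string to keep the sketch self-contained.
    return f"kafka://{config['topic']}"

@registry.register("cassandra")
def cassandra_connector(config):
    return f"cassandra://{config['keyspace']}.{config['table']}"
```

Onboarding a new system then means registering one more factory, with no change to existing applications.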
13. Sample Domain API call in SQL UDF
• Accessing new domain APIs normally requires a lot of engineering effort to integrate them into any data application
• Wrapping domain APIs in UDFs lets them be used in a parallel computation engine like Spark, which accepts UDFs in SQL
spark.sql("select getAccountStatus('cust_id:xxxxxxxxx') as is_active from table limit 1").show(false)
+---------+
|is_active|
+---------+
|Y        |
+---------+
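A pure-Python sketch of the pattern behind that slide: wrap a domain API call in an ordinary function, which the platform would then register as a Spark SQL UDF. The `ACCOUNTS` lookup is a stand-in for a real domain API client, and `get_account_status` is a hypothetical name mirroring the slide's `getAccountStatus`.

```python
# Stub standing in for a remote domain API the UDF would call.
ACCOUNTS = {"cust_id:xxxxxxxxx": True}

def get_account_status(customer_key: str) -> str:
    """UDF body: query the domain API and map the response to 'Y'/'N'."""
    return "Y" if ACCOUNTS.get(customer_key, False) else "N"

# With a SparkSession available, registration would look roughly like:
#   spark.udf.register("getAccountStatus", get_account_status)
#   spark.sql("SELECT getAccountStatus('cust_id:xxxxxxxxx') AS is_active FROM table LIMIT 1")
```

Once registered, analysts can call the domain API directly from SQL without touching the integration code.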
14. Orchestrator
• Builds the optimized execution plan based on the application configs from the metadata store
• Responsible for generating run instances based on app priority and source systems
• Executors pick up the optimized execution plan during execution
[Diagram: Orchestrator — reads app config from the Metadata Store, the Job Optimizer builds the plan, the Run Scheduler generates run instances, and Executors run them]
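The run-instance generation step above can be sketched as a small function that turns app configs into prioritized run instances. The priority labels, field names, and plan shape are illustrative assumptions.

```python
# Hypothetical priority ranking; lower rank runs first.
PRIORITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def generate_run_instances(app_configs):
    """Emit one run instance per app, ordered by priority, then app name."""
    instances = [
        {"app": cfg["app_name"],
         "priority": cfg["priority"],
         # A toy execution plan: read from source, process, write to target.
         "plan": [cfg["source"]["type"], "process", cfg["target"]["type"]]}
        for cfg in app_configs
    ]
    return sorted(instances,
                  key=lambda r: (PRIORITY_RANK[r["priority"]], r["app"]))
```

A real orchestrator would also fold in source-system readiness and historical run statistics when ordering the plan.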
15. • Smart priority groups are assigned per loader for all applications based on criticality
• Top-priority jobs take precedence over already-scheduled lower-priority ones by dequeuing them
• Lower-priority jobs resume automatically once all top-priority and SLA-bound jobs are complete
Schedule Optimizer
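The "smart run schedule pool" described above behaves like a priority queue: a newly enqueued top-priority job is picked before already-queued lower-priority ones, which resume afterwards. A minimal sketch using the standard-library heap, assuming numeric priorities where 0 is the most urgent:

```python
import heapq
import itertools

class RunSchedulePool:
    """Priority pool: lowest priority number dequeues first; FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving enqueue order

    def enqueue(self, priority: int, job: str):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]
```

Even if low-priority jobs were queued first, an SLA-bound job enqueued later with priority 0 is dequeued ahead of them, and the low-priority jobs run afterwards in their original order.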
17. • Real-time dashboards that provide runtime statistics for each application
• An insightful experience for deep-diving into various metrics
• An alerting and notification mechanism that informs app owners of any error or fault scenarios
• A consolidated view of all applications with their success/failure ratios
Telemetry
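The consolidated success/failure view can be sketched as a simple aggregation over run events. The event shape (`{"app", "status"}`) is an assumption made for illustration.

```python
from collections import defaultdict

def success_failure_ratio(run_events):
    """Aggregate per-app run outcomes into a success ratio (0.0 to 1.0)."""
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for event in run_events:
        key = "success" if event["status"] == "SUCCESS" else "failure"
        counts[event["app"]][key] += 1
    return {
        app: round(c["success"] / (c["success"] + c["failure"]), 2)
        for app, c in counts.items()
    }
```

A real telemetry pipeline would stream these events into a dashboard and fire alerts when a ratio drops below a threshold.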
18. Putting the pieces together
• Self-Service Metadata Store
• Multiple Execution Engines
• E2E App Life Cycle Management
• Multiple Source & Target Systems
• Telemetry
• Version Control & CI/CD
• Cloud Native
• Plug & Play
• Low or No Code
19. • Quick turnaround: onboarding time drops from weeks to days
• Developer productivity is expected to increase severalfold
• Non-engineering teams can also leverage this platform to build functional applications with only SQL knowledge
• Intelligent app execution based on app priority, favoring SLA-bound applications over non-SLA ones
Outcome
Large data-driven enterprises need such a platform for all data processing tasks, ranging from ingestion through ETL and data quality processing to advanced analytics and machine learning jobs.