3. Reporting
   Sales - e.g. orders by product, by geo
   Logistics - 90th percentile delivery times (sketch below)
Learning
   User affinity to a product or price
   Demand forecast of products by geo
Realtime Operational Reporting
   Congestion points in the supply chain
   Demand shaping with product offers and serviceability
Adhoc Analysis
   Find causes of returns in a category
Data Applications
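As an aside on the logistics metric above: a minimal Java sketch of a nearest-rank 90th-percentile computation over delivery durations. The class, method and sample data are illustrative only, not part of any Flipkart system.

    import java.util.Arrays;

    public class P90 {
        // Nearest-rank percentile: sort, then pick the ceil(p% * n)-th value.
        static double percentile(double[] values, double p) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[rank - 1];
        }

        public static void main(String[] args) {
            // Hypothetical delivery durations in hours.
            double[] deliveryHours = {12, 18, 24, 24, 30, 36, 48, 50, 72, 96};
            System.out.println(percentile(deliveryHours, 90));  // 72.0
        }
    }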
5. Challenges with the traditional approach
   Data preparation is a huge task
   OLTP models are not conducive to analysis
   Challenges as we grow:
      Data gets siloed across multiple systems
      Processing large volumes of data, and in realtime
7. Flipkart Data approach
   Standardised data definitions
      All data backed by a strong schema defined before ingestion (sketch below)
      Instrumentation & ingestion as part of the SDLC
   Central data platform
      Abstracts data applications from data-infra complexities
      Ensures data quality - completeness, correctness
      Scalable to support ingestion, processing and reporting in various flavours
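The deck does not name a serialization format, but the schema-before-ingestion idea can be sketched in Java assuming Avro; the OrderCreated schema and its fields are hypothetical. The point is that an event failing validation never flows downstream.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class SchemaFirstIngestion {
        // Hypothetical event schema, declared before any data is ingested.
        private static final String ORDER_EVENT_SCHEMA = "{"
            + "\"type\": \"record\", \"name\": \"OrderCreated\","
            + "\"fields\": ["
            + "  {\"name\": \"order_id\", \"type\": \"string\"},"
            + "  {\"name\": \"amount\",   \"type\": \"double\"},"
            + "  {\"name\": \"geo\",      \"type\": \"string\"}"
            + "]}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(ORDER_EVENT_SCHEMA);

            // Build an event against the schema; a record that does not
            // conform would fail validation rather than silently flow on.
            GenericRecord event = new GenericData.Record(schema);
            event.put("order_id", "OD123");
            event.put("amount", 499.0);
            event.put("geo", "IN-KA");

            System.out.println(GenericData.get().validate(schema, event));  // true
        }
    }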
10. Challenges in central infra
   Support for very different workloads:
      Canned & scheduled
      Adhoc & interactive
      Realtime and batch
      Conducive to both systemic and human consumption
      Self-serve
   Reliability across this variety of workloads at scale is non-trivial
11. Primitives
   Entity and event schemas - versioned
   At-least-once semantics - entity data logged as part of the transaction (sketch below)
   Facts - realtime and batch
   Pipeline - uses Hive, MR, Spark, Storm
   Unified query and reporting layer across Hadoop, Vertica and Elasticsearch
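At-least-once delivery implies the same entity event may arrive more than once, so downstream consumers deduplicate on a stable event id. A minimal Java sketch with an in-memory seen-set (a real pipeline would persist this state); EntityEvent and its fields are hypothetical names.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class IdempotentConsumer {
        record EntityEvent(String eventId, String payload) {}

        private final Set<String> seen = ConcurrentHashMap.newKeySet();

        public void consume(EntityEvent event) {
            // add() returns false if this id was already processed - skip replays.
            if (!seen.add(event.eventId())) {
                return;
            }
            apply(event);
        }

        private void apply(EntityEvent event) {
            System.out.println("applied " + event.eventId() + ": " + event.payload());
        }
    }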
12. Realtime
   Higher-level abstraction for stream-to-stream joins (sketch below)
   Aggregations at query time
   Mutable query store - ES
   Pipeline - Storm
   Generated streams can be consumed by other systems
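The slides do not show the join primitive itself; here is a minimal Java sketch of the underlying idea: buffer events from each stream by key inside a time window and emit a joined tuple whenever both sides match. All names are illustrative; the actual primitive runs on Storm and is not shown here.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    public class WindowedStreamJoin {
        static final long WINDOW_MS = 60_000;  // join window

        record Event(String key, String value, long ts) {}
        record Joined(String key, String left, String right) {}

        private final Map<String, Deque<Event>> leftBuf = new HashMap<>();
        private final Map<String, Deque<Event>> rightBuf = new HashMap<>();

        public void onLeft(Event e)  { join(e, leftBuf, rightBuf, true);  }
        public void onRight(Event e) { join(e, rightBuf, leftBuf, false); }

        private void join(Event e, Map<String, Deque<Event>> own,
                          Map<String, Deque<Event>> other, boolean eIsLeft) {
            // Buffer the new event, evicting anything older than the window.
            Deque<Event> mine = own.computeIfAbsent(e.key(), k -> new ArrayDeque<>());
            mine.add(e);
            mine.removeIf(m -> e.ts() - m.ts() > WINDOW_MS);

            // Match against the other stream's in-window events for this key.
            Deque<Event> matches = other.getOrDefault(e.key(), new ArrayDeque<>());
            matches.removeIf(m -> e.ts() - m.ts() > WINDOW_MS);
            for (Event m : matches) {
                emit(eIsLeft ? new Joined(e.key(), e.value(), m.value())
                             : new Joined(e.key(), m.value(), e.value()));
            }
        }

        // Emitted joins would feed other systems; printing stands in for that.
        private void emit(Joined j) { System.out.println(j); }
    }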
13. Order of Scale
   - Terabytes of data generated in a day
   - Billions of raw events in a day
   - Thousands of raw data streams
   - A petabyte of data processed in a day
   - Thousands of Hadoop jobs run in a day
   - Thousands of report views in a day
14. As adoption grew: further challenges
   Proliferation in the number of pipelines and reports, with huge overlap
   How do we measure the quality of data?
   How long do we store different kinds of data? Forever?
   Who owns a dataset? Who is responsible for maintaining data freshness? …
   How do we incentivise the right behaviour?
15. DDF
   Distributed Data Frame - an abstraction representing a fully managed dataset (no relation to a Spark or R dataframe); interface sketch below
   Abstracts the physical representation - both streaming and batch data
   Natively built data quality measures:
      Correctness
      Completeness
      Freshness
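To make the abstraction concrete, a sketch of what a DDF-style contract could look like in Java: one handle that hides whether the dataset is stream- or batch-backed and exposes the three quality measures as first-class reads. The interface and method names are illustrative, not Flipkart's actual API.

    import java.time.Duration;
    import java.util.Iterator;

    public interface DistributedDataFrame<T> {
        String name();
        int schemaVersion();

        // Physical representation is abstracted: the same dataset can be
        // read as a bounded batch or tailed as an unbounded stream.
        Iterator<T> batchRead(String partition);
        void subscribe(java.util.function.Consumer<T> handler);

        // Natively built data quality measures.
        double correctness();   // fraction of records passing schema/rule checks
        double completeness();  // fraction of expected records that arrived
        Duration freshness();   // lag between event time and availability
    }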
16. DDF …
   Supports schema evolution with versioning (sketch below)
   Access control policies
   Dependency and lifecycle management policies
   Discovery - schema-field, quality and usage aware
   Backup and restore policies
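Schema evolution with versioning can be illustrated with Avro's schema resolution (Avro itself is an assumption here, as earlier): a record written with schema v1 stays readable under v2 because the new field carries a default. The Order schema and its fields are hypothetical.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class SchemaEvolutionDemo {
        static final String V1 = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"order_id\",\"type\":\"string\"}]}";
        // v2 adds a field with a default, so v1 data stays readable.
        static final String V2 = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"order_id\",\"type\":\"string\"},"
            + "{\"name\":\"channel\",\"type\":\"string\",\"default\":\"web\"}]}";

        public static void main(String[] args) throws Exception {
            Schema writer = new Schema.Parser().parse(V1);
            Schema reader = new Schema.Parser().parse(V2);

            // Write a record with the old (v1) schema.
            GenericRecord rec = new GenericData.Record(writer);
            rec.put("order_id", "OD123");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
            enc.flush();

            // Read it back with the new (v2) schema: schema resolution
            // fills in the defaulted field, so old data keeps working.
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord upgraded =
                new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
            System.out.println(upgraded);  // {"order_id": "OD123", "channel": "web"}
        }
    }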
20. Summary
   Data as part of the SDLC
   Central platform abstracts data-stack complexities
   Higher-level constructs for stream-to-stream joins
   Schema evolution and change management
   Data quality is not just schema adherence
   Resource management to ensure reliability and quality of service