Data Infrastructure
@ Flipkart
VLDB 2016 Delhi
Sharad Agarwal
search
browse
order
ship
deliver
catalog
review & ratings
sellers
personalisation
recommendation
pricing
offers
delivery promise
demand forecast
inventory
wave planning
transport planning
Route planning
Data In Ecommerce
Reporting
Sales - e.g. orders by product, geo
Logistics - e.g. 90th percentile delivery times
Learning
User affinity to a product, price
Demand forecast of products by geo
Realtime Operational Reporting
Congestion points in supply chain
Demand shaping with product offers and serviceability
Adhoc Analysis
find causes of returns in a category
Data Applications
Traditional Approach
Data preparation is a huge task
OLTP model not conducive to analysis
Challenges as we grow
Data gets siloed in multiple systems
Need to process large data volumes, and in realtime
Challenges with traditional
approach
Data is no longer an
afterthought
How do we solve this?
Standardised data definitions
All data backed by strong schema defined before
ingestion
Instrumentation & ingestion as part of SDLC
Central data platform
abstract data applications from data infra
complexities
ensures data quality - completeness, correctness
scalable to support ingestion, processing, reporting
in various flavours
Flipkart Data approach
Data platform
Majority of tech challenges
now moved to data infra
Support very different workloads
Canned & scheduled
Adhoc & interactive
Realtime and batch
Conducive for systemic and human
consumption
Self serve
Reliability with variety of workloads
at scale is non-trivial
Challenges in central infra
Entity and Events schemas - versioned
At-least-once semantics - entity data
logged as part of transaction
Facts - realtime and batch
Pipeline - Uses Hive, MR, Spark, Storm
Unified query and reporting layer
across Hadoop, Vertica and Elasticsearch
Primitives
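
At-least-once delivery means the same entity or event record can arrive more than once, so downstream consumers generally have to be idempotent. The following is a minimal sketch of that idea, not the platform's actual implementation: the `Event` fields (`event_id`, `schema_version`), `IdempotentSink` and `deliver_at_least_once` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str        # hypothetical: unique id assigned where the event is logged
    schema_version: int  # hypothetical: version of the schema the payload follows
    payload: dict


class IdempotentSink:
    """Applies each event once, even if it is delivered several times."""

    def __init__(self):
        self._seen = set()   # ids of events already applied
        self.applied = []    # events applied, in arrival order

    def handle(self, event):
        if event.event_id in self._seen:
            return False     # duplicate delivery, ignore
        self._seen.add(event.event_id)
        self.applied.append(event)
        return True


def deliver_at_least_once(events, sink, redeliver_ids):
    """Simulate at-least-once delivery: ids in redeliver_ids are sent twice."""
    for event in events:
        sink.handle(event)
        if event.event_id in redeliver_ids:
            sink.handle(event)


events = [Event("o-1", 1, {"order": 1}), Event("o-2", 1, {"order": 2})]
sink = IdempotentSink()
deliver_at_least_once(events, sink, redeliver_ids={"o-1"})
print([e.event_id for e in sink.applied])
```

Deduplicating on an id is only one option; an upsert keyed on the entity id into a mutable store would have a similar effect.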
Higher level abstraction for stream
to stream joins
Aggregations at query time
Mutable query store - ES
Pipeline - Storm
Generated streams can be consumed
by other systems
Realtime
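
A stream-to-stream join can be pictured as two keyed buffers with a time window. The sketch below is plain Python rather than Storm code, and it assumes events arrive roughly in timestamp order; `WindowedJoin`, `on_event` and the sample keys are hypothetical names, not part of the platform described here.

```python
from collections import defaultdict, deque


class WindowedJoin:
    """Joins 'left' and 'right' events that share a key within window_secs."""

    def __init__(self, window_secs):
        self.window = window_secs
        # One buffer per side, each mapping key -> deque of (timestamp, value).
        self.buffers = {"left": defaultdict(deque), "right": defaultdict(deque)}

    def on_event(self, side, ts, key, value):
        other = "right" if side == "left" else "left"
        candidates = self.buffers[other][key]
        # Evict events from the other side that fell out of the window.
        while candidates and candidates[0][0] < ts - self.window:
            candidates.popleft()
        matches = []
        for _, other_value in candidates:
            pair = (value, other_value) if side == "left" else (other_value, value)
            matches.append((key, *pair))
        # Remember this event so later events on the other side can join it.
        self.buffers[side][key].append((ts, value))
        return matches


join = WindowedJoin(window_secs=60)
joined = []
joined += join.on_event("left", 100, "order-1", "placed")
joined += join.on_event("right", 130, "order-1", "paid")
joined += join.on_event("right", 500, "order-1", "refunded")
print(joined)
```

The joined tuples form a new stream, which fits the point that generated streams can be consumed by other systems.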
- Terabytes of data generated in a day
- Billions of raw events in a day
- Thousands of raw data streams
- Petabyte of data processed in a day
- Thousands of Hadoop jobs run in a day
- Thousands of Report views in a day
Order of Scale
Proliferation in number of pipelines and
reports with huge overlap
How to measure quality of data?
How long to store different kinds of
data? Forever?
Who owns the dataset? Who is
responsible for maintaining data
freshness? …
How do we incentivise the right behaviour?
As adoption grew: further
Challenges
Distributed Data Frame - Abstraction to
represent a fully managed dataset (no
relation to Spark df or R df)
Abstracts physical representation - both
streaming and batch data
Natively built data quality measures:
Correctness
Completeness
Freshness
DDF
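
The slides name the three quality measures but do not define them. The sketch below shows one plausible way to compute them for a batch of records; the definitions are assumptions: correctness as the share of records with all required fields present, completeness as received versus expected record count, and freshness as the age of the newest record. `quality_report` and the `event_time` field are hypothetical.

```python
def quality_report(records, required_fields, expected_count, now):
    """Return assumed correctness, completeness and freshness for a dataset."""
    valid = [
        r for r in records
        if all(r.get(f) is not None for f in required_fields)
    ]
    # Share of received records that pass the required-field check.
    correctness = len(valid) / len(records) if records else 0.0
    # Share of the expected records that actually arrived.
    completeness = len(records) / expected_count if expected_count else 0.0
    # Age of the newest record, in the same unit as the timestamps.
    freshness = (now - max(r["event_time"] for r in records)) if records else None
    return {
        "correctness": correctness,
        "completeness": completeness,
        "freshness": freshness,
    }


records = [
    {"order_id": 1, "event_time": 90},
    {"order_id": None, "event_time": 95},
]
report = quality_report(records, ["order_id"], expected_count=4, now=100)
print(report)
```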
Supports schema evolution with
versioning
Access control policies
Dependency and Lifecycle
management policies
Discovery - schema field, quality
and usage aware
Backup and restore policies
DDF …
All data in
platform as ddf
Flow
Data pipeline that transforms, enriches or
aggregates one or more ddfs, producing a ddf
ddf = fn(ddfs)
flow
hive
storm
MR
spark
…
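
The relation ddf = fn(ddfs) can be illustrated with a small in-memory sketch. It is not the platform's API: the `DDF` class, its `upstream` lineage field, the `flow` helper and the sample datasets are all hypothetical, and a real flow would run on an engine such as Hive, Storm, MR or Spark rather than on Python lists.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DDF:
    name: str
    rows: list
    upstream: list = field(default_factory=list)  # names of the input ddfs


def flow(name, fn, *inputs):
    """Apply fn to the rows of the input ddfs and wrap the result as a ddf."""
    rows = fn(*(d.rows for d in inputs))
    return DDF(name=name, rows=rows, upstream=[d.name for d in inputs])


def orders_by_category(orders, products):
    # Enrich each order with its product category, then aggregate.
    category = {p["product_id"]: p["category"] for p in products}
    counts = Counter(category[o["product_id"]] for o in orders)
    return [{"category": c, "orders": n} for c, n in sorted(counts.items())]


orders = DDF("orders", [{"product_id": 1}, {"product_id": 2}, {"product_id": 1}])
products = DDF("products", [
    {"product_id": 1, "category": "books"},
    {"product_id": 2, "category": "toys"},
])
result = flow("orders_by_category", orders_by_category, orders, products)
print(result.rows, result.upstream)
```

Because the output is itself a ddf, it can feed further flows, and recording the input names gives a simple form of dependency tracking.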
Resource Management:
Quotas and limits to become fully
self serve and ensure Quality of
Service
Further work
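
The slide lists quotas and limits as further work without describing a design. As a rough sketch of the idea only, a per-team quota check might look like the following; `QuotaManager`, the notion of "slots" and the team names are hypothetical.

```python
from collections import defaultdict


class QuotaManager:
    """Tracks how many compute slots each team holds against a fixed limit."""

    def __init__(self, limits):
        self.limits = dict(limits)    # team -> maximum concurrent slots
        self.used = defaultdict(int)  # team -> slots currently held

    def try_acquire(self, team, slots):
        # Teams without a configured limit get no capacity.
        if self.used[team] + slots > self.limits.get(team, 0):
            return False
        self.used[team] += slots
        return True

    def release(self, team, slots):
        self.used[team] = max(0, self.used[team] - slots)


quotas = QuotaManager({"pricing": 10, "logistics": 4})
first = quotas.try_acquire("pricing", 8)    # fits within the limit of 10
second = quotas.try_acquire("pricing", 4)   # would exceed the limit
quotas.release("pricing", 8)
third = quotas.try_acquire("pricing", 4)    # fits again after the release
print(first, second, third)
```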
Data as part of SDLC
Central platform abstracts data stack
complexities
Higher level constructs for stream to
stream joins
Schema evolution and change management
Data quality is not just schema adherence
Resource management for ensuring
reliability and quality of service
Summary
Thanks!
sharad@apache.org

Week-01-2.ppt BBB human Computer interaction
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Data Infrastructure at Flipkart (VLDB 2016)

  • 1. Data Infrastructure @ Flipkart. VLDB 2016, Delhi. Sharad Agarwal
  • 2. Data in Ecommerce: search, browse, order, ship, deliver, catalog, review & ratings, sellers, personalisation, recommendation, pricing, offers, delivery promise, demand forecast, inventory, wave planning, transport planning, route planning.
  • 3. Data Applications. Reporting: sales (e.g. orders by product, geo); logistics (90th percentile delivery times). Learning: user affinity to a product or price; demand forecast of products by geo. Realtime operational reporting: congestion points in the supply chain; demand shaping with product offers and serviceability. Adhoc analysis: find causes of returns in a category.
  • 5. Challenges with traditional approach: data preparation is a huge task; the OLTP model is not conducive to analysis. Challenges as we grow: data gets siloed in multiple systems; large data has to be processed, and in real time.
  • 6. How do we solve this? Data is no longer an afterthought.
  • 7. Flipkart data approach. Standardised data definitions: all data backed by a strong schema defined before ingestion; instrumentation and ingestion as part of the SDLC. Central data platform: abstracts data applications from data infra complexities; ensures data quality (completeness, correctness); scalable to support ingestion, processing and reporting in various flavours.
  • 9. The majority of tech challenges have now moved to data infra.
  • 10. Challenges in central infra. Support very different workloads: canned and scheduled; adhoc and interactive; realtime and batch. Conducive for systemic and human consumption. Self serve. Reliability with a variety of workloads at scale is non-trivial.
  • 11. Primitives. Entity and event schemas, versioned. At-least-once semantics: entity data logged as part of the transaction. Facts: realtime and batch. Pipeline: uses Hive, MR, Spark, Storm. Unified query and reporting layer across Hadoop, Vertica and Elasticsearch.
  • 12. Realtime. Higher-level abstraction for stream-to-stream joins. Aggregations at query time. Mutable query store: ES. Pipeline: Storm. Generated streams can be consumed by other systems.
  • 13. Order of scale: terabytes of data generated in a day; billions of raw events in a day; thousands of raw data streams; petabyte of data processed in a day; thousands of Hadoop jobs run in a day; thousands of report views in a day.
  • 14. As adoption grew: further challenges. Proliferation in the number of pipelines and reports, with huge overlap. How to measure the quality of data? How long to store different kinds of data? Forever? Who owns the dataset? Who is responsible for maintaining data freshness? … How do we incentivise the right behaviour?
  • 15. DDF. Distributed Data Frame: an abstraction to represent a fully managed dataset (no relation to the Spark or R data frame). Abstracts the physical representation, for both streaming and batch data. Natively built data quality measures: correctness, completeness, freshness.
  • 16. DDF … Supports schema evolution with versioning. Access control policies. Dependency and lifecycle management policies. Discovery: schema field, quality and usage aware. Backup and restore policies.
  • 18. Flow. A data pipeline that transforms, enriches or aggregates one or more DDFs, producing a DDF: ddf = fn(ddfs). (Diagram labels: flow; Hive, Storm, MR, Spark, …)
  • 19. Further work. Resource management: quotas and limits to become fully self serve and ensure quality of service.
  • 20. Summary. Data as part of the SDLC. Central platform abstracts data stack complexities. Higher-level constructs for stream-to-stream joins. Schema evolution and change management. Data quality is not just schema adherence. Resource management for ensuring reliability and quality of service.
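
Slides 15 and 18 describe a DDF as a managed dataset with built-in quality measures, and a flow as a function from DDFs to a DDF (ddf = fn(ddfs)). The Python sketch below is one possible reading of that idea, not Flipkart's implementation: the `DDF` class, the `flow` helper, the field names and the sample data are all hypothetical, and in-memory lists stand in for data that would really live in Hadoop or a stream.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DDF:
    """Hypothetical stand-in for a managed dataset (not Flipkart's API)."""
    name: str
    schema_version: int
    required_fields: List[str]
    rows: List[Row]
    updated_at: datetime

    def completeness(self) -> float:
        """Fraction of rows in which every required field is non-null."""
        if not self.rows:
            return 0.0
        complete = sum(
            all(row.get(f) is not None for f in self.required_fields)
            for row in self.rows
        )
        return complete / len(self.rows)

    def is_fresh(self, max_age: timedelta, now: datetime) -> bool:
        """True if the dataset was updated within max_age of now."""
        return now - self.updated_at <= max_age


def flow(name: str, required_fields: List[str],
         fn: Callable[..., List[Row]]) -> Callable[..., DDF]:
    """Wrap a row-level function so that it maps DDFs to a DDF."""
    def run(*inputs: DDF) -> DDF:
        return DDF(
            name=name,
            schema_version=1,
            required_fields=required_fields,
            rows=fn(*inputs),
            # Assumption: the output is only as fresh as its stalest input.
            updated_at=min(d.updated_at for d in inputs),
        )
    return run


def orders_by_category(orders: DDF, products: DDF) -> List[Row]:
    """Join orders to products and count orders per category."""
    category = {p["product_id"]: p["category"] for p in products.rows}
    counts: Dict[str, int] = {}
    for order in orders.rows:
        cat = category.get(order["product_id"])
        if cat is not None:
            counts[cat] = counts.get(cat, 0) + 1
    return [{"category": c, "orders": n} for c, n in sorted(counts.items())]


# Arbitrary timestamp and made-up rows, for illustration only.
now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
orders = DDF("orders", 1, ["order_id", "product_id"], [
    {"order_id": 1, "product_id": "p1"},
    {"order_id": 2, "product_id": "p2"},
    {"order_id": 3, "product_id": "p1"},
    {"order_id": 4, "product_id": None},  # incomplete row
], now - timedelta(minutes=5))
products = DDF("products", 1, ["product_id", "category"], [
    {"product_id": "p1", "category": "books"},
    {"product_id": "p2", "category": "phones"},
], now - timedelta(hours=2))

report = flow("orders_by_category", ["category", "orders"],
              orders_by_category)(orders, products)
```

Because the output is itself a `DDF`, the same completeness and freshness checks apply to it, and it can feed another flow. That composability is the point of the ddf = fn(ddfs) formulation; the execution engine (Hive, Storm, MR, Spark) is deliberately left out of this sketch.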