3. Reporting
   Sales - e.g. orders by product, by geo
   Logistics - 90th percentile delivery times (sketch below)
Learning
   User affinity to a product or price
   Demand forecast of products by geo
Realtime Operational Reporting
   Congestion points in the supply chain
   Demand shaping with product offers and serviceability
Adhoc Analysis
   Find causes of returns in a category
Data Applications
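As an aside on the logistics metric above: a minimal Java sketch of a nearest-rank 90th-percentile computation over delivery durations. The class, method and sample data are illustrative only, not part of any Flipkart system.

    import java.util.Arrays;

    public class P90 {
        // Nearest-rank percentile: sort, then pick the ceil(p% * n)-th value.
        static double percentile(double[] values, double p) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[rank - 1];
        }

        public static void main(String[] args) {
            // Hypothetical delivery durations in hours.
            double[] deliveryHours = {12, 18, 24, 24, 30, 36, 48, 50, 72, 96};
            System.out.println(percentile(deliveryHours, 90));  // 72.0
        }
    }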
5. Challenges with the traditional approach
   Data preparation is a huge task
   OLTP models are not conducive to analysis
   Challenges as we grow:
      Data gets siloed across multiple systems
      Processing large volumes of data, and in realtime
7. Flipkart Data approach
   Standardised data definitions
      All data backed by a strong schema defined before ingestion (sketch below)
      Instrumentation & ingestion as part of the SDLC
   Central data platform
      Abstracts data applications from data-infra complexities
      Ensures data quality - completeness, correctness
      Scalable to support ingestion, processing and reporting in various flavours
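The deck does not name a serialization format, but the schema-before-ingestion idea can be sketched in Java assuming Avro; the OrderCreated schema and its fields are hypothetical. The point is that an event failing validation never flows downstream.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class SchemaFirstIngestion {
        // Hypothetical event schema, declared before any data is ingested.
        private static final String ORDER_EVENT_SCHEMA = "{"
            + "\"type\": \"record\", \"name\": \"OrderCreated\","
            + "\"fields\": ["
            + "  {\"name\": \"order_id\", \"type\": \"string\"},"
            + "  {\"name\": \"amount\",   \"type\": \"double\"},"
            + "  {\"name\": \"geo\",      \"type\": \"string\"}"
            + "]}";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(ORDER_EVENT_SCHEMA);

            // Build an event against the schema; a record that does not
            // conform would fail validation rather than silently flow on.
            GenericRecord event = new GenericData.Record(schema);
            event.put("order_id", "OD123");
            event.put("amount", 499.0);
            event.put("geo", "IN-KA");

            System.out.println(GenericData.get().validate(schema, event));  // true
        }
    }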
10. Challenges in central infra
   Support for very different workloads:
      Canned & scheduled
      Adhoc & interactive
      Realtime and batch
      Conducive to both systemic and human consumption
      Self-serve
   Reliability across this variety of workloads at scale is non-trivial
11. Primitives
   Entity and event schemas - versioned
   At-least-once semantics - entity data logged as part of the transaction (sketch below)
   Facts - realtime and batch
   Pipeline - uses Hive, MR, Spark, Storm
   Unified query and reporting layer across Hadoop, Vertica and Elasticsearch
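At-least-once delivery implies the same entity event may arrive more than once, so downstream consumers deduplicate on a stable event id. A minimal Java sketch with an in-memory seen-set (a real pipeline would persist this state); EntityEvent and its fields are hypothetical names.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class IdempotentConsumer {
        record EntityEvent(String eventId, String payload) {}

        private final Set<String> seen = ConcurrentHashMap.newKeySet();

        public void consume(EntityEvent event) {
            // add() returns false if this id was already processed - skip replays.
            if (!seen.add(event.eventId())) {
                return;
            }
            apply(event);
        }

        private void apply(EntityEvent event) {
            System.out.println("applied " + event.eventId() + ": " + event.payload());
        }
    }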
12. Realtime
   Higher-level abstraction for stream-to-stream joins (sketch below)
   Aggregations at query time
   Mutable query store - ES
   Pipeline - Storm
   Generated streams can be consumed by other systems
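The slides do not show the join primitive itself; here is a minimal Java sketch of the underlying idea: buffer events from each stream by key inside a time window and emit a joined tuple whenever both sides match. All names are illustrative; the actual primitive runs on Storm and is not shown here.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    public class WindowedStreamJoin {
        static final long WINDOW_MS = 60_000;  // join window

        record Event(String key, String value, long ts) {}
        record Joined(String key, String left, String right) {}

        private final Map<String, Deque<Event>> leftBuf = new HashMap<>();
        private final Map<String, Deque<Event>> rightBuf = new HashMap<>();

        public void onLeft(Event e)  { join(e, leftBuf, rightBuf, true);  }
        public void onRight(Event e) { join(e, rightBuf, leftBuf, false); }

        private void join(Event e, Map<String, Deque<Event>> own,
                          Map<String, Deque<Event>> other, boolean eIsLeft) {
            // Buffer the new event, evicting anything older than the window.
            Deque<Event> mine = own.computeIfAbsent(e.key(), k -> new ArrayDeque<>());
            mine.add(e);
            mine.removeIf(m -> e.ts() - m.ts() > WINDOW_MS);

            // Match against the other stream's in-window events for this key.
            Deque<Event> matches = other.getOrDefault(e.key(), new ArrayDeque<>());
            matches.removeIf(m -> e.ts() - m.ts() > WINDOW_MS);
            for (Event m : matches) {
                emit(eIsLeft ? new Joined(e.key(), e.value(), m.value())
                             : new Joined(e.key(), m.value(), e.value()));
            }
        }

        // Emitted joins would feed other systems; printing stands in for that.
        private void emit(Joined j) { System.out.println(j); }
    }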
13. Order of Scale
   - Terabytes of data generated in a day
   - Billions of raw events in a day
   - Thousands of raw data streams
   - A petabyte of data processed in a day
   - Thousands of Hadoop jobs run in a day
   - Thousands of report views in a day
14. As adoption grew: further challenges
   Proliferation in the number of pipelines and reports, with huge overlap
   How do we measure the quality of data?
   How long do we store different kinds of data? Forever?
   Who owns a dataset? Who is responsible for maintaining data freshness? …
   How do we incentivise the right behaviour?
15. DDF
   Distributed Data Frame - an abstraction representing a fully managed dataset (no relation to a Spark or R dataframe); interface sketch below
   Abstracts the physical representation - both streaming and batch data
   Natively built data quality measures:
      Correctness
      Completeness
      Freshness
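To make the abstraction concrete, a sketch of what a DDF-style contract could look like in Java: one handle that hides whether the dataset is stream- or batch-backed and exposes the three quality measures as first-class reads. The interface and method names are illustrative, not Flipkart's actual API.

    import java.time.Duration;
    import java.util.Iterator;

    public interface DistributedDataFrame<T> {
        String name();
        int schemaVersion();

        // Physical representation is abstracted: the same dataset can be
        // read as a bounded batch or tailed as an unbounded stream.
        Iterator<T> batchRead(String partition);
        void subscribe(java.util.function.Consumer<T> handler);

        // Natively built data quality measures.
        double correctness();   // fraction of records passing schema/rule checks
        double completeness();  // fraction of expected records that arrived
        Duration freshness();   // lag between event time and availability
    }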
16. DDF …
   Supports schema evolution with versioning (sketch below)
   Access control policies
   Dependency and lifecycle management policies
   Discovery - schema-field, quality and usage aware
   Backup and restore policies
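Schema evolution with versioning can be illustrated with Avro's schema resolution (Avro itself is an assumption here, as earlier): a record written with schema v1 stays readable under v2 because the new field carries a default. The Order schema and its fields are hypothetical.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    public class SchemaEvolutionDemo {
        static final String V1 = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"order_id\",\"type\":\"string\"}]}";
        // v2 adds a field with a default, so v1 data stays readable.
        static final String V2 = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
            + "{\"name\":\"order_id\",\"type\":\"string\"},"
            + "{\"name\":\"channel\",\"type\":\"string\",\"default\":\"web\"}]}";

        public static void main(String[] args) throws Exception {
            Schema writer = new Schema.Parser().parse(V1);
            Schema reader = new Schema.Parser().parse(V2);

            // Write a record with the old (v1) schema.
            GenericRecord rec = new GenericData.Record(writer);
            rec.put("order_id", "OD123");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
            enc.flush();

            // Read it back with the new (v2) schema: schema resolution
            // fills in the defaulted field, so old data keeps working.
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord upgraded =
                new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
            System.out.println(upgraded);  // {"order_id": "OD123", "channel": "web"}
        }
    }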
20. Summary
   Data as part of the SDLC
   Central platform abstracts data-stack complexities
   Higher-level constructs for stream-to-stream joins
   Schema evolution and change management
   Data quality is not just schema adherence
   Resource management to ensure reliability and quality of service