Cloud-based, low-cost, low-maintenance, scalable data platform
Apoorva Gaurav
Why hunt an elephant to sell shoes?
WHOM
HOW
WHAT
Use case: List products based on CTR (click-through rate)
● Take all impressions of a product and the action performed on each
● Some products are more attractive than others
● Give such products a benefit
Use case: List products based on CTR
● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc
● >100K products, >500 million impressions a day: DIFFICULT TO SCALE
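The CTR query above can be tried on a toy table. What follows is a minimal sketch, assuming one row per impression with `appeared` and `clicked` stored as 0/1 integers; the sample rows and the function name `ctr_by_product` are made up for illustration, and SQLite stands in for whatever database the slides had in mind.

```python
import sqlite3

# Hypothetical toy rows: (product_id, appeared, clicked), one row per impression.
ROWS = [
    ("shoe-1", 1, 1),
    ("shoe-1", 1, 0),
    ("shoe-2", 1, 0),
    ("shoe-2", 1, 0),
    ("shoe-2", 1, 1),
    ("shoe-2", 1, 0),
]


def ctr_by_product(rows):
    """Return [(product_id, ctr)] sorted by CTR, highest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table tbl_prod_log "
        "(product_id text, appeared integer, clicked integer)"
    )
    conn.executemany("insert into tbl_prod_log values (?, ?, ?)", rows)
    # Multiplying by 1.0 forces floating-point division; with two integer
    # sums SQLite would truncate the ratio to 0.
    result = conn.execute(
        "select product_id, 1.0 * sum(clicked) / sum(appeared) as ctr "
        "from tbl_prod_log group by product_id order by ctr desc"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for product_id, ctr in ctr_by_product(ROWS):
        print(product_id, ctr)
```

A single-node query like this is fine for a handful of rows; the slide's point is that the same aggregation over hundreds of millions of daily impressions is what becomes hard to scale.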
Use case: User segmentation
● Different users have different browsing patterns
● Segment them based on their history
● Provide them with different experiences
Use case: User segmentation
● select depth, count(cookie_id) from user_log group by depth
● >1m users daily, multiple browsers, devices
● DIFFICULT TO SCALE
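The segmentation query amounts to a histogram of cookies per browsing depth. Below is a minimal in-memory sketch of that idea; it assumes `depth` means the number of pages a visitor browsed, and the sample log, the `segment` helper and its cut-offs are all hypothetical.

```python
from collections import Counter

# Hypothetical sample of user_log rows as (cookie_id, depth) pairs.
# "depth" is assumed to mean how many pages a visitor browsed.
USER_LOG = [("c1", 1), ("c2", 3), ("c3", 1), ("c4", 7), ("c5", 3), ("c6", 1)]


def users_per_depth(user_log):
    """Count rows per depth, like count(cookie_id) ... group by depth."""
    return dict(Counter(depth for _cookie_id, depth in user_log))


def segment(depth, shallow_max=2, deep_min=6):
    """Map a depth to a segment name; the cut-offs are illustrative only."""
    if depth <= shallow_max:
        return "shallow"
    if depth >= deep_min:
        return "deep"
    return "medium"


if __name__ == "__main__":
    print(users_per_depth(USER_LOG))
    print({cookie: segment(depth) for cookie, depth in USER_LOG})
```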
Use case: Recommend similar products
● Compute a score for each product based on various attributes
● Compute a score for each user based on the products (s)he browses
● Recommend similar products
Use case: Recommend similar products
● select id, (w1*att1 + w2*att2 + ... + wN*attN) as score from products
● select userid, (v1*score1 + v2*score2 + ... + vN*scoreN)
● >1m users, >100K products: DIFFICULT TO COMPUTE
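The two scoring formulas are weighted sums, so they can be sketched in a few lines. The example below is only an illustration under stated assumptions: the attribute values, the weights, and the rule "recommend the unbrowsed products whose score is closest to the user's score" are hypothetical and not taken from the slides.

```python
def weighted_score(weights, values):
    """Compute w1*x1 + w2*x2 + ... + wN*xN."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values))


def recommend(user_score, product_scores, browsed, k=2):
    """Return up to k unbrowsed product ids with scores closest to the user's."""
    candidates = [pid for pid in product_scores if pid not in browsed]
    candidates.sort(key=lambda pid: abs(product_scores[pid] - user_score))
    return candidates[:k]


# Hypothetical attribute weights (w1..wN) and per-product attributes.
ATTRIBUTE_WEIGHTS = [3, 2, 1]
PRODUCTS = {"p1": [1, 0, 2], "p2": [2, 1, 0], "p3": [0, 3, 1], "p4": [4, 4, 4]}

PRODUCT_SCORES = {
    pid: weighted_score(ATTRIBUTE_WEIGHTS, attrs) for pid, attrs in PRODUCTS.items()
}

# A user who browsed p2 and p3, weighted equally (v1 = v2 = 0.5).
BROWSED = ["p2", "p3"]
USER_SCORE = weighted_score([0.5, 0.5], [PRODUCT_SCORES[pid] for pid in BROWSED])

if __name__ == "__main__":
    print(PRODUCT_SCORES)
    print(USER_SCORE, recommend(USER_SCORE, PRODUCT_SCORES, BROWSED))
```

Scoring every user against every product grows with users times products, which is the computational cost the slide calls out.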
Constraints
● Fast paced
● Tangible results
● Limited budget
● Low engineering bandwidth
Design goals
● Solution should be able to scale up and down
● Record data now, ask questions later
● Generic data model
● Segregate reads from writes
● Low running cost
● Low maintenance overhead
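One way to read "record data now, ask questions later" together with "generic data model" is an event record with a fixed envelope and free-form attributes. The sketch below assumes that reading; the field names `event`, `ts` and `attributes` and the JSON-lines encoding are hypothetical choices, not something the slides specify.

```python
import json


def make_event(name, timestamp, **attributes):
    """Build a generic event: fixed envelope plus arbitrary key/value attributes."""
    return {"event": name, "ts": timestamp, "attributes": attributes}


def to_log_line(event):
    """Serialize an event as one JSON line, suitable for appending to a flat file."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    line = to_log_line(make_event("addToCart", 1371216600, product_id="p1", qty=2))
    print(line)
```

Because the attributes are not fixed in advance, new questions can be asked of old events without a schema migration, at the price of validating fields at read time.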
Cloud computing
Pros
● No setup cost
● Pay as you use
● Scaling is a breeze
● Managed services
Cons
● Performance
● Reliability
● Data security
● Control
A very basic Big Data system
Highly available
Very low latency
Initial filtering
Storage agnostic
Scale up and down easily
Essentially distributed
Very easy to use
Highly reliable
Huge capacity
Cater to any data model
Cheap
Architecture Diagram (callouts from the figure)
● Hadoop on cloud: easy to scale up and down, pay as you use
● Flat file storage: infinite capacity, 11 nines of durability, cheap
● Persistent distributed queue: 100K msg/sec, events can be played back
● Highly concurrent server: very easy to use, flexible, much easier to introduce HA, reliability etc.
● Both server and client side data
● Segregate and upload events to S3: scales horizontally
● Distributed config mgmt: fault tolerant
Some numbers
● ~20 million events logged daily
● Corresponds to ~800 million data points and ~25GB
● Close to 100 jobs a day
● The biggest job has a footprint of ~2 billion events
● Platform costs ~$20 daily; jobs ~$15 daily
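A few derived figures follow from these numbers by plain division. The sketch below only does that arithmetic; it assumes decimal gigabytes and treats the approximate figures from the slide as exact, so the results are rough orders of magnitude.

```python
# Approximate figures from the slide, treated as exact for the arithmetic.
EVENTS_PER_DAY = 20_000_000
DATA_POINTS_PER_DAY = 800_000_000
BYTES_PER_DAY = 25 * 10**9  # assumption: 1 GB = 10^9 bytes
PLATFORM_COST_USD = 20
JOBS_COST_USD = 15

data_points_per_event = DATA_POINTS_PER_DAY / EVENTS_PER_DAY
bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY
total_cost_usd = PLATFORM_COST_USD + JOBS_COST_USD
cost_per_million_events_usd = total_cost_usd / (EVENTS_PER_DAY / 1_000_000)

if __name__ == "__main__":
    print(f"data points per event:   {data_points_per_event:.0f}")
    print(f"bytes per event:         {bytes_per_event:.0f}")
    print(f"total daily cost (USD):  {total_cost_usd}")
    print(f"USD per million events:  {cost_per_million_events_usd:.2f}")
```

That works out to about 40 data points and roughly 1.25 KB per event, at about $35 a day overall.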
Some key learnings
● One can code in English (Finagle)
  myService = handleExceptions andThen recordInKafka andThen respond
● Need not be in C or Erlang to be performant (Kafka)
● Can search without an index
  s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
  s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
● Use spot EMR clusters efficiently
● m1.small instances are not small
● awk + grep = awesome
● Apache mailing lists SUCK!!!
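The S3 paths in the learnings encode the event name and time in the key itself, so a time-range lookup reduces to generating prefixes. Here is a minimal sketch of that layout; the bucket name `my-bucket`, the helper names and the 30-minute step are assumptions for illustration, and nothing here talks to S3.

```python
from datetime import datetime, timedelta


def event_prefix(bucket, event, ts):
    """Build the time-partitioned prefix for one event type and timestamp."""
    return (
        f"s3://{bucket}/{event}"
        f"/y={ts:%Y}/m={ts:%m}/d={ts:%d}/h={ts:%H}/min={ts:%M}"
    )


def prefixes_between(bucket, event, start, end, step=timedelta(minutes=30)):
    """List the prefixes covering [start, end), one per step (assumed 30 min)."""
    prefixes = []
    ts = start
    while ts < end:
        prefixes.append(event_prefix(bucket, event, ts))
        ts += step
    return prefixes


if __name__ == "__main__":
    start = datetime(2013, 6, 14, 13, 0)
    end = datetime(2013, 6, 14, 14, 0)
    for prefix in prefixes_between("my-bucket", "addToCart", start, end):
        print(prefix)
```

A job that needs one hour of `addToCart` events reads only the matching prefixes, which is what makes searching without an index workable.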
Thank you!!
& we are hiring
apoorva.gaurav@myntra.com

Myntra.com's Big Data Platform
