SlideShare a Scribd company logo
1 of 30
Download to read offline
August 8, 2012




Cassandra at eBay
    Time left: 29m 59s




                     Jay Patel
                     Architect, Platform Systems
                     @pateljay3001
eBay Marketplaces
 97 million active buyers and sellers
 200+ million items
 2 billion page views each day
 80 billion database calls each day
 5+ petabytes of site storage capacity
 80+ petabytes of analytics storage capacity

                                                2
How do we scale databases?
 Shard
   – Patterns: Modulus, lookup-based, range, etc.
   – Application sees only logical shard/database
 Replicate
   – Disaster recovery, read availability/scalability
 Big NOs
   – No transactions
   – No joins
   – No referential integrity constraints
                                                        3
We like Cassandra
 Multi-datacenter (active-active)    Write performance
 Availability - No SPOF              Distributed counters
 Scalability                         Hadoop support



We also utilize MongoDB & HBase




                                                              4
Are we replacing RDBMS with NoSQL?

          Not at all! But, complementing.
 Some use cases don’t fit well - sparse data, big data, schema
  optional, real-time analytics, …
 Many use cases don’t need top-tier set-ups - logging, tracking, …




                                                                  5
A glimpse on our Cassandra deployment
 Dozens of nodes across multiple clusters
 200 TB+ storage provisioned
 400M+ writes & 100M+ reads per day, and growing
 QA, LnP, and multiple Production clusters




                                                    6
Use Cases on Cassandra
      Social Signals on eBay product & item pages
      Hunch taste graph for eBay users & items
      Time series use cases (many):
     Mobile notification logging and tracking
     Tracking for fraud detection
     SOA request/response payload logging
     RedLaser server logs and analytics

                                                    7
Served by
Cassandra




            8
Manage signals via “Your Favorites”




                                      Whole page is
                                      served by
                                      Cassandra




                                                9
Why Cassandra for Social Signals?
 Need scalable counters
 Need real (or near) time analytics on collected social data
 Need good write performance
 Reads are not latency sensitive




                                                                10
Deployment
                 User request has no datacenter affinity


                           Non-sticky load balancing




Topology - NTS           Data is backed up periodically
RF - 2:2                 to protect against human or
Read CL - ONE            software error
Write CL – ONE

                                                       11
Data Model
             depends on query patterns




                                         12
Data Model (simplified)




                          13
Wait…



                    Duplicates!




        Oh, toggle button!
        Signal --> De-signal --> Signal…
                                       14
Yes, eventual consistency!
One scenario that produces duplicate signals in UserLike CF:
   1. Signal
   2. De-signal (1st operation is not propagated to all replica)
   3. Signal, again (1st operation is not propagated yet!)



 So, what’s the solution? Later…

                                                                   15
Social Signals, next phase: Real-time Analytics
 Most signaled or popular items per affinity groups (category, etc.)
 Aggregated item count per affinity group



                                                     Example affinity group




                                                                              16
Initial Data Model for real-time analytics

                                               Items in an affinitygroup
                                               is physically stored
                                               sorted by their signal
                                               count




                           Update counters for both individual item
                           and all the affinity groups that item
                           belongs to
Deployment, next phase




Topology - NTS
RF - 2:2:2
user1       bid
                                  item1
        buy

item2         watch               sell
                        user2




                                          19
Graph in Cassandra
Event consumers listen for site events (sell/bid/buy/watch) & populate graph in Cassandra




   30 million+ writes daily                Batch-oriented reads
   14 billion+ edges already                (for taste vector updates)
                                                                                    20
 Mobile notification logging and tracking
 Tracking for fraud detection
 SOA request/response payload logging
 RedLaser server logs and analytics




                                             21
A glimpse on Data Model
RedLaser tracking & monitoring console




                                         23
That’s all about the use cases..
Remember the duplicate problem in Use Case #1?




  Let’s see some options we considered to solve this…
                                                    24
Option 1 – Make ‘Like’ idempotent for UserLike
 Remove time (timeuuid) from the composite column name:
    Multiple signal operations are now Idempotent
    No need to read before de-signaling (deleting)




    X            Need timeuuid for ordering!
                 Already have a user with more than 1300 signals   25
Option 2 – Use strong consistency

 Local Quorum
  – Won’t help us. User requests are not geo-load balanced
    (no DC affinity).
 Quorum
  – Won’t survive during partition between DCs (or, one of the
    DC is down). Also, adds additional latency.

              X      Need to survive!
                                                             26
Option 3 – Adapt to eventual consistency
If desire survival!




                                                                              27
                      http://www.strangecosmos.com/content/item/101254.html
Adjustments to eventual consistency
 De-signal steps:
      – Don’t check whether item is already signaled by a user, or not
      – Read all (duplicate) signals from UserLike_unordered (new CF to avoid reading
        whole row from UserLike)
      – Delete those signals from UserLike_unordered and UserLike




Still, can get duplicate signals or false positives as there is a ‘read before delete’.
To shield further, do ‘repair on read’.                  Not a full story!
                                                                                     28
Lessons & Best Practices
• Choose proper Replication Factor and Consistency Level.
    – They alter latency, availability, durability, consistency and cost.
    – Cassandra supports tunable consistency, but remember strong consistency is not free.
• Consider all overheads in capacity planning.
    – Replicas, compaction, secondary indexes, etc.
• De-normalize and duplicate for read performance.
    – But don’t de-normalize if you don’t need to.
• Many ways to model data in Cassandra.
    – The best way depends on your use case and query patterns.
                More on http://ebaytechblog.com?p=1308
Thank You
  @pateljay3001
  #cassandra12
                  30

More Related Content

What's hot

Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthdaveconnors
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to CassandraGokhan Atil
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debeziumKasun Don
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBaseHBaseCon
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 

What's hot (20)

Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debezium
 
kafka
kafkakafka
kafka
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 

Viewers also liked

Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandraNguyen Quang
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache CassandraRobert Stupp
 
Solr & Cassandra: Searching Cassandra with DataStax Enterprise
Solr & Cassandra: Searching Cassandra with DataStax EnterpriseSolr & Cassandra: Searching Cassandra with DataStax Enterprise
Solr & Cassandra: Searching Cassandra with DataStax EnterpriseDataStax Academy
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraAdrian Cockcroft
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query languageCourtney Robinson
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...DataStax Academy
 
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...DataStax
 

Viewers also liked (7)

Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
 
Introduction to Apache Cassandra
Introduction to Apache CassandraIntroduction to Apache Cassandra
Introduction to Apache Cassandra
 
Solr & Cassandra: Searching Cassandra with DataStax Enterprise
Solr & Cassandra: Searching Cassandra with DataStax EnterpriseSolr & Cassandra: Searching Cassandra with DataStax Enterprise
Solr & Cassandra: Searching Cassandra with DataStax Enterprise
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query language
 
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
Apache Cassandra and DataStax Enterprise Explained with Peter Halliday at Wil...
 
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
Replication and Consistency in Cassandra... What Does it All Mean? (Christoph...
 

Similar to Cassandra at eBay - Cassandra Summit 2012

Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...SL Corporation
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsPanagiotis Papadopoulos
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Trayan Iliev
 
Real World Cassandra
Real World CassandraReal World Cassandra
Real World CassandraGiltTech
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scalexcbsmith
 
How to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastHow to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastMapR Technologies
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationShanley Kane
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...DataStax Academy
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists jlacefie
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Apache Cassandra overview
Apache Cassandra overviewApache Cassandra overview
Apache Cassandra overviewElifTech
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Michael Hiskey
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Stavros Kontopoulos
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...netvis
 
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011IndicThreads
 
High-speed, Reactive Microservices 2017
High-speed, Reactive Microservices 2017High-speed, Reactive Microservices 2017
High-speed, Reactive Microservices 2017Rick Hightower
 

Similar to Cassandra at eBay - Cassandra Summit 2012 (20)

Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Cassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and LimitationsCassandra Consistency: Tradeoffs and Limitations
Cassandra Consistency: Tradeoffs and Limitations
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
 
Real World Cassandra
Real World CassandraReal World Cassandra
Real World Cassandra
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
 
How to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastHow to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and Fast
 
Dynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 PresentationDynamo Systems - QCon SF 2012 Presentation
Dynamo Systems - QCon SF 2012 Presentation
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Apache Cassandra overview
Apache Cassandra overviewApache Cassandra overview
Apache Cassandra overview
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
 
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011Monitoring applications on cloud - Indicthreads cloud computing conference 2011
Monitoring applications on cloud - Indicthreads cloud computing conference 2011
 
High-speed, Reactive Microservices 2017
High-speed, Reactive Microservices 2017High-speed, Reactive Microservices 2017
High-speed, Reactive Microservices 2017
 

Recently uploaded

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Cassandra at eBay - Cassandra Summit 2012

  • 1. August 8, 2012 Cassandra at eBay Time left: 29m 59s Jay Patel Architect, Platform Systems @pateljay3001
  • 2. eBay Marketplaces  97 million active buyers and sellers  200+ million items  2 billion page views each day  80 billion database calls each day  5+ petabytes of site storage capacity  80+ petabytes of analytics storage capacity 2
  • 3. How do we scale databases?  Shard – Patterns: Modulus, lookup-based, range, etc. – Application sees only logical shard/database  Replicate – Disaster recovery, read availability/scalability  Big NOs – No transactions – No joins – No referential integrity constraints 3
  • 4. We like Cassandra  Multi-datacenter (active-active)  Write performance  Availability - No SPOF  Distributed counters  Scalability  Hadoop support We also utilize MongoDB & HBase 4
  • 5. Are we replacing RDBMS with NoSQL? Not at all! But, complementing.  Some use cases don’t fit well - sparse data, big data, schema optional, real-time analytics, …  Many use cases don’t need top-tier set-ups - logging, tracking, … 5
  • 6. A glimpse on our Cassandra deployment  Dozens of nodes across multiple clusters  200 TB+ storage provisioned  400M+ writes & 100M+ reads per day, and growing  QA, LnP, and multiple Production clusters 6
  • 7. Use Cases on Cassandra Social Signals on eBay product & item pages Hunch taste graph for eBay users & items Time series use cases (many):  Mobile notification logging and tracking  Tracking for fraud detection  SOA request/response payload logging  RedLaser server logs and analytics 7
  • 9. Manage signals via “Your Favorites” Whole page is served by Cassandra 9
  • 10. Why Cassandra for Social Signals?  Need scalable counters  Need real (or near) time analytics on collected social data  Need good write performance  Reads are not latency sensitive 10
  • 11. Deployment User request has no datacenter affinity Non-sticky load balancing Topology - NTS Data is backed up periodically RF - 2:2 to protect against human or Read CL - ONE software error Write CL – ONE 11
  • 12. Data Model depends on query patterns 12
  • 14. Wait… Duplicates! Oh, toggle button! Signal --> De-signal --> Signal… 14
  • 15. Yes, eventual consistency! One scenario that produces duplicate signals in UserLike CF: 1. Signal 2. De-signal (1st operation is not propagated to all replica) 3. Signal, again (1st operation is not propagated yet!) So, what’s the solution? Later… 15
  • 16. Social Signals, next phase: Real-time Analytics  Most signaled or popular items per affinity groups (category, etc.)  Aggregated item count per affinity group Example affinity group 16
  • 17. Initial Data Model for real-time analytics Items in an affinitygroup is physically stored sorted by their signal count Update counters for both individual item and all the affinity groups that item belongs to
  • 19. user1 bid item1 buy item2 watch sell user2 19
  • 20. Graph in Cassandra Event consumers listen for site events (sell/bid/buy/watch) & populate graph in Cassandra  30 million+ writes daily  Batch-oriented reads  14 billion+ edges already (for taste vector updates) 20
  • 21.  Mobile notification logging and tracking  Tracking for fraud detection  SOA request/response payload logging  RedLaser server logs and analytics 21
  • 22. A glimpse on Data Model
  • 23. RedLaser tracking & monitoring console 23
  • 24. That’s all about the use cases.. Remember the duplicate problem in Use Case #1? Let’s see some options we considered to solve this… 24
  • 25. Option 1 – Make ‘Like’ idempotent for UserLike  Remove time (timeuuid) from the composite column name:  Multiple signal operations are now Idempotent  No need to read before de-signaling (deleting) X Need timeuuid for ordering! Already have a user with more than 1300 signals 25
  • 26. Option 2 – Use strong consistency  Local Quorum – Won’t help us. User requests are not geo-load balanced (no DC affinity).  Quorum – Won’t survive during partition between DCs (or, one of the DC is down). Also, adds additional latency. X Need to survive! 26
  • 27. Option 3 – Adapt to eventual consistency If desire survival! 27 http://www.strangecosmos.com/content/item/101254.html
  • 28. Adjustments to eventual consistency De-signal steps: – Don’t check whether item is already signaled by a user, or not – Read all (duplicate) signals from UserLike_unordered (new CF to avoid reading whole row from UserLike) – Delete those signals from UserLike_unordered and UserLike Still, can get duplicate signals or false positives as there is a ‘read before delete’. To shield further, do ‘repair on read’. Not a full story! 28
  • 29. Lessons & Best Practices • Choose proper Replication Factor and Consistency Level. – They alter latency, availability, durability, consistency and cost. – Cassandra supports tunable consistency, but remember strong consistency is not free. • Consider all overheads in capacity planning. – Replicas, compaction, secondary indexes, etc. • De-normalize and duplicate for read performance. – But don’t de-normalize if you don’t need to. • Many ways to model data in Cassandra. – The best way depends on your use case and query patterns. More on http://ebaytechblog.com?p=1308
  • 30. Thank You @pateljay3001 #cassandra12 30