SlideShare a Scribd company logo
1 of 32
Download to read offline
MongoDB to Cassandra
                         The Atlas Odyssey




Fred van den Driessche     Tom McAdam        Adam Horwich
       Engineer                CTO           Systems Engineer
       @fredvdd                @tfm          @Mmmkayness
http://flickr.com/photos/dhammza/88644497/
Our platform - late 2012

                                                                         tbc                   tbc




                             MetaBroadcast platform



Video and audio metadata       Profiles and activity from video and
    from 20+ sources                                                 Analytic requests and groupings
                                 audio products, social networks
?
Main clients                   Main Partners




               Data Partners
What is Atlas?
                           /content
BBC
                          /schedules

                            /topics
 PA

              ATLAS

 C4
                           sitemaps

                          radioplayer
etc...         DB
                          interlinking
DEMO
Atlas Data Model

brand                      item




series                    version




              broadcast             location
MongoDB


• flexible

• features

• really simple

• shell
Where MongoDB falls short


• too simple

• lack of control

• sharding

• embedding
Where to?
Where to?




•   add a cache?
Atlas API
•       content

    •     http://atlas.metabroadcast.com/3.0/content.json?uri=http://www.bbc.co.uk/programmes/
          b0074g7p&annotations=description,brand_summary,locations&apiKey=6ed2a984627daff816198acde82

    •     http://atlas.metabroadcast.com/3.0/content.json?apiKey=aaaa&uri=http://www.bbc.co.uk/programmes/
          b0074g7p&annotations=description,brand_summary,locations

•       schedules

    •     http://atlas.metabroadcast.com/3.0/schedule.json?from=now&to=now.plus.
          3h&channel=bbcone&publisher=bbc.co.uk

    •     http://atlas.metabroadcast.com/3.0/schedule.json?
          from=1948-12-24&to=1948-12-25&channel=radio4&publisher=bbc.co.uk

•       api explorer http://atlas.metabroadcast.com/#apiExplorer
Atlas API
•       content

    •     http://atlas.metabroadcast.com/3.0/content.json?uri=http://www.bbc.co.uk/programmes/
          b0074g7p&annotations=description,brand_summary,locations&apiKey=6ed2a984627daff816198acde82

    •     http://atlas.metabroadcast.com/3.0/content.json?apiKey=aaaa&uri=http://www.bbc.co.uk/programmes/
          b0074g7p&annotations=description,brand_summary,locations

•       schedules

    •     http://atlas.metabroadcast.com/3.0/schedule.json?from=now&to=now.plus.
          3h&channel=bbcone&publisher=bbc.co.uk

    •     http://atlas.metabroadcast.com/3.0/schedule.json?
          from=1948-12-24&to=1948-12-25&channel=radio4&publisher=bbc.co.uk

•       api explorer http://atlas.metabroadcast.com/#apiExplorer
Why Cassandra?


•scalability/performance

• row caches

• consistency control

• column-based model matches our use case
And?



• ElasticSearch

• messaging

• tooling: bootstraps
What is Atlas?
BBC
         Data ingest
           server             DB
 PA



 C4
                 Update bus        HTTP server

etc...




                              ES
Data model
•   columns to model annotations




•   secondary indexes
    •   index.direct(keyspace, SEGMENT_URI_INDEX_CF, ConsistencyLevel.CL_QUORUM).


            from(segment.getCanonicalUri()).
            to(segment.getIdentifier()).
            index().execute(requestTimeout, TimeUnit.MILLISECONDS);
ID generation
• give external data our own ID on ingest

• needs to be user-friendly:
  http://www.radiotimes.com/programme/cf2/eastenders

• mongo: findAndModify()

• solution: uses Astyanax client with its distributed locking

• more details: http://metabroadcast.com/blog/let-
  cassandra-identify-your-data
Where we’re at



• already live with some data

• alpha release of schedule endpoint coming soon

• later: roll out across other endpoints
Ops
Ops in Cassandra

•   we love Puppet
•    it’s great for automation and deployment

•    MongoDB: 1 file

•    Cassandra: 2 files!




•   oh... tokens
Cassandra Tokens

•   define where data is written to
    in a cluster

•   therefore balanced tokens =
    balanced cluster

•   tokens should be rack aware
•    tools available to provide appropriate tokens
     for you
Cassandra plays nicely with AWS

•   datacentre / rack aware
•    AWS Region = Datacentre

•    AWS Availability Zone = Rack


•   only recently introduced in MongoDB but simple to
    implement in Cassandra

•   horizontally (and vertically) scalable
Monitoring

•   Nagios is a little threadbare for Cassandra
•    basic TCP service check

•    stats from API not very helpful


•   nodetool and CLI tools useful
•    manual effort to integrate them


•   if only there was some useful service...
OpsCenter

•   wonderful for an overview
•    not so much for alerting ;)




•   ohai API
•    can integrate metrics into Nagios
Disaster Recovery

•   we operate a 4 node cluster presently
 •   replication factor of 3 with quorum read/writes


•   DR complicated by tokens

• cluster should be balanced

• snapshot + S3 Backups
Cluster Happiness and Headaches

•   little maintenance overhead

• cluster rebalancing
 •   uncommon maintenance procedure




•   schema changes are cumbersome
 •   little scope for rollback, can put cluster in unrecoverable state
Summary



• Mongo is good, Atlas has outgrown it

• Cassandra isn’t a drop-in replacement

• Ops more complex but so far so good
Questions?

More Related Content

What's hot

Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit
 
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...Data Con LA
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...Databricks
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKai Wähner
 
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQL
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQLRethinking Stream Processing with Apache Kafka, Kafka Streams and KSQL
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQLKai Wähner
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Databricks
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Graph Processing with Titan and Scylla
Graph Processing with Titan and ScyllaGraph Processing with Titan and Scylla
Graph Processing with Titan and ScyllaJason Plurad
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Spark Summit
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made EasyConfluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made EasyKairo Tavares
 
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...confluent
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Khai Tran
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit
 

What's hot (20)

Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka
 
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQL
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQLRethinking Stream Processing with Apache Kafka, Kafka Streams and KSQL
Rethinking Stream Processing with Apache Kafka, Kafka Streams and KSQL
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Graph Processing with Titan and Scylla
Graph Processing with Titan and ScyllaGraph Processing with Titan and Scylla
Graph Processing with Titan and Scylla
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made EasyConfluent Kafka and KSQL: Streaming Data Pipelines Made Easy
Confluent Kafka and KSQL: Streaming Data Pipelines Made Easy
 
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
 
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin PihonySpark Summit EU talk by Stavros kontopoulos and Justin Pihony
Spark Summit EU talk by Stavros kontopoulos and Justin Pihony
 
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
Beam summit 2019 - Unifying Batch and Stream Data Processing with Apache Calc...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Spark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf JagermanSpark Summit EU talk by Rolf Jagerman
Spark Summit EU talk by Rolf Jagerman
 

Viewers also liked

SQL, noSQL or no database at all? Are databases still a core skill?
SQL, noSQL or no database at all? Are databases still a core skill?SQL, noSQL or no database at all? Are databases still a core skill?
SQL, noSQL or no database at all? Are databases still a core skill?Neil Saunders
 
Phytoremediation
PhytoremediationPhytoremediation
PhytoremediationRANJANI
 
phyto & myco remediation.
phyto & myco remediation.phyto & myco remediation.
phyto & myco remediation.aiman786000
 
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment - By Haseeb
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment  - By HaseebPHYTOREMEDIATION - Using Plants To Clean Up Our Environment  - By Haseeb
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment - By HaseebHaseeb Gerraddict
 
Phytoremediation.ppt
Phytoremediation.pptPhytoremediation.ppt
Phytoremediation.pptHalala Rahman
 
Phytoremediation
PhytoremediationPhytoremediation
Phytoremediationnazish66
 
NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword Haitham El-Ghareeb
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Big Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBig Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBrian Enochson
 

Viewers also liked (11)

SQL, noSQL or no database at all? Are databases still a core skill?
SQL, noSQL or no database at all? Are databases still a core skill?SQL, noSQL or no database at all? Are databases still a core skill?
SQL, noSQL or no database at all? Are databases still a core skill?
 
Phytoremediation
PhytoremediationPhytoremediation
Phytoremediation
 
phyto & myco remediation.
phyto & myco remediation.phyto & myco remediation.
phyto & myco remediation.
 
Phytoremediation
PhytoremediationPhytoremediation
Phytoremediation
 
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment - By Haseeb
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment  - By HaseebPHYTOREMEDIATION - Using Plants To Clean Up Our Environment  - By Haseeb
PHYTOREMEDIATION - Using Plants To Clean Up Our Environment - By Haseeb
 
Phytoremediation.ppt
Phytoremediation.pptPhytoremediation.ppt
Phytoremediation.ppt
 
Phytoremediation
PhytoremediationPhytoremediation
Phytoremediation
 
Phytoremediation
PhytoremediationPhytoremediation
Phytoremediation
 
NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword NoSQL Databases, Not just a Buzzword
NoSQL Databases, Not just a Buzzword
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Big Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and CassasdraBig Data, NoSQL with MongoDB and Cassasdra
Big Data, NoSQL with MongoDB and Cassasdra
 

Similar to MongoDB to Cassandra: Migrating the Atlas Metadata Platform to Cassandra

MongoDB World 2019: Terraform New Worlds on MongoDB Atlas
MongoDB World 2019: Terraform New Worlds on MongoDB Atlas MongoDB World 2019: Terraform New Worlds on MongoDB Atlas
MongoDB World 2019: Terraform New Worlds on MongoDB Atlas MongoDB
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceSATOSHI TAGOMORI
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Kim Hammar
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nlbartzon
 
Webinar: Serverless Architectures with AWS Lambda and MongoDB Atlas
Webinar: Serverless Architectures with AWS Lambda and MongoDB AtlasWebinar: Serverless Architectures with AWS Lambda and MongoDB Atlas
Webinar: Serverless Architectures with AWS Lambda and MongoDB AtlasMongoDB
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nltieleman
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Docker, Inc.
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?Dan Sullivan, Ph.D.
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Piotr Findeisen
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure DataTaro L. Saito
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Sparknickmbailey
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 

Similar to MongoDB to Cassandra: Migrating the Atlas Metadata Platform to Cassandra (20)

MongoDB World 2019: Terraform New Worlds on MongoDB Atlas
MongoDB World 2019: Terraform New Worlds on MongoDB Atlas MongoDB World 2019: Terraform New Worlds on MongoDB Atlas
MongoDB World 2019: Terraform New Worlds on MongoDB Atlas
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
Spark ai summit_oct_17_2019_kimhammar_jimdowling_v6
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Webinar: Serverless Architectures with AWS Lambda and MongoDB Atlas
Webinar: Serverless Architectures with AWS Lambda and MongoDB AtlasWebinar: Serverless Architectures with AWS Lambda and MongoDB Atlas
Webinar: Serverless Architectures with AWS Lambda and MongoDB Atlas
 
Lessons learned while building Omroep.nl
Lessons learned while building Omroep.nlLessons learned while building Omroep.nl
Lessons learned while building Omroep.nl
 
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus Monitoring, the Prometheus Way - Julius Voltz, Prometheus
Monitoring, the Prometheus Way - Julius Voltz, Prometheus
 
With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?With Automated ML, is Everyone an ML Engineer?
With Automated ML, is Everyone an ML Engineer?
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
 
Scala at Treasure Data
Scala at Treasure DataScala at Treasure Data
Scala at Treasure Data
 
Cassandra and Spark
Cassandra and SparkCassandra and Spark
Cassandra and Spark
 
The Best of re:invent 2016
The Best of re:invent 2016The Best of re:invent 2016
The Best of re:invent 2016
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 

Recently uploaded

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

MongoDB to Cassandra: Migrating the Atlas Metadata Platform to Cassandra

  • 1. MongoDB to Cassandra The Atlas Odyssey Fred van den Driessche Tom McAdam Adam Horwich Engineer CTO Systems Engineer @fredvdd @tfm @Mmmkayness
  • 2.
  • 4. Our platform - late 2012 tbc tbc MetaBroadcast platform Video and audio metadata Profiles and activity from video and from 20+ sources Analytic requests and groupings audio products, social networks
  • 5. ?
  • 6. Main clients Main Partners Data Partners
  • 7. What is Atlas? /content BBC /schedules /topics PA ATLAS C4 sitemaps radioplayer etc... DB interlinking
  • 9. Atlas Data Model brand item series version broadcast location
  • 10. MongoDB • flexible • features • really simple • shell
  • 11. Where MongoDB falls short • too simple • lack of control • sharding • embedding
  • 13. Where to? • add a cache?
  • 14. Atlas API • content • http://atlas.metabroadcast.com/3.0/content.json?uri=http://www.bbc.co.uk/programmes/ b0074g7p&annotations=description,brand_summary,locations&apiKey=6ed2a984627daff816198acde82 • http://atlas.metabroadcast.com/3.0/content.json?apiKey=aaaa&uri=http://www.bbc.co.uk/programmes/ b0074g7p&annotations=description,brand_summary,locations • schedules • http://atlas.metabroadcast.com/3.0/schedule.json?from=now&to=now.plus. 3h&channel=bbcone&publisher=bbc.co.uk • http://atlas.metabroadcast.com/3.0/schedule.json? from=1948-12-24&to=1948-12-25&channel=radio4&publisher=bbc.co.uk • api explorer http://atlas.metabroadcast.com/#apiExplorer
  • 15.
  • 16. Atlas API • content • http://atlas.metabroadcast.com/3.0/content.json?uri=http://www.bbc.co.uk/programmes/ b0074g7p&annotations=description,brand_summary,locations&apiKey=6ed2a984627daff816198acde82 • http://atlas.metabroadcast.com/3.0/content.json?apiKey=aaaa&uri=http://www.bbc.co.uk/programmes/ b0074g7p&annotations=description,brand_summary,locations • schedules • http://atlas.metabroadcast.com/3.0/schedule.json?from=now&to=now.plus. 3h&channel=bbcone&publisher=bbc.co.uk • http://atlas.metabroadcast.com/3.0/schedule.json? from=1948-12-24&to=1948-12-25&channel=radio4&publisher=bbc.co.uk • api explorer http://atlas.metabroadcast.com/#apiExplorer
  • 17. Why Cassandra? •scalability/performance • row caches • consistency control • column-based model matches our use case
  • 19. What is Atlas? BBC Data ingest server DB PA C4 Update bus HTTP server etc... ES
  • 20. Data model • columns to model annotations • secondary indexes • index.direct(keyspace, SEGMENT_URI_INDEX_CF, ConsistencyLevel.CL_QUORUM). from(segment.getCanonicalUri()). to(segment.getIdentifier()). index().execute(requestTimeout, TimeUnit.MILLISECONDS);
  • 21. ID generation • give external data our own ID on ingest • needs to be user-friendly: http://www.radiotimes.com/programme/cf2/eastenders • mongo: findAndModify() • solution: uses Astyanax client with its distributed locking • more details: http://metabroadcast.com/blog/let- cassandra-identify-your-data
  • 22. Where we’re at • already live with some data • alpha release of schedule endpoint coming soon • later: roll out across other endpoints
  • 23. Ops
  • 24. Ops in Cassandra • we love Puppet • it’s great for automation and deployment • MongoDB: 1 file • Cassandra: 2 files! • oh... tokens
  • 25. Cassandra Tokens • define where data is written to in a cluster • therefore balanced tokens = balanced cluster • tokens should be rack aware • tools available to provide appropriate tokens for you
  • 26. Cassandra plays nicely with AWS • datacentre / rack aware • AWS Region = Datacentre • AWS Availability Zone = Rack • only recently introduced in MongoDB but simple to implement in Cassandra • horizontally (and vertically) scalable
  • 27. Monitoring • Nagios is a little threadbare for Cassandra • basic TCP service check • stats from API not very helpful • nodetool and CLI tools useful • manual effort to integrate them • if only there was some useful service...
  • 28. OpsCenter • wonderful for an overview • not so much for alerting ;) • ohai API • can integrate metrics into Nagios
  • 29. Disaster Recovery • we operate a 4 node cluster presently • replication factor of 3 with quorum read/writes • DR complicated by tokens • cluster should be balanced • snapshot + S3 Backups
  • 30. Cluster Happiness and Headaches • little maintenance overhead • cluster rebalancing • uncommon maintenance procedure • schema changes are cumbersome • little scope for rollback, can put cluster in unrecoverable state
  • 31. Summary • Mongo is good, Atlas has outgrown it • Cassandra isn’t a drop-in replacement • Ops more complex but so far so good