
Migrating Data Pipeline from MongoDB to Cassandra

MongoDB is a great NoSQL database: it's very flexible and easy to use. But can it handle massive read/write throughput? And what happens when you need to scale everything out quickly and easily?
We will lay out the reasons for migrating our data pipeline to Apache Cassandra and the steps we took to do it in a short period, without any prior knowledge of the database.
We'll list our lessons learned as well.

Bio:
Demi Ben-Ari, Sr. Data Engineer @ Windward.
I have over 9 years of experience building systems, both near-real-time applications and Big Data distributed systems.
Co-organizer of the "Big Things" Big Data community: http://somebigthings.com/big-things-intro/

Migrating Data Pipeline from MongoDB to Cassandra

  1. Demi Ben-Ari – Cassandra Meetup, Israel – 10/11/2015
  2. About me
     • Demi Ben-Ari, Senior Software Engineer at Windward Ltd.
     • B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
     • In the past: Software Team Leader & Senior Java Software Engineer, Missile Defense and Alert System – "Ofek" unit, IAF
  3. Agenda
     • Data flow with MongoDB
     • The Problem
     • Solution
     • Lessons learned from a newbie
     • Conclusion
  4. Environment Description
     • (diagram: cluster environments – Dev, Testing, Live, Staging, Production)
  5. Data pipeline flow – Use Case
     • Batch Apache Spark applications running every 10-60 minutes
     • Request rate:
       ◦ Bursts of ~9 million requests per batch job
       ◦ Beginning – reads
       ◦ End – writes
  6. Workflow with MongoDB
     • (diagram: Spark cluster – Master and Workers 1..N – writing to and reading from a MongoDB Replica Set)
  7. Spark Slave – Server Specs
     • Instance type: r3.xlarge
     • CPUs: 4
     • RAM: 30.5 GB
     • Storage: ephemeral
     • Amount: 10+
  8. MongoDB – Server Specs
     • MongoDB version: 2.6.1
     • Instance type: m3.xlarge (AWS)
     • CPUs: 4
     • RAM: 15 GB
     • Storage: EBS
     • DB size: ~500 GB
     • Collection indexes: 5 (4 compound)
  9. The Problem
     • Batch jobs
       ◦ Should run for 5-10 minutes in total
       ◦ Actually run for ~40 minutes
     • Why?
       ◦ ~20 minutes to write with the Java Mongo driver – async (Unacknowledged; see the sketch below)
       ◦ ~20 minutes to sync the journal
       ◦ Total: ~40 minutes of the DB being unavailable
       ◦ No batch process response and no UI serving
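For context, an unacknowledged write with the legacy Mongo Java driver (2.x, the generation matching MongoDB 2.6) looks roughly like the sketch below. This is an illustration only – the host, database, collection, and field names are made up, and the deck does not show its actual write code.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;
    import com.mongodb.WriteConcern;

    public class UnacknowledgedWriteSketch {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("mongo-host", 27017);     // hypothetical host
            DB db = client.getDB("pipeline");                              // hypothetical database
            DBCollection positions = db.getCollection("vessel_positions"); // hypothetical collection

            // UNACKNOWLEDGED returns as soon as the bytes hit the socket, so the Spark job
            // "finishes" while the DB keeps churning on the writes and the journal sync.
            positions.setWriteConcern(WriteConcern.UNACKNOWLEDGED);
            positions.insert(new BasicDBObject("vesselId", 42).append("ts", System.currentTimeMillis()));

            client.close();
        }
    }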
  10. Alternative Solutions
     • Sharded MongoDB (with replica sets)
       ◦ Pros:
         ▪ Increases throughput by the number of shards
         ▪ Increases the availability of the DB
       ◦ Cons:
         ▪ Very hard to manage DevOps-wise (for a small team of developers)
         ▪ High server cost – each shard needs 3 replicas
  11. Workflow with MongoDB
     • (diagram: Spark cluster – Master and Workers 1..N – with Write and Read paths)
  12. Our DevOps – after that solution
     • We had no DevOps guy at that time at all
  13. Alternative Solutions
     • DynamoDB (we're hosted on Amazon)
       ◦ Pros:
         ▪ No need to manage DevOps
       ◦ Cons:
         ▪ A "Catholic wedding" with Amazon's service (vendor lock-in)
         ▪ Not enough usage / use cases
         ▪ Might end up with a high cost for the service
  14. Alternative Solutions
     • Apache Cassandra
       ◦ Pros:
         ▪ Very large developer community
         ▪ Linearly scalable database
         ▪ No single-master architecture
         ▪ Proven to work with distributed engines like Apache Spark
       ◦ Cons:
         ▪ We had no experience at all with the database
         ▪ No geo-spatial index – we needed to implement it ourselves
  15. The Solution
     • Migration to Apache Cassandra (steps)
       ◦ Write to MongoDB and Cassandra simultaneously
       ◦ Easily create a Cassandra cluster using the DataStax Community AMI on AWS
       ◦ First easy step – use the spark-cassandra-connector (an easy bootstrap move from Spark to Cassandra; see the sketch after this slide)
       ◦ Create a monitoring dashboard for Cassandra
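For reference, a minimal write through the spark-cassandra-connector's Java API (1.x era) might look like the sketch below. The keyspace, table, contact point, and VesselPosition row class are all hypothetical; the deck does not include its own connector code.

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import java.io.Serializable;
    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ConnectorWriteSketch {

        // Hypothetical row class; bean properties map to the table's columns.
        public static class VesselPosition implements Serializable {
            private Integer vesselId;
            private Long ts;
            public VesselPosition() {}
            public VesselPosition(Integer vesselId, Long ts) { this.vesselId = vesselId; this.ts = ts; }
            public Integer getVesselId() { return vesselId; }
            public void setVesselId(Integer vesselId) { this.vesselId = vesselId; }
            public Long getTs() { return ts; }
            public void setTs(Long ts) { this.ts = ts; }
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("cassandra-write-sketch")
                    .set("spark.cassandra.connection.host", "10.0.0.1"); // hypothetical Cassandra node
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<VesselPosition> rows = sc.parallelize(
                    Arrays.asList(new VesselPosition(42, System.currentTimeMillis())));

            // Save the RDD into keyspace "pipeline", table "vessel_positions" (hypothetical names).
            javaFunctions(rows)
                    .writerBuilder("pipeline", "vessel_positions", mapToRow(VesselPosition.class))
                    .saveToCassandra();

            sc.stop();
        }
    }

Reads work the same way through CassandraJavaUtil.javaFunctions(sc).cassandraTable(...), which is what makes the bootstrap step from Spark to Cassandra so quick.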
  16. Workflow with Cassandra
     • (diagram: Spark cluster – Workers 1..N – writing to and reading from a Cassandra cluster)
  17. Result
     • Performance improvement
       ◦ The batch-write parts of the job run in 3 minutes instead of ~40 minutes with MongoDB
     • It took 2 weeks to go from "zero to hero" and ramp up a running solution that works without glitches
  18. Lessons learned from a newbie
     • Use TokenAwarePolicy when connecting to the cluster – it spreads the load across the coordinators:

       import com.datastax.driver.core.Cluster;
       import com.datastax.driver.core.SocketOptions;
       import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
       import com.datastax.driver.core.policies.TokenAwarePolicy;

       SocketOptions socketOptions = new SocketOptions();   // tune socket timeouts here
       Cluster cluster = Cluster.builder()
               .addContactPoint("10.0.0.1")                 // contact point (omitted on the slide)
               .withSocketOptions(socketOptions)
               // TokenAwarePolicy routes each statement to a replica that owns its partition,
               // spreading coordinator work across the cluster instead of a few nodes
               .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
               .build();
  19. Lessons learned from a newbie
     • Monitor everything!!! – all of the metrics:
       ◦ Cassandra
       ◦ JVM
       ◦ OS
     • Put every connection parameter behind a feature flag – you'll need it for tuning later
  20. Monitor Everything!!!
     • DataStax OpsCenter
       ◦ Comes bundled with the DataStax Community AMI on AWS
  21. Monitor Everything!!!
     • Graphite + Grafana
       ◦ Pluggable metrics – since Cassandra 2.0.x
         ▪ Cassandra internal metrics
         ▪ JVM metrics
       ◦ OS metrics
         ▪ CollectD / StatsD reporting to Graphite
       ◦ Should be combined with application-level metrics on the same graphs (see the sketch after this slide)
         ▪ Better visibility into correlations between the metrics
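On the application side, one common way to push metrics into the same Graphite instance is the Dropwizard Metrics GraphiteReporter; the sketch below is an assumption of how that could look (host, port, and metric names are made up), not code from the deck.

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    public class AppMetricsSketch {
        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();

            // Hypothetical Graphite host; 2003 is Graphite's default plaintext port.
            Graphite graphite = new Graphite(new InetSocketAddress("graphite.internal", 2003));

            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    .prefixedWith("pipeline.app")               // hypothetical metric prefix
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite);
            reporter.start(30, TimeUnit.SECONDS);               // push every 30 seconds

            // Example application-level metric to graph next to the Cassandra/JVM/OS ones.
            registry.meter("batch.writes").mark();
        }
    }

Graphing these alongside the Cassandra and OS metrics is what gives the correlation visibility mentioned above.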
  22. Monitor Everything!!!
     • Graphite + Grafana
  23. Lessons learned from a newbie
     • "nodetool" is your friend
       ◦ tpstats, cfhistograms, cfstats…
     • Data modeling
       ◦ Time-series data
       ◦ Evenly distributed partitions
       ◦ Everything becomes more rigid
     • Know your queries before you model (see the table sketch after this slide)
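To make the time-series / evenly-distributed-partitions point concrete, a hypothetical table bucketed by vessel and day could look like the sketch below (created here through the Java driver; the keyspace, table, and column names are invented for illustration, not taken from the deck).

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class TimeSeriesModelSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build(); // hypothetical node
            Session session = cluster.connect();

            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS pipeline "
              + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");

            // Partition key (vessel_id, day) buckets each vessel's data per day, so partitions stay
            // bounded and spread evenly across the cluster; clustering by ts orders rows by time.
            session.execute(
                "CREATE TABLE IF NOT EXISTS pipeline.vessel_positions ("
              + "  vessel_id int,"
              + "  day text,"          // e.g. '2015-11-10'
              + "  ts timestamp,"
              + "  lat double,"
              + "  lon double,"
              + "  PRIMARY KEY ((vessel_id, day), ts)"
              + ") WITH CLUSTERING ORDER BY (ts DESC)");

            cluster.close();
        }
    }

The "know your queries first" rule shows up here: the partition key is chosen so the batch job's reads and deletes always hit whole, predictable partitions.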
  24. Lessons learned from a newbie
     • CQL queries
       ◦ Once we got to know our data model better, it became more efficient performance-wise to use CQL statements instead of the spark-cassandra-connector
       ◦ Prepared statements, delete queries (of full partitions), range queries… (see the sketch after this slide)
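A rough sketch of those three patterns with the DataStax Java driver, reusing the hypothetical vessel_positions table from above (all names and values are illustrative only):

    import java.util.Date;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CqlPatternsSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build(); // hypothetical node
            Session session = cluster.connect("pipeline");                           // hypothetical keyspace

            // Prepared statement: parsed by the cluster once, bound and executed many times.
            PreparedStatement rangeQuery = session.prepare(
                "SELECT ts, lat, lon FROM vessel_positions "
              + "WHERE vessel_id = ? AND day = ? AND ts >= ? AND ts < ?");

            // Range query within a single partition (one vessel, one day).
            ResultSet rs = session.execute(rangeQuery.bind(42, "2015-11-10", new Date(0L), new Date()));
            for (Row row : rs) {
                System.out.println(row);
            }

            // Delete a full partition by specifying only the partition key columns.
            PreparedStatement deletePartition = session.prepare(
                "DELETE FROM vessel_positions WHERE vessel_id = ? AND day = ?");
            session.execute(deletePartition.bind(42, "2015-11-10"));

            cluster.close();
        }
    }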
  25. Useful Cassandra GUI Clients
     • DevCenter – by DataStax – free
     • DBeaver – free & open source
       ◦ Supports a wide variety of databases
  26. Conclusion
     • Cassandra is a great, linearly scaling distributed database
     • Monitor as much as you can
       ◦ Get visibility into what's going on in the cluster
     • Correct data modeling is the key to success
     • Be ready for your next war
       ◦ Cassandra performance tuning – you'll get to that for sure
  27. Questions?
  28. Thanks, Resources and Contact
     • Demi Ben-Ari
       ◦ LinkedIn
       ◦ Twitter: @demibenari
       ◦ Blog: http://progexc.blogspot.com/
       ◦ Email: demi.benari@gmail.com
       ◦ "Big Things" community – Meetup, YouTube, Facebook, Twitter
