Cabs, Cassandra, and Hailo


The story of Cassandra use and adoption at Hailo, from development, operational and management perspectives.

Published in: Technology

  • First time in the land of the free!
  • Had the idea of talking about “Cassandra at Hailo”. When it came time to actually write the talk, I realised it was going to be quite difficult.
  • I started using Cassandra in 2010, back in version 0.6. Back then it was quite hard work.
  • I founded the London meetup group in 2010 and have been flying the C* flag over London ever since. My motivation was to connect with others who were using Cassandra. Back then “swapping war stories” was a common theme. Cassandra was not easy to use.
  • Fast forward to 2013. 7,429 commits later. Cassandra “just works”. Kudos to the team of committers and contributors who have made this happen.
  • 4:30 Whilst “it just works” is quite compelling, there are still challenges to successful adoption of C* in an organisation. I am going to talk about our experiences at Hailo, from three perspectives: dev, ops and management.
  • On iOS and Android, live in London, New York, Chicago, Toronto, Boston, Dublin, Madrid
  • My recommendation was based on the solid design principles behind C*, something I’ve talked about in the past.
  • 13:00
  • Row key = entity ID, in this instance a 64-bit integer à la Snowflake. Column name = property name. Value = property value. A key point when using this pattern is to only mutate columns that you change.
  • Read heavy, demand-driven. Writes consistent.
  • Time series for storing records of emails sent. In this instance bucketed by a daily row key, for all messages. The column name is a type 1 UUID.
  • We also denormalise for other indexes, eg: here we store every message sent to a given address under a single row.
  • More writes than reads – most of these reads are actually single entity reads.
  • Stats service – insert rate at 5k/sec. Responsible for storing business events from all areas of our system.
  • We are not using CQL.
  • We can execute AQL
  • Some screenshot
  • Some screenshot
  • 1. Most people have N years of SQL experience where N >= 5
  • 2. It’s possible to shoot yourself in the foot – but this is true of SQL (eg: joins that work with low data volumes)
  • 27:00
  • London, NYC, Tokyo, Osaka, Dublin, Toronto, Boston, Chicago, Madrid, Barcelona, Washington, Montreal
  • Our rings, plus key stats (m1.large, 18 nodes in cluster A, 12 nodes in cluster B, 100GB per node in cluster A, ~ 600GB in cluster B)
  • Sometimes C* works too well. Clearly this cluster needs some attention, but our application is still working fine. We are probably at the point where we need a dedicated C* expert.
  • I interviewed key people from our management team to gauge their reaction to our C* deployment.
  • There is a perception that we have made it much harder to get at our data. In the early days at Hailo, when we all worked in one room, developers could execute ad-hoc queries on the fly for management. Nowadays we can’t. The reasons behind this are two-fold. Firstly, it is true that ad-hoc queries are harder to execute against C*. But that’s not the whole picture: much of our data is still in MySQL, and the queries we used to run against this data do not run smoothly either. The perception, however, is that the “new database” is the cause of the problems.
  • It’s easy to cause yourself a “Big Data” problem. Developers collect and store data because they can, without being clear about the business implications.
  • With the right tools, we could change the picture completely.
  • 43:00
  • Cabs, Cassandra, and Hailo

    1. Cassandra at Hailo
       David Gardner | Architect @ Hailo
       #CASSANDRA13 CASSANDRASUMMIT2013
    3. What is this talk about?
    6. • 1,352 changed files with 235,413 additions and 47,487 deletions
       • 7,429 commits
       • 1,653 tickets completed to 1.2
    7. What this talk is about
       Cassandra adoption at Hailo from three perspectives:
       1. Development
       2. Operational
       3. Management
    8. What is Hailo?
       Hailo is The Taxi Magnet. Use Hailo to get a cab wherever you are, whenever you want.
    10. What is Hailo?
        • The world’s highest-rated taxi app – over 10,000 five-star reviews
        • Over 500,000 registered passengers
        • A Hailo e-hail is accepted by a driver every four seconds around the world
        • Hailo operates in ten cities from Tokyo to Toronto in just over eighteen months of operation
    11. Hailo is growing
        • Hailo is a marketplace that facilitates over $100M in run-rate transactions and is making the world a better place for passengers and drivers
        • Hailo has raised over $50M in financing from the world’s best investors including Union Square Ventures, Accel, the founder of Skype (via Atomico), Wellington Partners (Spotify), Sir Richard Branson, and our CEO’s mother, Janice
    12. The history
        The story behind Cassandra adoption at Hailo
    13. Hailo launched in London in November 2011
        • Launched on AWS
        • Two PHP/MySQL web apps plus a Java backend
        • Mostly built by a team of 3 or 4 backend engineers
        • MySQL multi-master for single AZ resilience
    14. Why Cassandra?
        • A desire for greater resilience – “become a utility”
          Cassandra is designed for high availability
        • Plans for international expansion around a single consumer app
          Cassandra is good at global replication
        • Expected growth
          Cassandra scales linearly for both reads and writes
        • Prior experience
          I had experience with Cassandra and could recommend it
    15. The path to adoption
        • Largely unilateral decision by developers – a result of a startup culture
        • Replacement of key consumer app functionality, splitting up the PHP/MySQL web app into a mixture of global PHP/Java services backed by a Cassandra data store
        • Launched into production in September 2012 – originally just powering North American expansion, before gradually switching over Dublin and London
    16. Development perspective
    17. “Cassandra just works”
        Dom W, Senior Engineer
    18. Use cases
        1. Entity storage
        2. Time series data
    19. CF = customers
        126007613634425612:
          createdTimestamp: 1370465412
          email: dave@cruft.co
          givenName: Dave
          familyName: Gardner
          locale: en_GB
          phone: +447911111111
    20. Considerations for entity storage
        • Do not read the entire entity, update one property and then write back a mutation containing every column
        • Only mutate columns that have been set
        • This avoids read-before-write race conditions
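The entity storage advice above can be illustrated with a minimal Python sketch. It uses a plain dictionary as a stand-in for the customers column family rather than a real Cassandra client, and the `Entity` class and `write` helper are hypothetical names invented for this illustration. The idea is that each writer sends only the columns it explicitly set, so two writers touching different properties do not overwrite each other.

```python
class Entity:
    """Tracks only the properties that have been set by this writer."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self._changed = {}

    def set(self, name, value):
        self._changed[name] = value

    def mutation(self):
        # Only the columns that were explicitly set, nothing else.
        return dict(self._changed)


def write(column_family, entity):
    # Stand-in for a Cassandra mutation (hypothetical helper): columns are
    # upserted individually, so columns absent from the mutation are untouched.
    row = column_family.setdefault(entity.entity_id, {})
    row.update(entity.mutation())


# In-memory stand-in for the customers CF, keyed by entity ID.
customers = {126007613634425612: {"givenName": "Dave", "locale": "en_GB"}}

# Two independent writers change different properties of the same entity.
writer_a = Entity(126007613634425612)
writer_b = Entity(126007613634425612)
writer_a.set("locale", "en_US")
writer_b.set("givenName", "David")

write(customers, writer_a)
write(customers, writer_b)
final_row = customers[126007613634425612]
```

Had either writer read the whole row and written every column back, the later write would have silently reverted the other writer's change.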
    22. CF = comms
        2013-06-01:
          55374fa0-ce2b-11e2-8b8b-0800200c9a66: {"to":"dave@c…
          a48bd800-ce2b-11e2-8b8b-0800200c9a66: {"to":"foo@ex…
          b0e15850-ce2b-11e2-8b8b-0800200c9a66: {"to":"bar@ho…
          bfac6c80-ce2b-11e2-8b8b-0800200c9a66: {"to":"baz@fo…
    23. CF = {"to":"dave@c…
          20f70a40-ce2c-11e2-8b8b-0800200c9a66: {"to":"dave@c…
          2b44d3b0-ce2c-11e2-8b8b-0800200c9a66: {"to":"dave@c…
          338a22f0-ce2c-11e2-8b8b-0800200c9a66: {"to":"dave@c…
    26. Considerations for time series storage
        • Choose row key carefully, since this partitions the records
        • Think about how many records you want in a single row
        • Denormalise on write into many indexes
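As a rough sketch of the time series pattern, the Python below buckets messages into a daily row, names each column with a type 1 UUID, and denormalises on write into a second index keyed by recipient address. Nested dictionaries stand in for column families. The name `comms_by_address`, the helper functions and the sample messages are assumptions made for illustration; the talk does not name the second column family or show client code.

```python
import json
import uuid
from datetime import datetime, timezone


def daily_row_key(timestamp):
    """Bucket a Unix timestamp by UTC day, e.g. '2013-06-01'."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def record_message(store, timestamp, message):
    # Type 1 (time-based) UUID as the column name. Note that uuid1() takes
    # its time component from the clock at write time, not from `timestamp`.
    column_name = uuid.uuid1()
    value = json.dumps(message)
    # Index 1: all messages, one row per day.
    store["comms"].setdefault(daily_row_key(timestamp), {})[column_name] = value
    # Index 2 (hypothetical CF name): every message sent to one address.
    store["comms_by_address"].setdefault(message["to"], {})[column_name] = value
    return column_name


store = {"comms": {}, "comms_by_address": {}}
record_message(store, 1370044800, {"to": "dave@example.com", "subject": "Receipt"})
record_message(store, 1370048400, {"to": "foo@example.com", "subject": "Welcome"})
```

The choice of a daily bucket controls how wide each row grows; a busier stream might need hourly buckets, a quieter one monthly.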
    27. Client libraries
        • Astyanax (Java)
        • phpcassa (PHP)
        • (Go)
    28. Analytics
        • With Cassandra we lost the ability to carry out analytics, eg: COUNT, SUM, AVG, GROUP BY
        • We use Acunu Analytics to give us this ability in real time, for pre-planned query templates
        • It is backed by Cassandra and therefore highly available, resilient and globally distributed
        • Integration is straightforward
    29. AQL
        SELECT
          SUM(accepted),
          SUM(ignored),
          SUM(declined),
          SUM(withdrawn)
        FROM Allocations
        WHERE timestamp BETWEEN '1 week ago' AND 'now'
        AND driver='LON123456789'
        GROUP BY timestamp(day)
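One general way to answer a pre-planned aggregate query like the one above in real time is to maintain running totals at write time, so that a read is a lookup instead of a scan. The Python sketch below illustrates that idea only; the talk does not say how Acunu Analytics works internally, and the `rollups` structure and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

OUTCOMES = ("accepted", "ignored", "declined", "withdrawn")

# (driver, day) -> outcome -> running total, updated on every write.
rollups = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))


def record_allocation(driver, timestamp, outcome):
    """Increment the pre-planned roll-up for this driver and UTC day."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    rollups[(driver, day)][outcome] += 1


def totals_by_day(driver, days):
    """Answer the GROUP BY day query by reading roll-ups that exist."""
    return {
        day: dict(rollups[(driver, day)])
        for day in days
        if (driver, day) in rollups
    }


record_allocation("LON123456789", 1370044800, "accepted")  # 2013-06-01
record_allocation("LON123456789", 1370048400, "declined")  # 2013-06-01
record_allocation("LON123456789", 1370131200, "accepted")  # 2013-06-02

result = totals_by_day("LON123456789", ["2013-06-01", "2013-06-02", "2013-06-03"])
```

The trade-off is the one the slides describe: only queries planned in advance have a roll-up to read from.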
    32. Challenges
    33. [Chart: average years of experience per team member, MySQL vs Cassandra; y-axis up to 10]
    35. Lessons learned
    36. Have an advocate
        • Get someone who will sell the vision internally
        • Make an effort to get everyone on board
    37. Learn the theory
        • Teach each team member the fundamentals
        • CQL can encourage an SQL mindset, but it’s important to understand the underlying data model
        • Make a real effort to share knowledge – keep in mind the gulf in experience for most team members between their old world and the new world (SQL vs NoSQL)
        • Peer review data models
    38. Operational perspective
    39. “Allows a team of 2 to achieve things they wouldn’t have considered before Cassandra existed”
        Chris H, Operations Engineer
    41. • 2 clusters
        • 6 machines per region
        • 3 regions (stats cluster pending addition of third DC)
        • Operational Cluster: ap-southeast-1, us-east-1, eu-west-1
        • Stats Cluster: us-east-1, eu-west-1
    42. • AWS VPCs with OpenVPN links
        • 3 AZs per region
        • m1.large machines
        • Provisioned IOPS EBS
        • Operational Cluster: ~ 100GB/node
        • Stats Cluster: ~ 600GB/node
    43. Backups
        • SSTable snapshot
        • Used to upload to S3, but this was taking >6 hours and consuming all our network bandwidth
        • Now take EBS snapshot of the SSTable snapshots
    44. Encryption
        • Requirement for NYC launch
        • We use dmcrypt to encrypt the entire EBS volume
        • Chose dmcrypt because it is uncomplicated
        • Our tests show a 1% hit in disk performance, which concurs with what Amazon suggest
    45. Datastax Ops Centre
        • We run the free version
        • Offers up easily accessible “one screen” overviews of the activity of the entire cluster
        • Big fans – an easy win
    47. Multi DC
        • Something that Cassandra makes trivial
        • Would have been very difficult to accomplish active-active inter-DC replication with a team of 2 without Cassandra
        • Rolling repair needed to make it safe (we use LOCAL_QUORUM)
        • We schedule “narrow repairs” on different nodes in our cluster each night
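The LOCAL_QUORUM point can be made concrete with a little arithmetic. A quorum is a majority of replicas, floor(RF / 2) + 1, and LOCAL_QUORUM counts only replicas in the coordinator's own data centre. The sketch below assumes a replication factor of 3 per data centre, which the talk does not state; the function names are illustrative.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_local_failures(replication_factor):
    """Replicas in the local DC that can be down while LOCAL_QUORUM succeeds."""
    return replication_factor - quorum(replication_factor)


# Assumed replication factor per data centre (not given in the talk).
rf_per_dc = 3
acks_needed = quorum(rf_per_dc)
failures_ok = tolerated_local_failures(rf_per_dc)
```

Because only local replicas must acknowledge, a write can succeed before any remote data centre has received it, which is why regular repair is needed to reconcile replicas that missed updates.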
    48. Compression
        • Our stats cluster was running at ~1.5TB per node
        • We didn’t want to add more nodes
        • With compression, we are now back to ~600GB
        • Easy to accomplish
        • `nodetool upgradesstables` on a rolling schedule
    49. Lessons learned
    51. Management perspective
    52. “The days of the quick and dirty are over”
        Simon V, EVP Operations
    53. Technically, everything is fine…
        • Our COO feels that C* is “technically good and beautiful”, a “perfectly good option”
        • Our EVPO says that C* reminds him of a time series database in use at Goldman Sachs that had “very good performance”
        …but there are concerns
    54. [Diagram: people who can attempt to query MySQL vs people who can attempt to query Cassandra]
    56. Lessons learned
    57. Keep the business informed
        • Pre-launch, we were tasked with increasing resiliency
        • Cassandra addressed immediate business needs, but the trade-offs involved should have been communicated more clearly
    58. Sing from the same hymn sheet
        • A senior founding engineer had doubts about the adoption of Cassandra until very recently
        • In the presence of business doubt, this lack of consistency amongst developers exacerbated the concerns
        • We should have made more effort to make bilateral decisions on adoption – I don’t think this would have been hard to achieve
    59. Provide solutions
        • There are many options for ad-hoc querying of Cassandra
        • We underestimated the impact of not having a good solution for this from the very beginning
    60. [Diagram: people who can attempt to query MySQL vs people who can attempt to query Cassandra]
    61. Conclusions
    62. We like Cassandra
        • Solid design
        • HA characteristics
        • Easy multi-DC setup
        • Simplicity of operation
    63. Lessons for successful adoption
        • Have an advocate, sell the dream
        • Learn the fundamentals, get the best out of Cassandra
        • Invest in tools to make life easier
        • Keep management in the loop, explain the trade-offs
    64. The future
        • We will continue to invest in Cassandra as we expand globally
        • We will hire people with experience running Cassandra
        • We will focus on expanding our reporting facilities
        • We aspire to extend our network (1M consumer installs, wallet) beyond cabs
        • We will continue to hire the best engineers in London, NYC and Asia
    65. Thank you