
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline

ScyllaDB
Mar. 7, 2023


  1. Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline Presented by: Peter Corless, Director of Technical Advocacy, ScyllaDB & Alexys Jacob, CTO, Numberly Moderated by: Jared Ruckle, InfoQ Editor
  2. Poll Where are you in your NoSQL adoption?
  3. Poll How much data do you have under management in your transactional database?
  4. Peter Corless 4 Director of Technical Advocacy @ ScyllaDB + Listen to & share user stories + Write blogs & case studies + Play (and design) strategy & roleplaying games + @PeterCorless on Twitter
  5. + InfoWorld 2020 Technology of the Year! + Founded by designers of KVM Hypervisor The Database Built for Gamechangers 5 “ScyllaDB stands apart...It’s the rare product that exceeds my expectations.” – Martin Heller, InfoWorld contributing editor and reviewer “For 99.9% of applications, ScyllaDB delivers all the power a customer will ever need, on workloads that other databases can’t touch – and at a fraction of the cost of an in-memory solution.” – Adrian Bridgewater, Forbes senior contributor + Resolves challenges of legacy NoSQL databases + >5x higher throughput + >20x lower latency + >75% TCO savings + DBaaS/Cloud, Enterprise and Open Source solutions + Proven globally at scale
  6. 6 +400 Gamechangers Leverage ScyllaDB Seamless experiences across content + devices Fast computation of flight pricing Corporate fleet management Real-time analytics 2,000,000 SKU e-commerce management Video recommendation management Threat intelligence service using JanusGraph Real time fraud detection across 6M transactions/day Uber scale, mission critical chat & messaging app Network security threat detection Power ~50M X1 DVRs with billions of reqs/day Precision healthcare via Edison AI Inventory hub for retail operations Property listings and updates Unified ML feature store across the business Cryptocurrency exchange app Geography-based recommendations Global operations: Avon, The Body Shop + more Predictable performance for on sale surges GPS-based exercise tracking Serving dynamic live streams at scale Powering India's top social media platform Personalized advertising to players Distribution of game assets in Unreal Engine Make marketing more relevant, effective and measurable
  7. Alexys Jacob 7 @ultrabug + CTO, Numberly + ScyllaDB awarded Open Source & University contributor + Open Source author & contributor + Apache Avro, Apache Airflow, MongoDB, MkDocs… + Tech speaker & writer + Gentoo Linux developer + Python Software Foundation contributing member Speaker Photo
  8. Numberly, Marketing Technologist 8 Digital native, Media, CRM and data have been at the heart of our business for the past 20 years. A data-driven approach to help impact engagement and sales and turn your marketing spend into a profitable investment. Optimizing your ROI is our priority, both strategic and operational. Numberly is a group with solid financial strength. The company is listed on the stock exchange and operates globally in 53 countries with a team of 33 nationalities. The consistency of our CSR commitments, for over 20 years, means that our group is CSR & gender equitable by design. We are convinced that parity is a key factor of strong performance and success. The recognition and loyalty of our customers is proof of this. Internationally recognized technological expertise, tool agnostic: activation on our tools (CRM, Numberly trading desk, CDP), expertise on third-party tools. R&D investments of up to 10% of our turnover to maximize your performance. The performance quality that we deliver to our customers is manifested by the range of awards received for the projects we have helped put in place. More than 500 employees brought together by a “marketing & tech mindset”. A focus on data and the quality of execution, and a flexible and pragmatic approach. We pass on our passion and our know-how to our clients' teams, but also our commitment in the ecosystem to defending the Open Internet and European digital sovereignty. Digital native & Data driven Robust & International Committed & Responsible Passionate & Collaborative Tech Experts & Agnostic Innovative & Awarded Paris Amsterdam New York Dubai Montréal London Brussels Tel Aviv Lyon
  9. Agenda + The thought process to move from Python to Rust + Context, promises, arguments and decision + Learning Rust the hard way + All the stack components I had to work with in Rust + Tips, Open Source contributions and code samples + What is worth it? + Graphs, production numbers + Personal notes 9
  10. Choosing Rust over Python 10
  11. At Numberly, we move and process (a lot of) data using Kafka streams and pipelines that are enriched using ScyllaDB. processor app processor app Project context at Numberly ScyllaDB processor app raw data enriched data enriched data enriched data client app partner API business app 11
  12. processor app processor app Pipeline reliability = latency + resilience Scylla processor app raw data enriched data enriched data enriched data client app partner API business app If a processor or ScyllaDB is slow or fails, our business, partners & clients are at risk. 12
  13. A major change in our pipeline processors had to be undertaken, giving us the opportunity to redesign them entirely. The (rusted) opportunity ScyllaDB processor app raw data enriched data enriched data enriched data client app partner API business app 13
  14. “Hey, why not rewrite those 3 Python processor apps into 1 Rust app?” 14
  15. The (never tried before) Rust promises 15 A language empowering everyone to build reliable and efficient software. + Secure + Memory and thread safety as first class citizens + No runtime or garbage collector + Easy to deploy + Compiled binaries are self-sufficient + No compromises + Strongly and statically typed + Exhaustivity is mandatory + Built-in error management syntax and primitives + Plays well with Python + PyO3 can be used to run Rust from Python (or vice versa)
  16. Efficient software != Faster software + “Fast” meanings vary depending on your objectives. + Fast to develop? + Fast to maintain? + Fast to prototype? + Fast to process data? + Fast to cover all failure cases? “Selecting a programming language can be a form of premature optimization” 16
  17. Efficient software != Faster software + “Fast” meanings vary depending on your objectives. + Fast to develop? Python is way faster + did that for 15 years + Fast to maintain? Very few people at Numberly know Rust + Fast to prototype? No, code must be complete to compile and run + Fast to process data? Sure: to prove it, measure it + Fast to cover all failure cases? Definitely: mandatory exhaustivity + error handling primitives “I did not choose Rust to be “faster”. Our Python code was fast enough to deliver our pipeline processing.” 17
  18. Innovation cannot exist if you aren’t willing to lose time. The question is knowing when, and on what project. 18
  19. The Reliable software paradigms + What makes me slow will make me stronger. + Low level paradigms (ownership, borrowing, lifetimes). + Strong type safety. + Compilation (debug, release). + Dependency management. + Exhaustive pattern matching. + Error management primitives (Result). + Explicit return values (Option). 19
  20. The Reliable software paradigms + What makes me slow will make me stronger. + Low level paradigms (ownership, borrowing, lifetimes). If it compiles, it’s safe + Strong type safety. Predictable, readable, maintainable + Compilation (debug, release). Compiler is very helpful compared to random Python exceptions + Dependency management. Finally something looking sane vs the Python mess + Exhaustive pattern matching. Confidence that you’re not forgetting something + Error management primitives (Result). Handle failure right from the language syntax + Explicit return values (Option). Clear separation between Some(value) and None “I chose Rust because it provided me with the programming paradigms at the right abstraction level that I needed to finally understand and better explain the reliability and performance of my application.” 20
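As a minimal illustration of these paradigms (not code from the talk), the sketch below shows exhaustive pattern matching, Option for explicit absence, and Result for built-in error handling. All names are hypothetical stand-ins for a pipeline enrichment step:

```rust
// Hypothetical message source in a pipeline processor.
#[derive(Debug, PartialEq)]
enum Source {
    Client,
    Partner,
    Business,
}

// Option makes "no value" explicit: a lookup returns Some(...) or None,
// and the caller cannot silently ignore the None case.
fn datacenter_for(source: &Source) -> Option<&'static str> {
    // The compiler rejects this match if a Source variant is forgotten:
    // that is the mandatory exhaustivity the slide mentions.
    match source {
        Source::Client => Some("dc-paris"),
        Source::Partner => Some("dc-amsterdam"),
        Source::Business => None, // handled locally, no remote datacenter
    }
}

// Result forces failure handling through the language syntax itself.
fn parse_user_id(raw: &str) -> Result<u64, String> {
    raw.trim()
        .parse::<u64>()
        .map_err(|e| format!("bad user id {raw:?}: {e}"))
}

fn main() {
    assert_eq!(datacenter_for(&Source::Client), Some("dc-paris"));
    match parse_user_id("42") {
        Ok(id) => println!("routing user {id}"),
        Err(e) => eprintln!("dropping message: {e}"),
    }
}
```

The point is not the toy logic but that every absent value and every failure path is visible in the type signatures, which is what makes the behavior predictable and reviewable.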
  21. Learning Rust the hard way 21
  22. Production is not a Hello World + Learning the syntax and handling errors everywhere + Confluent Kafka + Schema Registry + Avro + Asynchronous latency-optimized design + ScyllaDB multi-datacenter + MongoDB + Kubernetes deployment + Prometheus exporter + Grafana dashboarding + Sentry Scylla processor app Confluent Kafka 22
  23. Confluent Kafka Schema Registry + Confluent Schema Registry framing breaks vanilla Apache Avro deserialization. + Consider using Gerard Klijs’ schema_registry_converter crate (v3+) + I discovered performance problems, which we worked on together and which have since been addressed! + Latency-overhead-free manual approach: 23
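The “manual approach” the slide alludes to boils down to stripping Confluent’s 5-byte wire-format framing (one magic byte, then a big-endian 4-byte schema id) before handing the remaining bytes to a vanilla Avro deserializer. A minimal std-only sketch, with illustrative names (this is the shape of the technique, not the talk’s actual code):

```rust
// Split a Confluent-framed Kafka message payload into (schema_id, avro bytes).
// Wire format: [0x00 magic byte][4-byte big-endian schema id][Avro datum].
fn split_confluent_frame(payload: &[u8]) -> Result<(u32, &[u8]), String> {
    if payload.len() < 5 {
        return Err("payload shorter than the 5-byte Confluent header".into());
    }
    if payload[0] != 0 {
        return Err(format!("unexpected magic byte {}", payload[0]));
    }
    let schema_id = u32::from_be_bytes([payload[1], payload[2], payload[3], payload[4]]);
    Ok((schema_id, &payload[5..]))
}

fn main() {
    // Fake message: magic byte, schema id 42, then the raw Avro datum bytes.
    let msg = [0u8, 0, 0, 0, 42, b'a', b'v', b'r', b'o'];
    let (schema_id, avro_bytes) = split_confluent_frame(&msg).unwrap();
    assert_eq!(schema_id, 42);
    assert_eq!(avro_bytes, b"avro");
    // From here, avro_bytes would be decoded with the writer schema looked
    // up (and cached) for schema_id, e.g. via the apache-avro crate.
}
```

Doing this split yourself avoids a registry round-trip per message once the schema for a given id is cached locally.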
  24. Apache Avro Rust was broken! + Crate apache-avro (formerly avro-rs) was donated to Apache Avro without an appointed committer. + Deserialization of complex schemas was broken... + I contributed fixes to Apache Avro (AVRO-3232+3240) + Now merged thanks to Martin Grigorov! + Make sure to use apache-avro v0.14+ + Rust compiler optimizations give a hell of a boost! + Deserializing Avro is faster than JSON! 24
  25. + Tricks to make your Kafka consumer strategy more efficient. + Deserialize your consumer messages in the consumer loop, not in green-thread tasks + Spawning a task has performance costs + Control your green-thread parallelism + Defer to green-thread tasks when I/O starts to be required task / msg Asynchronous patterns to optimize latency Kafka consumer + avro deserializer raw data task / msg task / msg task / msg task / msg Scylla enriched data 25
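The pattern above can be sketched in a self-contained way. The talk uses the tokio runtime with green-thread tasks; this illustration substitutes a std thread and a bounded channel to show the same shape: the CPU-bound decode happens inline in the consumer loop, while the I/O-bound enrichment is deferred behind a capped amount of in-flight work. The decode/enrich stand-ins and all names are illustrative:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Stand-in for Avro deserialization: CPU-bound, so it runs inline in the
// consumer loop rather than being spawned as a task.
fn decode(raw: &str) -> String {
    raw.to_uppercase()
}

fn main() {
    // The bounded channel caps in-flight messages at 16, playing the role
    // of controlled green-thread parallelism: the consumer loop blocks
    // instead of spawning unbounded work when downstream I/O is slow.
    let (tx, rx) = sync_channel::<String>(16);

    let worker = thread::spawn(move || {
        let mut enriched = Vec::new();
        for msg in rx {
            // Stand-in for the I/O-bound ScyllaDB lookup + insert.
            enriched.push(format!("{msg}!"));
        }
        enriched
    });

    for raw in ["a", "b", "c"] {
        let decoded = decode(raw); // inline in the loop, not spawned
        tx.send(decoded).unwrap(); // hand off only once I/O is required
    }
    drop(tx); // close the channel so the worker drains and exits

    let enriched = worker.join().unwrap();
    assert_eq!(enriched, vec!["A!", "B!", "C!"]);
}
```

In the real application the worker side would be a bounded set of tokio tasks rather than one thread, but the division of labor is the same: decode eagerly, bound the concurrent I/O.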
  26. Absorbing tail latency spikes with parallelism [Grafana graphs: a x16 ScyllaDB latency spike is absorbed with only a x2 rise in Kafka consuming latency as the parallelism load increases] 26
  27. Scylla Rust (shard-aware) driver + The scylla-rust-driver crate is production-ready. + Use a CachingSession to automatically cache your prepared statements + Beware: prepared queries are NOT paged, use paged queries with execute_iter() instead! + Use the latest optimized version 0.7.0! 27
  28. Exporting metrics properly for Prometheus + Effectively measuring latencies down to microseconds. + Fine tune your histogram buckets to match your expected latencies! ... 28
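To make the bucket-tuning advice concrete, here is a sketch of computing exponential histogram bounds spanning the 50µs-to-15s range the speaker notes mention; it mirrors what client libraries’ exponential-buckets helpers compute. The function and values are illustrative, not the talk’s actual metrics code:

```rust
// Compute `count` exponentially spaced histogram bucket upper bounds,
// starting at `start` and multiplying by `factor` each step (seconds).
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    (0..count).map(|i| start * factor.powi(i as i32)).collect()
}

fn main() {
    // 50µs doubled 18 times reaches ~13.1s; with Prometheus's implicit
    // +Inf bucket on top, this covers the 15s write timeout ceiling while
    // keeping microsecond-level resolution at the low end.
    let buckets = exponential_buckets(50e-6, 2.0, 19);
    assert_eq!(buckets.len(), 19);
    assert!((buckets[0] - 50e-6).abs() < 1e-12);
    assert!(*buckets.last().unwrap() > 10.0);
    println!("{buckets:?}");
}
```

The gotcha the slide warns about: if the bounds do not bracket your real latency distribution, `histogram_quantile` in Grafana will clamp or smear the quantiles, so pick the range from measured expectations, not defaults.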
  29. Grafana dashboarding + Graph your precious metrics right! + ScyllaDB prepared statement cache size + Query and throughput rates + Kafka commits occurrence + Errors by type + Kubernetes pod memory + ... + Visualizing Prom Histograms max by (environment)(histogram_quantile(0.50, processing_latency_seconds_bucket{...})) 29
  30. Was it worth it? 30
  31. Did I really lose time because of Rust? + I spent more time analyzing the latency impacts of code patterns and drivers’ options than struggling with Rust syntax. + Key figures for this application: + Kafka consumer max throughput with processing? 200K msg/s on 20 partitions + Avro deserialization P50 latency? 75µs + Scylla SELECT P50 latency on 1.5B+ rows tables? 250µs + Scylla INSERT P50 latency on 1.5B+ rows tables? 660µs 31
  32. It went better than expected + Rust crates ecosystem is mature, similar to Python Package Index + 3 Python apps totalling 54 pods replaced by 1 Rust app totalling 20 pods + We helped & worked on making the scylla-rust-driver even better + Token aware policy can fallback to non-replicas for higher availability + Optimized partition key calculations for prepared statements + Expose partition key sharding to create shard-aware applications (#ScyllaSummit2023) + More to come! + This feels like the most reliable and efficient software I ever wrote! 32
  33. - Numberly’s journey to choosing ScyllaDB - Evaluating ScyllaDB for production 1/2 - Evaluating ScyllaDB for production 2/2 - Numberly’s use case: ScyllaDB to replace MongoDB+Hive (Scylla Summit 2018) - Numberly’s experience: MongoDB vs Scylla (Scylla Summit 2019) - Numberly’s contributions: Faster ScyllaDB Shard-Aware drivers (Scylla Summit 2021) - Scylla Summit 2023: Building a 100% shard aware application using Rust And of course: - ScyllaDB University Learning More 33
  34. Join our enthusiastic teams and help us face all our challenges with an innovative, caring and community-driven mindset! - Data Engineering & Science - Software Engineering - Infrastructure We are remote friendly, so wherever you are, let’s have a chat! alexys@numberly.com Numberly is hiring 34
  35. Watch now on-demand at scylladb.com/summit 35 Questions?
  36. Thank you for joining us today. @scylladb scylladb/ slack.scylladb.com @scylladb company/scylladb/ scylladb/

Editor's Notes

  1. Welcome, everyone! My name is Peter Corless, Director of Technical Advocacy at ScyllaDB. I’ll be your host for today’s webinar -- “Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline.” Today you will learn about how Numberly moved key parts of their application code to Rust to optimize their real-time operational performance.
  2. Before we begin we are pushing a quick poll question. Where are you in your NoSQL adoption? I currently use ScyllaDB I currently use another NoSQL database I am currently evaluating NoSQL I am interested in learning more about ScyllaDB None of the above Ok, thanks for those responses. Let’s get started.
  3. Hello again everyone! I just want to take a moment for a quick audience poll. For a sense of scale, we’d like to understand: How much data do you have under management in your own transactional database systems? Less than 1 terabyte 1 to 10 terabytes 10-100 terabytes >100 terabytes Pick the answer that best matches your current data set. We’ll leave the poll up for a bit for you to answer.
  4. My name is Peter Corless, Director of Technical Advocacy at ScyllaDB. I listen to and help share user success stories.
  5. For those of you who are not familiar with ScyllaDB yet, it is the monstrously fast and scalable NoSQL database built for gamechangers. Created by the founders of the KVM hypervisor, ScyllaDB was conceived with key design characteristics to power this next tech cycle and resolve many of the challenges posed when operating distributed systems at scale. In particular, ScyllaDB is a high throughput and low latency distributed NoSQL database. Increasing database throughput (operations/second), improving P99 latency, and reducing total cost are principle drivers behind teams like yours for selecting ScyllaDB. In 2020, ScyllaDB received Infoworld’s prestigious Technology of the Year award, and it was truly an honor to be among fellow recipients like Tableau, Databricks, and Snowflake. Recently we launched ScyllaDB 5 with several new innovative features, and we have an on-demand webinar covering what’s new which I highly encourage you to watch… With such consistent innovation the adoption of our database technology has grown to over 400 key players worldwide…
  6. “Many of you will recognize some of the companies among the selection pictured here, such as Starbucks who leverage ScyllaDB for inventory management, Zillow for real-time property listing and updates, and Comcast Xfinity who power all DVR scheduling with ScyllaDB.” As you can see, ScyllaDB is used across many different industries and for entirely different types of use cases. Chat applications, IoT, social networking, e-commerce, fraud detection, and security are some of the examples pictured in this slide. More often than not, your company probably has a use case that is a perfect fit for ScyllaDB, and it may be that you don’t know it yet! If you are interested in knowing how we can help you more, feel free to engage with us! To summarize, if you care about having low latencies while having high throughput for your application, we are certain that ScyllaDB is a good fit for you.
  7. Without any further ado, I have the pleasure of introducing to you our speaker today: Alexys Jacob, CTO at Numberly, who is known to the open source community as “Ultrabug.” A frequent ScyllaDB open source & ScyllaDB University contributor.
  8. Numberly is a digital data marketing technologist and expert helping brands connect and engage with their customers using all digital channels available. We are proud to be an independent company with solid, internationally recognized expertise in both marketing and technology; we just celebrated our 23rd anniversary.
  9. I’ve been doing Python in production for more than 15 years now. That’s as much a marker of my advancing age as it is a shock to my colleagues that I could even consider coding in a language other than Python. So what could trigger such a radical change in me?
  10. As a data company, we operate on a lot of data that is fast moving using an event driven approach that drives our technological choices towards platforms that allow us to process and react to stimulus as close to real time as possible We combine Kafka and Scylla extensively on streams and specialized pipeline applications that we’ll call data processors here Each of those pipeline data processor applications prepare and enrich the incoming data so that it is useful to the downstream business / partner or client applications
  11. The relevance of a data driven decision is at its best when it’s close to the event’s time of occurrence which means that availability and latency are business critical to us Those data processor apps, kafka, and of course Scylla can’t fail, if they do we get angry partners and clients (clic) Latency and resilience are thus the pillars upon which we build our business reliable platforms
  12. The data industry and ecosystems are always changing. Last fall, we had to adapt three of our most demanding data processors written in Python. Those processor applications had been doing the job for more than 5 years; they were battle tested and trustworthy. As you know, I’m not a low-level programmer, as I always felt C and C++ were cumbersome and useless for my needs. But I had been following Rust’s maturation for a while: I was curious and had the feeling that it could find its place in between Python and C++. So when this opportunity came, I went to my colleagues and told them
  13. Hey, why not rewrite those 3 Python applications that we know work very well into one Rust application, in a language we don’t even know? (clic) After the shock, they asked for a rationale rather than just a crazy idea.
  14. Rust makes promises that more and more people seem to agree with. It is supposed to be… (read bullets) But furthermore, their marketing motto speaks to the marketer inside me (read) (Clic) That’s me! (Clic) That’s what my new processor app needs! Careful attendees would ask me: hey Alexys, you did not mention speed in that list. Isn’t Rust supposed to be super fast? Well, their motto mentions efficiency, and it’s not the same as speed.
  15. Efficient software does not always mean faster software. Brett Cannon, a Python core developer, argues that selecting a programming language for being faster on paper is a form of premature optimization. (clic) I agree with him in the sense that the word Fast conveys different meanings depending on your objectives. In my opinion, Rust can be said to be faster as a consequence of being Efficient, which does not cover all the items on the list here. Let’s demonstrate that in my context.
  16. (read)… As we can see in my case, choosing Rust over Python will mean that I will definitely lose time (clic + read) So why would I want to lose time? The short answer is “innovation”
  17. (read) So the gist of my decision was that I was sure this project was the right one at the right time to foster innovation at Numberly
  18. Now what will I gain from losing time other than the pain of using semicolons and brackets everywhere? Supposedly a more reliable software thanks to Rust unique design and paradigms This is to say that what makes me slow is also an opportunity to make my software stronger
  19. (read)... (clic) (read)
  20. Here is an overview of all the aspects and all the technological stacks that I had to deal with (read)... Since our time is limited I will skip through this list to highlight the most insightful parts Let’s start with the first wall I hit right from the start: consuming messages from Kafka
  21. We use Confluent Kafka Community edition with its Schema Registry to structure our Avro encoded messages in our Kafka topics. The bad news is that Confluent Schema Registry adds a magic byte to Kafka message payloads, which breaks vanilla Apache Avro schema deserialization. Luckily for me, Gerard Klijs has worked on a crate to address this problem. We worked together on improving its performance so that it’s production ready. Before we fixed this, I used the manual approach shown here to decode Avro messages myself with respect to their schema.
  22. Then I hit the second wall when, even though reading the Avro payload was possible, I still could not deserialize it… As a total Rust newbie, I blamed myself for days before even daring to suspect Apache Avro was the culprit. I eventually read the Apache Avro source code and discovered that it was broken for complex schemas like ours. Is anyone in the world using Rust Apache Avro in production yet? So here I am contributing fixes to the Apache Avro Rust implementation, which eventually got merged three months later in January thanks to its newly appointed committer Martin. Anyway, another unexpected fact that Rust allowed me to prove is that deserializing Avro is faster than deserializing JSON in our case of rich and complex data structures. I say unexpected, but my colleague Othmane was expecting it, to be fair; I was happy to finally prove him right!
  23. Once I was finally able to consume messages from Kafka, I started looking at the best pattern to process them. I turned to the tokio asynchronous runtime, which was very intuitive coming from Python asyncio. I played a lot with various code patterns to make consuming messages from Kafka latency stable and reliable. One of the interesting findings was to not defer the decoding of Avro messages to a green-thread but to do it right in the consumer loop. Indeed, since deserialization is a CPU bound operation, it benefits from not being cooperative with other green-thread tasks. Similarly, allowing and controlling your parallelism will help stabilize your I/O bound operations; let’s see a real example of that.
  24. Once deserialization is done, deferring the rest of my processing, which is I/O bound, to green-threads helped absorb tail latencies without affecting my Kafka consuming speed. The Grafana dashboard you see here shows that around 9:00 something made Scylla slower than usual: Scylla SELECT and INSERT P95 latencies went up by a factor of 16. At the same time, you can see a bump in my parallelism load as I started having more concurrent green-threads processing messages. But it only hit my Kafka consuming latency by a factor of 2 at P95, effectively absorbing the tail latencies due to this ephemeral overload in Scylla. This is the typical example of something that was harder to pinpoint and demonstrate in Python but became clear with Rust.
  25. Now to our dear Scylla. I found the Scylla Rust driver to be intuitive and well featured; congratulations to the team, which is also very helpful on their dedicated channel on the Scylla Slack service, join us there! The new CachingSession is very handy to cache your prepared statements so you don’t have to do it yourself like I did at first. (read) Beware! I’m showcasing a code example of a production connection function to Scylla, using SSL, multi-datacenter awareness and a caching session. Speaking of multi-datacenter awareness, I hit a bug in the token aware load balancing that promptly got fixed by the team and released in 0.4.2.
  26. Even if it’s described late in the presentation, Prometheus is actually the first thing I set up on my application so that I could measure the latency and throughput impacts of all the experiments I did. For a test to be meaningful, those measurements must be made right and then graphed right. So here is an example of how I measure Scylla query insertion latency. The first and important gotcha is to set up your histogram buckets correctly for your expected graphing finesse. Here I expect Scylla latency to vary between 50µs and 15s, which is the maximum server timeout I’m allowing for writes. Then I use it like this: I start a timer on the histogram and record its duration on success, and drop it on failure so that my metrics are not polluted by possible errors.
  27. Once you measure right, you need to visualize and graph right. I created a detailed and meaningful Grafana dashboard so I could see and compare the results of my Rust application experiments; best time invested ever. Make sure you graph as many things as possible (read). There are gotchas to graphing Prometheus histograms, so I’m linking a great article that the folks at Grafana wrote on how to visualize them right in Grafana.
  28. The syntax was surprisingly simple and intuitive to adopt even coming from Python I absolutely failed to resist the temptation of testing and analyzing everything at a lower level, it was an unexpected new joy for me So in the end most of my time was spent on testing, graphing, analyzing and trying to come up with a decent and insightful explanation This surely does not look like wasted time to me! For the number hungry of you in the audience, here are some numbers taken from the application Kafka consumer max throughput with processing? 200K msg/s on 20 partitions Avro deserialization P50 latency? 75µs Scylla SELECT P50 latency on 1.5B+ rows tables? 250µs Scylla INSERT P50 latency on 1.5B+ rows tables? 2ms
  29. It went way better than expected (read)... Even if it was my first Rust application, I felt confident during the development process which transformed into confidence in a predictable and resilient software. After months of production, the new Rust pipeline processor proves to be very stable and resilient. Sentry is bored. Rust promises are living up to expectations!
  30. I selected a few articles and videos of Numberly’s experience if you want more material to deepen your understanding or widen your scope of knowledge around ScyllaDB database and ecosystem And of course, make sure to check out the excellent content from ScyllaDB University
  31. COME UP 2-3 SEED QUESTIONS FOR Q/A ORGANIC QUESTIONS FROM FIRST PRESENTATION Which resources did you use to learn rust? Great presentation, thanks! What are some resources you recommend to get started on Rust? How did you find the ramp up time for others on your team in their path to becoming proficient in rust? Has it been difficult not having a Rust expert on your team? What challenges did you encounter and any learnings? POSSIBLE SEED QUESTIONS – PROVIDED TO US FROM INFOQ TEAM You mention a few points about introducing new technologies thoughtfully. Clearly, you were convinced that this scenario was the right project and the timing was right. Did you have to go and prove that to skeptical leaders and developers? If so, how did you convince them? Are there any examples that you think would help folks in the audience make a similar case? Similarly, under "was it worth it?" you share some great personal feedback. Is it fair to say that the application telemetry indicated better performance, reliability, and stability as a result? Was the end user experience remarkably better? Is the team able to ship or learn faster as a result? What would the call to action be for folks, as a result of these learnings? To re-examine technology choices for certain scenarios? If so, what are those scenarios? As a side note, MongoDB is indeed still used in this pipeline as write-only output database which another platform still depends on, we have plans to replace it with Scylla in the future
  32. Thank you all very much for attending today. In due time, you will find this presentation available on the InfoQ and ScyllaDB website for on-demand viewing. If you would like to weigh in on what we present in the future, please Contact Us, either via the form on our website, or on Twitter. We’d love to hear your ideas. For now, on behalf of ________ and myself, and all of us at ScyllaDB, enjoy the rest of your day.