This document discusses transactional stream processing and operational state. It argues that integrating state management and stream processing within the same transactional system avoids the failure modes that arise when separate systems fail independently, and reduces the need for "glue code". It gives examples of how transactional stream processing enables features like correlation, deduplication, and aggregation in a reliable way. For operational workloads like counting, accounting, and statistics, the key requirements are ensuring idempotence and executing operations atomically within transactions.
Transactional Streaming: If you can compute it, you can probably stream it.
1. Transactional Streaming
If you can compute it, you can probably stream it.
John Hugg
March 30th, 2016
@johnhugg / jhugg@voltdb.com
2. Who Am I?
• First developer on the VoltDB project.
• Previously at Vertica and other data startups.
• Have made so many bad decisions over the years that now I almost know what I'm talking about.
• jhugg@voltdb.com
• @johnhugg
• http://chat.voltdb.com
4. Operations at Scale
• Ingest data from several sources into a horizontally scalable system.
• Process data on arrival (i.e., transform, correlate, filter, and aggregate data).
• Understand, act, and record.
• Push relevant data to a downstream, big data system.
11. One Size Fits All
• Analytics and operational stateful stores require different storage engines to be optimal: columns vs. rows, Vertica vs. VoltDB.
• Machine learning: multi-dimensional math, search.
• Microservices?
• Data value?
13. What’s the Difference?
• Non-integrated systems mean you write glue code, or you use someone’s glue code.
• Operational glue code is different from batch-oriented glue code.
• Batch or OLAP has huge safety nets for glue code:
• HDFS, CSV, immutable data sets
• “Blow it away and reload”
• Much less time pressure
14. Glue
[Diagram: each system in the pipeline is tested well, with thousands of users (or community supplied, with many users), while the glue code between them is something you wrote, with exactly one user.]
15. But I’m not writing “glue code”
“I’m just using the well-tested Cassandra driver in my Storm code.”
• You’re using a computer network. They are not always reliable.
• Storm might fail in the middle of processing.
• Cassandra might fail in the middle of processing.
• Both systems are tested for this, but not together, using your glue code.
18. Use the same system for state and processing.
Ensures they are tested together. No independent failures.
19. 1 Transaction = 1 Event
ACID
• Atomic: Either 100% done or 0% done. No in-between.
• (Consistent)
• Isolated: Two concurrent operations can’t interfere with each other
• Durable: If it says it’s done, then it is done.
26. Call Center Management
Events:
• “Begin Call”: Calling Number, Agent Id, Start Time, etc.
• “End Call”: Calling Number, Agent Id, End Time, etc.
27. What Kind of Problems?
• Correlation - Streaming join
• Out-of-order delivery
• At-least-once delivery - How to dedup
• Generate a new event on call completion - exactly once
• Precise Accounting
• Precise Stats - Event time vs. processing time
31. Schema for Call Center Example
CREATE TABLE opencalls
(
call_id BIGINT NOT NULL,
agent_id INTEGER NOT NULL,
phone_no VARCHAR(20 BYTES) NOT NULL,
start_ts TIMESTAMP DEFAULT NULL,
end_ts TIMESTAMP DEFAULT NULL,
PRIMARY KEY (call_id, agent_id, phone_no)
);
CREATE TABLE completedcalls
(
call_id BIGINT NOT NULL,
agent_id INTEGER NOT NULL,
phone_no VARCHAR(20 BYTES) NOT NULL,
start_ts TIMESTAMP NOT NULL,
end_ts TIMESTAMP NOT NULL,
duration INTEGER NOT NULL,
PRIMARY KEY (call_id, agent_id, phone_no)
);
Unpaired call begin/end events can arrive in any order. Any match transactionally moves to the completed calls table.
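Not part of the deck, but to make the flow concrete, here is a minimal sketch of a VoltDB Java client submitting a begin-call event against this schema. The stored procedure name HandleBeginCall and its parameter order are assumptions for this sketch (the procedure itself is sketched after slide 38); the client calls follow the standard org.voltdb.client API.

import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class CallEventSender {
    public static void main(String[] args) throws Exception {
        // Connect to any node of the VoltDB cluster.
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // Submit one "Begin Call" event. VoltDB TIMESTAMP parameters are
        // microseconds since the epoch when passed as a long.
        ClientResponse resp = client.callProcedure(
                "HandleBeginCall",                   // hypothetical procedure name
                1001L,                               // call_id
                42,                                  // agent_id
                "+1-617-555-0199",                   // phone_no
                System.currentTimeMillis() * 1000L); // start_ts

        if (resp.getStatus() != ClientResponse.SUCCESS) {
            System.err.println("Event failed: " + resp.getStatusString());
        }
        client.drain();
        client.close();
    }
}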
33. Idempotence
Idempotence is the property of certain operations in mathematics and computer science that can be applied multiple times without changing the result beyond the initial application.
34. Idempotent vs. Not Idempotent
Idempotent:
• "set x = 5;" is the same as "set x = 5; set x = 5;"
• "if (x % 2 == 0) x++;" is the same as running it twice
• Spilling coffee on brown pants
Not idempotent:
• "x++;" is not the same as "x++; x++;"
• "if (x % 2 == 0) x *= 2;" is not the same as running it twice
• Eating a whole plate of spaghetti
36. How to make BeginCall Idempotent?
• If the call record is in completed calls, ignore.
• If the call record is in open calls and is missing an end time, ignore.
• If the call record is in open calls, check if this event completes the call. (Yes, handle swapped begin and end events.)
• Otherwise, create a new record in the open calls table.
Tables: open calls, completed calls
37. The checks above are exactly what make the operation idempotent.
38. And the whole decision procedure, taken together, is a single transaction.
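A minimal sketch of that transaction as a VoltDB stored procedure, written against the slide 31 schema. This is an illustration, not the talk's actual code; the procedure name and the duration units are assumptions.

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

public class HandleBeginCall extends VoltProcedure {
    private final SQLStmt checkCompleted = new SQLStmt(
        "SELECT call_id FROM completedcalls WHERE call_id = ? AND agent_id = ? AND phone_no = ?;");
    private final SQLStmt checkOpen = new SQLStmt(
        "SELECT end_ts FROM opencalls WHERE call_id = ? AND agent_id = ? AND phone_no = ?;");
    private final SQLStmt insertOpen = new SQLStmt(
        "INSERT INTO opencalls (call_id, agent_id, phone_no, start_ts) VALUES (?, ?, ?, ?);");
    private final SQLStmt deleteOpen = new SQLStmt(
        "DELETE FROM opencalls WHERE call_id = ? AND agent_id = ? AND phone_no = ?;");
    private final SQLStmt insertCompleted = new SQLStmt(
        "INSERT INTO completedcalls (call_id, agent_id, phone_no, start_ts, end_ts, duration) " +
        "VALUES (?, ?, ?, ?, ?, ?);");

    public long run(long callId, int agentId, String phoneNo, long startTs) {
        voltQueueSQL(checkCompleted, callId, agentId, phoneNo);
        voltQueueSQL(checkOpen, callId, agentId, phoneNo);
        VoltTable[] r = voltExecuteSQL();

        // Rule 1: call already completed, so this is a duplicate. Ignore.
        if (r[0].getRowCount() > 0) return 0;

        if (r[1].getRowCount() > 0) {
            r[1].advanceRow();
            long endTs = r[1].getTimestampAsLong(0);
            // Rule 2: a begin is already recorded and the end is still pending. Ignore.
            if (r[1].wasNull()) return 0;
            // Rule 3: the end event arrived first; this begin completes the call.
            int duration = (int) ((endTs - startTs) / 1_000_000L); // seconds, assuming microsecond timestamps
            voltQueueSQL(deleteOpen, callId, agentId, phoneNo);
            voltQueueSQL(insertCompleted, callId, agentId, phoneNo, startTs, endTs, duration);
            voltExecuteSQL(true);
            return 1;
        }

        // Rule 4: nothing seen yet, so open a new call record.
        voltQueueSQL(insertOpen, callId, agentId, phoneNo, startTs);
        voltExecuteSQL(true);
        return 0;
    }
}

Running this twice with the same event leaves the tables unchanged after the first run, which is the idempotence the slide asks for, and atomicity means a crash mid-procedure can never leave a half-moved call.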
42. Processing Code for a Single Event
[Diagram: multiple copies of the per-event processing code all reading and writing the same database/state, not isolated from one another.]
43. Counting
Systems that can get counting right:
• Systems with single-key consistency
• Systems with special features to enable counters
• ACID transactional systems
• Systems that enforce a single writer
As we say in New England: performance is wicked variable. (Not “Read Committed”.)
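In an ACID system, a correct counter is just a transactional read-modify-write, with no special counter machinery. A minimal sketch, assuming a hypothetical counters table (k VARCHAR primary key, n BIGINT):

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class BumpCounter extends VoltProcedure {
    private final SQLStmt bump = new SQLStmt(
        "UPDATE counters SET n = n + ? WHERE k = ?;");
    private final SQLStmt init = new SQLStmt(
        "INSERT INTO counters (k, n) VALUES (?, ?);");

    public long run(String key, long delta) {
        voltQueueSQL(bump, delta, key);
        // UPDATE reports how many rows it touched; 0 means the counter
        // doesn't exist yet. Since the whole procedure is one transaction,
        // no other increment can sneak in between the two statements.
        if (voltExecuteSQL()[0].asScalarLong() == 0) {
            voltQueueSQL(init, key, delta);
            voltExecuteSQL(true);
        }
        return 0;
    }
}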
44. Accounting
• Accounting is just counting, but more so.
• Need to be able to increment by amount (or decrement).
• Often need to increment/decrement things in groups.
45. Accounting
• When gamer buys a Mystical Sword of Hegemony, update the following:
• Debit the gamer’s rubies or whatever.
• Update real-world region stats, like swords sold in the gamer’s geo-region, total money spent in the gamer’s geo-region, etc.
• Update game region stats for the current game location, say the “Tar Shoals of Dintymoore”, like the number of MSoHs in the region.
• Increment any offer-related stats, like recording whether the MSoH was offered because of customer engagement algorithm X15 or B12.
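All of those updates need to land together or not at all. A sketch of how they might be grouped into one transaction; every table and column name here is invented for illustration:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class BuySword extends VoltProcedure {
    private final SQLStmt debit = new SQLStmt(
        "UPDATE players SET rubies = rubies - ? WHERE player_id = ? AND rubies >= ?;");
    private final SQLStmt geoStats = new SQLStmt(
        "UPDATE geo_stats SET swords_sold = swords_sold + 1, revenue = revenue + ? WHERE geo_region = ?;");
    private final SQLStmt gameStats = new SQLStmt(
        "UPDATE game_region_stats SET msoh_count = msoh_count + 1 WHERE game_region = ?;");
    private final SQLStmt offerStats = new SQLStmt(
        "UPDATE offer_stats SET conversions = conversions + 1 WHERE algorithm = ?;");

    public long run(long playerId, String geoRegion, String gameRegion,
                    String algorithm, long price) {
        // Debit only if the gamer can afford it.
        voltQueueSQL(debit, price, playerId, price);
        if (voltExecuteSQL()[0].asScalarLong() == 0) {
            return 0; // insufficient rubies; none of the stats run
        }
        // The purchase went through: update every stat in the same
        // transaction, so the books can never disagree with the debit.
        voltQueueSQL(geoStats, price, geoRegion);
        voltQueueSQL(gameStats, gameRegion);
        voltQueueSQL(offerStats, algorithm);
        voltExecuteSQL(true);
        return 1;
    }
}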
47. Accounting
The same landscape as for counting:
• Systems with single-key consistency
• Systems with special features to enable counters
• ACID transactional systems
• Systems that enforce a single writer
As we say in New England: performance is wicked variable.
48. Last Dollar Problem
• Ad-Tech app wants to show a user an ad from a campaign.
• The price of the ad is $0.90.
• Advertiser has $1.00 campaign budget left.
• If the budget check and the display aren’t ACID, it’s possible to decide to show the ad twice.
• The Ad-Tech app is forced to choose between over-billing and under-billing.
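The transactional fix is to fold the budget check and the spend into one atomic conditional update, in the same spirit as the sketch above; the table and procedure names are again invented:

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class ReserveAdSpend extends VoltProcedure {
    // The WHERE clause is the budget check. Because check and debit are one
    // atomic statement inside one transaction, the last dollar can only be
    // spent once.
    private final SQLStmt reserve = new SQLStmt(
        "UPDATE campaigns SET budget_cents = budget_cents - ? " +
        "WHERE campaign_id = ? AND budget_cents >= ?;");

    public long run(int campaignId, long priceCents) {
        voltQueueSQL(reserve, priceCents, campaignId, priceCents);
        // 1 = budget reserved, show the ad; 0 = budget exhausted, don't.
        return voltExecuteSQL(true)[0].asScalarLong();
    }
}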
49. Aggregation
• Aggregation is just counting and accounting that the system does for you.
• Often this is counting chopped up by groups.
• E.g., sword sales by region; % success by offer.
• In the call center example, it could be average call length by agent.
50. Aggregation
The same landscape once more:
• Systems with single-key consistency
• Systems with special features to enable counters
• ACID transactional systems
• Systems that enforce a single writer
As we say in New England: performance is wicked variable.
51. How to Aggregate Without Consistency?
• Use a stand-alone stream processor.
• Best fit for aggregation by time, and specifically by processing time, not event time.
• Run a query on all the data every time you want the aggregation.
• BOO!
52. Actual Math
What’s the mean and standard deviation of call length, chopped up various ways?
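One way this "actual math" stays mostly typing: transactionally maintain count, sum, and sum of squares per group, then derive mean and standard deviation on read. A sketch, assuming a hypothetical call_stats table (agent_id, n, total, total_sq):

import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

public class RecordCallLength extends VoltProcedure {
    private final SQLStmt bump = new SQLStmt(
        "UPDATE call_stats SET n = n + 1, total = total + ?, total_sq = total_sq + ? " +
        "WHERE agent_id = ?;");
    private final SQLStmt init = new SQLStmt(
        "INSERT INTO call_stats (agent_id, n, total, total_sq) VALUES (?, 1, ?, ?);");

    public long run(int agentId, long duration) {
        long sq = duration * duration;
        voltQueueSQL(bump, duration, sq, agentId);
        // First call for this agent: initialize the row instead.
        if (voltExecuteSQL()[0].asScalarLong() == 0) {
            voltQueueSQL(init, agentId, duration, sq);
            voltExecuteSQL(true);
        }
        return 0;
    }
}

On read: mean = total / n, and stddev = sqrt(total_sq / n - mean^2). Because every event updates the trio atomically, a reader can never see a sum that disagrees with its count.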
55. The Details (mostly) Don’t Matter
• Still need to think about performance and likely horizontal partitioning of work.
• Integration of State & Processing + Full ACID Transactions => I can program this math without thinking about:
• Failure
• Interference from weak isolation.
• Partial Visibility to State
58. Low Latency Can Affect the Decision
[Chart: a response-time scale with a 500ms mark. You want to be on the fast side of it; on the slow side, you lose money.]
59. Get Into the “Fast Path”
• Policy Enforcement in Telco
• Fraud Detection “Smoke Tests”
• Change what a user sees in response to action:
• Change the next webpage content based on recent website actions.
• Pick what’s behind the magic door based on how the game is going.
62. When Imperfect is Enough
• Before: No metadata. Maintenance works on stuff based on experience, schedules, and visual inspection.
• Now: A basic stream processing system is up 99% of the time and provides much richer guidance to maintenance. Robots fail less often and cost less to operate.
• Possible future: More sophisticated stream processing is up 99.99% of the time and offers even more insight. Robots fail a tiny bit less often and costs come down a tiny bit more.
63. When Imperfect Isn’t Worth It
Expected total cost = Cost of System X + (# of operations x probability of failure under system X x expected average failure cost)
Cost of System X = licenses + hardware + engineering (switching tech)
• I’ve worked on Ad-Tech use cases => high # of operations
• Complex multi-cluster/system monsters => high % failure
• Billing systems and fraud systems => high cost per failure
64. More consistent systems don’t have to be more expensive.
Easier to develop => less engineering. More efficient => less hardware.
65. Conclusion - Thank You!
• Operations => integration wins. Analytics, batch => use specialized tools.
• With transactions, complex math becomes mostly typing.
• Many of these problems can be solved without transactional streaming, but…
• It’s going to be harder
• It might be less accurate
[Diagram: this talk, placed on a spectrum running from “Stuff I Know” through “Stuff I Don’t Know” to “BS”.]
http://chat.voltdb.com
@johnhugg
jhugg@voltdb.com
all images from wikimedia w/ cc license unless otherwise noted