This document summarizes Loggly's transition from their first generation log management infrastructure to their second generation infrastructure built on Apache Kafka, Twitter Storm, and ElasticSearch on AWS. The first generation faced challenges around tightly coupling event ingestion and indexing. The new system uses Kafka as a persistent queue, Storm for real-time event processing, and ElasticSearch for search and storage. This architecture leverages AWS services like auto-scaling and provisioned IOPS for high availability and scale. The new system provides improved elasticity, multi-tenancy, and a pre-production staging environment.
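The core decoupling described above — a durable queue sitting between event ingestion and indexing — can be illustrated with a minimal in-process sketch. Here Python's `queue.Queue` stands in for Kafka and a list stands in for ElasticSearch; all names are illustrative, not Loggly's actual code:

```python
import queue
import threading

# A bounded queue stands in for Kafka: ingestion never waits on indexing,
# and bursts are absorbed up to the queue's capacity.
events = queue.Queue(maxsize=10_000)
indexed = []

def ingest(event: dict) -> None:
    """Accept an event immediately; indexing happens later, independently."""
    events.put(event)

def index_worker() -> None:
    """Drain the queue and 'index' events at the indexer's own pace."""
    while True:
        event = events.get()
        if event is None:          # sentinel: shut down
            break
        indexed.append(event)      # real system: bulk-write to ElasticSearch
        events.task_done()

worker = threading.Thread(target=index_worker)
worker.start()
for i in range(5):
    ingest({"msg": f"log line {i}"})
events.put(None)
worker.join()
print(len(indexed))  # 5
```

The point of the real architecture is the same as this toy's: a spike in `ingest` calls fills the queue rather than overloading the indexer, which is what lets ingestion and indexing scale independently.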
Why @Loggly Loves Apache Kafka, and How We Use Its Unbreakable Messaging for ... (SolarWinds Loggly)
Agenda for this Presentation
• The challenges of Log Management at scale
• Overview of Loggly’s processing pipeline
• Alternative technologies considered
• Why we love Apache Kafka
• How Kafka has added flexibility to our pipeline

The Challenges of Log Management at Scale
• Big data
– >750 billion events logged to date
– Sustained bursts of 100,000+ events per second
– Data space measured in petabytes
• Need for high fault tolerance
• Near real-time indexing requirements
• Time-series index management
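The last challenge, time-series index management, typically means writing to a per-day index and expiring indices that fall outside the retention window. A minimal sketch of that bookkeeping (the naming scheme and retention logic are illustrative, not Loggly's actual convention):

```python
from datetime import date, timedelta

def index_name(day: date, prefix: str = "logs") -> str:
    """Daily index name, e.g. logs-2014.06.18."""
    return f"{prefix}-{day:%Y.%m.%d}"

def expired_indices(existing: list[str], today: date, retention_days: int) -> list[str]:
    """Indices older than the retention window, eligible for deletion."""
    keep = {index_name(today - timedelta(days=d)) for d in range(retention_days)}
    return sorted(i for i in existing if i not in keep)

today = date(2014, 6, 18)
existing = [index_name(today - timedelta(days=d)) for d in range(5)]
print(expired_indices(existing, today, retention_days=3))
# ['logs-2014.06.14', 'logs-2014.06.15']
```

Dropping a whole expired index is far cheaper than deleting individual documents, which is why time-series stores lean on this pattern.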
The need for gleaning answers from unbounded data streams is moving from nicety to necessity. Netflix is a data-driven company that needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
A talk given on 2018-06-16 in HK Open Source Conference 2018.
The rise of Apache Kafka has started a new generation of data pipeline: the stream-processing pipeline.
In this talk, Dr. Mole Wong will walk you through the concept of the stream-processing data pipeline, and how this data pipeline can be set up. He will also discuss the use cases of such a data pipeline.
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 (Monal Daxini)
The Netflix Keystone Pipeline processes 600 billion events a day; this is a detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Netflix Keystone streaming data pipeline @scale in the cloud - DBTB 2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans for offering Stream Processing as a Service for all of Netflix.
The need for gleaning answers from data in real time is moving from nicety to necessity. There are few options for analyzing the never-ending stream of unbounded data at scale. Let's compare and contrast the core principles and technologies of the different open source solutions available to help with this endeavor, and consider where processing engines need to evolve to solve processing needs at scale. These findings are based on the experience of continuing to build a scalable solution in the cloud to process over 700 billion events per day at Netflix, and on how we are embarking on the next journey to evolve unbounded data processing engines.
The Netflix Way to deal with Big Data Problems (Monal Daxini)
Netflix is a data-driven company with a unique culture. Come take a holistic tour of the Big Data ecosystem and see how Netflix culture catalyzes the development of systems. Then ogle at how we quickly evolved and scaled the event pipeline to 1 trillion events per day and over 1.4 PB of event data without service disruption, with a small team.
Building an Event-oriented Data Platform with Kafka - Eric Sammer (Confluent)
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Many organizations understand the use cases around their data – fraud detection, quality of service and technical operations, user behavior analysis, for example – but are not necessarily data infrastructure experts. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes an hour of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality.
Attendees will leave this session knowing not just which open source projects go into a system such as this, but how they work together, what tradeoffs and decisions need to be addressed, and how to present a single general purpose data platform to multiple applications. This session should be attended by data infrastructure engineers and architects planning, building, or maintaining similar systems.
Streaming in Practice - Putting Apache Kafka in Production (Confluent)
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Building Event-Driven Systems with Apache Kafka (Brian Ritchie)
Event-driven systems provide simplified integration, easy notifications, inherent scalability, and improved fault tolerance. In this session we'll cover the basics of building event-driven systems and then dive into utilizing Apache Kafka for the infrastructure. Kafka is a fast, scalable, fault-tolerant publish/subscribe messaging system developed at LinkedIn. We will cover the architecture of Kafka and demonstrate code that utilizes this infrastructure, including C#, Spark, ELK, and more.
Sample code: https://github.com/dotnetpowered/StreamProcessingSample
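The publish/subscribe pattern this session builds on can be sketched in a few lines: an append-only log that multiple consumers read independently, each tracking its own offset. This is an in-memory stand-in for a Kafka topic, purely illustrative:

```python
class Topic:
    """In-memory stand-in for a Kafka topic: an append-only log that
    several consumers read independently via their own offsets."""
    def __init__(self) -> None:
        self.log: list[str] = []
        self.offsets: dict[str, int] = {}

    def publish(self, message: str) -> None:
        self.log.append(message)

    def poll(self, consumer: str) -> list[str]:
        """Return messages this consumer hasn't seen, then advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return batch

topic = Topic()
topic.publish("order-created")
topic.publish("order-paid")
print(topic.poll("billing"))    # ['order-created', 'order-paid']
topic.publish("order-shipped")
print(topic.poll("billing"))    # ['order-shipped']
print(topic.poll("analytics"))  # all three events: independent offset
```

The key property, as in Kafka itself, is that adding a new subscriber ("analytics") requires no change to publishers or to other consumers.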
Beaming Flink to the Cloud @ Netflix, FF 2016 (Monal Daxini)
Netflix is a data-driven company, and we process over 700 billion streaming events per day with at-least-once processing semantics in the cloud. To enable extracting intelligence from this unbounded stream easily, we are building Stream Processing as a Service (SPaaS) infrastructure so that the user can focus on extracting value and not have to worry about boilerplate infrastructure and scale.
We will share our experience in building a scalable SPaaS using Flink, Apache Beam and Kafka as the foundation layer to process over 1.3 PB of event data without service disruption.
The first presentation for Kafka Meetup @ LinkedIn (Bangalore), held on 2015/12/5.
It provides a brief introduction to the motivation for building Kafka and how it works from a high level.
Please download the presentation if you wish to see the animated slides.
Akka, Spark or Kafka? Selecting The Right Streaming Engine For the Job (Lightbend)
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low (or high) is needed?
* High volume: How much volume must be handled?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
URP? Excuse You! The Three Metrics You Have to Know (Confluent)
(Todd Palino, LinkedIn) Kafka Summit SF 2018
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker, that will leave you an expert in identifying problems with the least amount of pain:
-Under-replicated Partitions: The mother of all metrics
-Request Latencies: Why your users complain
-Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
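The thread-pool question above ("how could 80% be a problem?") comes down to headroom: Kafka brokers report a request-handler average-idle ratio, and even 20% idle can mean requests are already queueing during bursts. A toy threshold check makes the idea concrete (the thresholds here are illustrative, not the talk's recommendations):

```python
def handler_pool_status(avg_idle: float) -> str:
    """Classify broker request-handler pool health from its average idle
    ratio (0.0 = fully busy, 1.0 = fully idle). Thresholds are illustrative."""
    if avg_idle < 0.1:
        return "critical: requests are almost certainly queueing"
    if avg_idle < 0.3:
        return "warning: little headroom for bursts"
    return "ok"

# 80% utilization == 0.2 idle: already worth paging on, per the talk's framing.
print(handler_pool_status(0.2))
```

The counterintuitive part is that the alert fires well before the pool is saturated, because latency degrades long before idle time reaches zero.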
Kafka makes so many things easier to do, from managing metrics to processing streams of data. Yet it seems that so many things we have done to this point in configuring and managing it have been object lessons in how to make our lives, as the plumbers who keep the data flowing, more difficult than they have to be. What are some of our favorites?
* Kafka without access controls
* Multitenant clusters with no capacity controls
* Worrying about message schemas
* MirrorMaker inefficiencies
* Hope and pray log compaction
* Configurations as shared secrets
* One-way upgrades
We’ve made a lot of progress over the last few years improving the situation, in part by focusing some of this incredibly talented community towards operational concerns. We’ll talk about the big mistakes you can avoid when setting up multi-tenant Kafka, and some that you still can’t. And we will talk about how to continue down the path of marrying the hot, new features with operational stability so we can all continue to come back here every year to talk about it.
Deploying Kafka at Dropbox - Mark Smith, Sean Fellows (Confluent)
At Dropbox we are currently handling approximately 10,000,000 messages per second at peak across our handful of Kafka clusters, the largest of which has hit throughputs of 7,000,000 messages per second (~30 Gbps) on only 20 nodes. We'll walk you through the steps we took to get where we are and the designs that worked for us — and those that didn't. We'll talk about the tooling we had to build and what we want to see exist.
We’ll dive deeper into configuration and provide a blueprint you can follow. We’ll talk about the trials and tribulations of using Kafka — including ways we’ve set our clusters on fire, ways we’ve lost data, ways we’ve turned our hairs gray, and ways we’ve heroically saved the day for our users. Finally, we’ll spend time on some of the work we’re doing to handle consumer coordination across our many different systems and to integrate Kafka into a well-established corporate infrastructure (i.e., making Kafka "play nice" with everybody).
Apache Pulsar: Why Unified Messaging and Streaming Is the Future - Pulsar Sum... (StreamNative)
Data insights and data-driven strategies create the competitive differentiators companies thrive on today. The need for unified messaging and streaming has never been more apparent.
Pulsar started with the goal of building a global, geo-replicated infrastructure to serve Yahoo!’s messaging needs. With the increased need to process both business events (such as payment request, billing request) and operational events (such as log data, click events, etc), the team at Yahoo! set out to build a true unified infrastructure platform to handle all in-motion data. That technology became Apache Pulsar.
In this talk, Matteo Merli and Sijie Guo will dive into the landscape of unified messaging and streaming, how Pulsar helps companies achieve this vision, and what the future of Pulsar will look like.
Should you read Kafka as a stream or in batch? Should you even care? | Ido Na... (HostedbyConfluent)
Should you consume Kafka in a stream OR batch? When should you choose each one? What is more efficient, and cost effective?
In this talk we’ll give you the tools and metrics to decide which solution you should apply when, and show you a real life example with cost & time comparisons.
To highlight the differences, we’ll dive into a project we’ve done, transitioning from reading Kafka in a stream to reading it in batch.
By turning conventional thinking on its head and reading our multi-petabyte Kafka stream in batch using Spark and Airflow, we’ve achieved a huge cost reduction of 65% while at the same time getting a more scalable and resilient solution.
We’ll explore the tradeoffs and give you the metrics and intuition you’ll need to make such decisions yourself.
We’ll cover:
Costs of processing in stream compared to batch
Scaling up for bursts and reprocessing
Making the tradeoff between wait times and costs
Recovering from outages
And much more…
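The stream-versus-batch cost tradeoff above can be framed with a back-of-the-envelope model: a streaming consumer pays for capacity around the clock, while a batch job pays only for its run time but adds latency. A toy calculation (all instance counts, prices, and durations here are made-up inputs, not the talk's figures):

```python
def monthly_cost(instances: int, hourly_price: float, hours: float) -> float:
    """Simple linear cost model: machines x price x billed hours."""
    return instances * hourly_price * hours

# Hypothetical numbers, purely for illustration.
stream = monthly_cost(instances=10, hourly_price=0.50, hours=24 * 30)  # always on
batch = monthly_cost(instances=40, hourly_price=0.50, hours=2 * 30)    # 2h nightly run

print(f"stream: ${stream:.0f}/mo, batch: ${batch:.0f}/mo")
print(f"batch saves {100 * (1 - batch / stream):.0f}%")
```

The batch job can even use more machines per run and still come out cheaper, because it only pays for the hours it actually works — the flip side being the wait-time cost the talk weighs against it.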
06/18/2014 - Billing & Payments Engineering Meetup @ Netflix
For this Meetup, we have invited speakers from several tech companies to give a series of lightning talks on challenges related to billing & payments systems.
This event is for engineers who are interested in learning more about billing & payments systems. No previous experience with this kind of system is required to attend.
Presenters:
- Mathieu Chauvin - Engineering Manager for Payments @ Netflix
- Taylor Wicksell - Sr. Software Engineer for Billing @ Netflix
- Jean-Denis Greze - Engineer @ Dropbox
- Alec Holmes - Software Engineer @ Square
- Emmanuel Cron - Software Engineer III, Google Wallet @ Google
- Paul Huang - Engineering Manager @ Survey Monkey
- Anthony Zacharakis - Lead Engineer @ Lumos Labs
- Shengyong Li / Feifeng Yang - Dir. Engineering Commerce / Tech Lead Payment @ Electronic Arts
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS (Lightbend)
Apache Kafka–part of Lightbend Fast Data Platform–is a distributed streaming platform that is best suited to run close to the metal on dedicated machines in statically defined clusters. For most enterprises, however, these fixed clusters are quickly becoming extinct in favor of mixed-use clusters that take advantage of all infrastructure resources available.
In this webinar by Sean Glover, Fast Data Engineer at Lightbend, we will review leading Kafka implementations on DC/OS and Kubernetes to see how they reliably run Kafka in container orchestrated clusters and reduce the overhead for a number of common operational tasks with standard cluster resource manager features. You will learn specifically about concerns like:
* The need for greater operational knowhow to do common tasks with Kafka in static clusters, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers.
* The best way to provide resources to stateful technologies while in a mixed-use cluster, noting the importance of disk space as one of Kafka’s most important resource requirements.
* How to address the particular needs of stateful services in a model that natively favors stateless, transient services.
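The stateful-service tension in the last point is typically resolved on Kubernetes with a StatefulSet plus per-broker persistent volume claims, which give each broker a stable identity and its own durable disk. A trimmed, illustrative fragment (the image, names, and sizes are placeholders, not a vetted production config):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: kafka:placeholder-tag   # substitute a real image/version
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:                  # each broker gets its own stable disk
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi
```

The `volumeClaimTemplates` section is the part that matters for Kafka: when a broker pod is rescheduled, it reattaches to the same claim, so its partition data survives the move.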
This is the slide deck which was used for a talk 'Change Data Capture using Kafka' at Kafka Meetup at Linkedin (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
My most important lesson from working on 7+ cloud-based products:
“Failing to prepare for failure is costly... but failing to prepare for success can be even worse”.
Learn from my experience - read this deck to learn the top 6 SaaS mistakes you should avoid.
Rumble Entertainment GDC 2014: Maximizing Revenue Through Logging (SolarWinds Loggly)
How do you balance the dynamic needs of millions of concurrent users, support hundreds of developers, optimize a distributed cloud environment, push code releases daily, and stay sane? It starts with visibility into all your data and real-time insight from your log files.
In this session at the 2014 Game Developers Conference, Albert Ho provided a step-by-step outline of how Rumble Entertainment maximizes revenue in online games by:
- Instrumenting its games and platform to facilitate global troubleshooting
- Gathering insight from log files to identify, quantify, and solve operational issues
- Conducting transactional tracing, starting from player feedback on open social networks, to isolate the source, server, and all users affected by an issue for remediation
- Baking Loggly into the DNA of its games to serve as a single source of truth, and extending it to how the company works with partners
- Correlating real-time visibility directly to revenue to speed decision making and planning
Conducting transactional tracing starting from player feedback on open social networks to isolate the source, server and all users affected by the issue for remediation
Baking Loggly into the DNA of its games to serve as a single source of truth and extended it to how we work with partners
Correlating real-time visibility directly to revenue to speed decision making and planning
Изготавливаем презентации для компаний, проектов, услуг, продуктов. СТАТУСНЫЕ «О компании» - визуальной форме рассказ о Вашем профессиональных преимуществах. В результате – утверждается статус и признание компании, достигается доверие к фирме, как к надежному и солидному партнеру.
ИНФОРМАЦИОННЫЕ «О продукте» - это прицельное сообщение о достоинствах и ценностях Вашего продукта/услуги, и перечень основных причин для гордости у покупателей по использованию Вашего товара.
ПРОЕКТНЫЕ «О проекте» – рассеять неизвестность, быть понятым, увеличить клиентов, познакомить и убедить участников или инвесторов данного мероприятия.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
2. What Loggly Does
• Log Management as a service
– Near real-time indexing of events
• Distributed architecture, built on AWS
• Initial production services in 2011
– Loggly Generation 2 released in Sept 2013
• Thousands of customers
3. Agenda for this Presentation
• A bit about logging
• Lessons learned from our first generation
• How we leverage AWS services
• Our use of Kafka, Storm, ElasticSearch
• What worked well for us and what did not
• Where we are going
4. Log Management
• Everyone starts with …
– A bunch of log files (syslog, application specific)
– On a bunch of machines
• Management consists of doing the simple stuff
– Rotate files, compress and delete
– Information is there but awkward to find specific events
– Weird log retention policies evolve over time
5. (chart: Self-Inflicted Pain vs. Log Volume)
“…hmmm, our logs are getting a bit bloated”
“…let’s spend time managing our log capacity”
“…how can I make this someone else’s problem!”
6. Best Practices in Log Management
• Use existing logging infrastructure
– Real time syslog forwarding is built in
– Application log file watching
• Store logs externally
– Accessible when there is a system failure
• Log messages in a machine-parsable format
– JSON encoding when logging structured information
– Key-value pairs
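As a concrete sketch of those last two points — a JSON-encoded, key-value log line using only Python's standard library. The formatter and field names are illustrative, not Loggly's code:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single machine-parsable JSON line."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra key-value pairs attached via logging's `extra=` kwarg.
            **getattr(record, "fields", {}),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment processed", extra={"fields": {"user_id": 42, "amount_usd": 9.99}})
```

A line like this needs no custom parser downstream: any log management system can index the fields directly instead of regex-scraping free text.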
7. From the Trenches…
• Managing Applications vs. Managing Logs
– Do not make this an either/or proposition!
If you get a disk space alert, first login…
% sudo rm -rf /var/log/apache2/*
Admit it, we’ve all seen this kind of thing!
12. Loggly First Generation
• Logging as a service
– Real-time searchable logs
• Thousands of customers
– Transmission rates from 10 events/sec to 100k events/sec
– When customers’ systems are busy they send more logs
– Log traffic has distinct bursts; bursts can last for several hours
• AWS EC2 deployment
– We used EC2 Instance storage
• SOLR Cloud
– Full power of Lucene search
– Tens of thousands of shards (with special ‘sleep shard’ logic)
• ZeroMQ for message queue
13. First Generation Lessons Learned
• Event ingestion too tightly coupled to indexing
– Manual re-indexing for temporary SOLR issues
• Multiple Indexing strategies needed
– 4 orders of magnitude difference between our high-volume users and our low-volume users (10 eps vs. 100,000+ eps)
– Too much system overhead for low volume users
– Difficult to support changing sharding strategies for a customer
14. Big Data Infrastructure Solutions
We are not alone…
• Our challenges
– Massive incoming event stream
– Fundamentally multi-tenant
– Scalable framework for analysis
– Near real-time indexing
– Time-series index management
(diagram: the intersection of Scalability, Real-Time Analytics, and Multi-tenant SaaS)
15. Apache Kafka
• Overview
– An Apache project initially developed at LinkedIn
– Distributed publish-subscribe messaging system
– Specifically designed for real-time activity streams
– Does not follow the JMS standards nor use the JMS APIs
• Key Features
– Persistent messaging
– High throughput
– Uses ZooKeeper for forming a cluster of nodes
– Supports both queue and topic semantics
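That last point — queue vs. topic semantics — can be illustrated with a toy model in plain Python (this is not the Kafka API, just the delivery semantics): every subscriber group sees every published message (topic semantics), while within a group each message goes to exactly one consumer (queue semantics).

```python
from collections import defaultdict


class ToyBroker:
    """Toy pub-sub broker: topic semantics across groups, queue semantics within a group."""

    def __init__(self):
        self.groups = defaultdict(list)  # group name -> list of consumer inboxes

    def subscribe(self, group):
        inbox = []
        self.groups[group].append(inbox)
        return inbox

    def publish(self, message):
        # Every group receives the message (topic semantics) ...
        for inboxes in self.groups.values():
            # ... but only one consumer per group gets it (queue semantics).
            target = min(inboxes, key=len)
            target.append(message)


broker = ToyBroker()
indexer_a = broker.subscribe("indexers")
indexer_b = broker.subscribe("indexers")
archiver = broker.subscribe("archivers")

for event in ["e1", "e2", "e3", "e4"]:
    broker.publish(event)

# The "archivers" group saw all four events; the two "indexers" split them.
```

Kafka implements the same duality with consumer groups: consumers sharing a group ID split a topic's partitions, while distinct groups each get the full stream.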
17. Storm Framework
• Storm (open sourced by Twitter)
– Open sourced September 2011
– Now an Apache Software Foundation project
• Currently in incubator status
• A framework for stream processing
– Distributed, fault-tolerant computation
– Fail-fast components
20. ElasticSearch
• Open source
– Commercial support available from ElasticSearch.com
– Growing open-source community
• Distributed search engine
• Fully exposes Lucene search functionality
• Built for clustering from the ground up
• High availability
• Multi-tenancy
21. ElasticSearch In Action
• Add/delete nodes dynamically
• Add indexes with REST api
• Indexes and Nodes have attributes
– Indexes automatically moved to best Nodes
• Indexes can be sharded
• Supports bulk insertion of events
• Plugins for monitoring cluster
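Bulk insertion uses Elasticsearch's newline-delimited `_bulk` format: an action line followed by the document source, with a required trailing newline. A minimal body builder (the index name and field names are illustrative, and the sketch uses the modern typeless action format):

```python
import json


def build_bulk_body(index, events):
    """Build an NDJSON body for Elasticsearch's _bulk endpoint."""
    lines = []
    for event in events:
        # Action line: tell Elasticsearch which index receives the next doc.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(event))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("customer-42-2013.09.21", [
    {"timestamp": "2013-09-21T10:00:00Z", "message": "GET /health 200"},
    {"timestamp": "2013-09-21T10:00:01Z", "message": "GET /login 302"},
])
# POST this body to http://<node>:9200/_bulk with Content-Type: application/x-ndjson
```

Batching many events into one `_bulk` request is what keeps per-event indexing overhead low at high ingest rates.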
23. Generation 2 – The Challenge
• Always accept log data
– Never make a customer’s incident worse
• Never drop log data
– A single log message could be critical
• True Elasticity
24. Perfect Match For Real Time Log Events
• Apache Kafka
– Extremely high-performance pub-sub persistent queue
• Consumer tracks their location in queue
– A good fit for our use cases
• Multiple Kafka brokers
– Good match for AWS
• Multiple brokers per region
• Availability Zone separation
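The "consumer tracks their location in queue" model is what makes the queue both persistent and replayable: the broker only appends, and each consumer owns its own read position. A pure-Python sketch of the idea (not the Kafka client API):

```python
class PersistentLog:
    """Append-only log: the broker never deletes a message on consumption."""

    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(event)

    def read(self, offset, max_events=100):
        return self.entries[offset:offset + max_events]


class Consumer:
    """Each consumer owns its own offset, so it can fall behind, crash, or replay."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self, offset=0):
        # Replaying history is just resetting the offset.
        self.offset = offset


log = PersistentLog()
for e in ["e1", "e2", "e3"]:
    log.append(e)

indexer = Consumer(log)
first = indexer.poll()     # ["e1", "e2", "e3"]
log.append("e4")
second = indexer.poll()    # ["e4"] — picks up where it left off
indexer.rewind()
replayed = indexer.poll()  # all four events again
```

This is why a temporarily slow or failed indexer never loses data: events stay in the log, and the consumer simply resumes (or rewinds) from its saved offset.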
25. Real Time Event Processing
• Twitter Storm
– Scalable real-time computation system
• Storm used as a “pull” system
– Provisioned for average load, not peak load
– Input from Kafka queue
• Worker nodes can be scaled dynamically
• Elasticity is key
– Another good match for AWS
• Able to scale workers up and down dynamically
28. Loggly Collector Performance
• C++ multi-threaded
• Boost ASIO framework
• Each Collector can handle 250k+ events per second
– Per 1 x EC2 m2.2xlarge instance (300-byte average event size)
31. Event Pipeline in Summary
• Storm provides Complex Event Processing
– Where we run much of our secret-sauce
• Stage 1 contains the raw Events
• Stage 2 contains processed Events
• Snapshot the last day of Stage 2 events to S3
33. Loggly and Index Management
• Indexes are time-series data
– Separated by customer
– Represent slices of time
• Higher volume index will have shorter time slice
• Multi-tier architecture for efficient indexing
– Multiple indexing tiers mapped to different AWS instance types
• Efficient use of AWS resources
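One way to picture the "higher volume gets a shorter time slice" rule: derive each event's index name from the customer and a volume-dependent time bucket. The bucket thresholds and naming scheme below are illustrative, not Loggly's actual mapping.

```python
from datetime import datetime


def index_name(customer_id, events_per_sec, ts):
    """Map an event to a per-customer, time-sliced index name."""
    if events_per_sec >= 10_000:      # high volume: hourly slices
        bucket = ts.strftime("%Y.%m.%d.%H")
    elif events_per_sec >= 100:       # medium volume: daily slices
        bucket = ts.strftime("%Y.%m.%d")
    else:                             # low volume: weekly slices
        bucket = ts.strftime("%Y.w%W")
    return f"customer-{customer_id}-{bucket}"


ts = datetime(2013, 9, 21, 14, 30)
# A 50k eps customer lands in an hourly index; a 10 eps customer in a weekly one.
```

Keeping every index to a similar size this way is what lets a whole time slice be aged out (or moved to a cheaper tier) with a single index delete, instead of deleting individual documents.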
34. AWS Deployment Instances – Stage 1
c1.xlarge
• Compute-optimized
• High-traffic ingestion points
• Disk not important
m2.2xlarge
• Memory-optimized
• Disk buffer caching
4K Provisioned IOPS EBS
• Ensures consistent IO
• No noisy neighbours
• Persistent storage
38. ELB in front of Collector Had Limitations
• Initial testing used AWS Elastic Load Balancer for incoming events:
• ELB doesn’t allow forwarding port 514 (syslog)
• ELB doesn’t support forwarding UDP
• Event traffic can burst and hit ELB performance limits
39. AWS Route 53 DNS Round Robin a Win
• DNS Round Robin is a pretty basic form of load balancing
– Not a bump in the wire
• Take advantage of AWS failover health checks
– When a collector goes out of service, it will be out of the DNS rotation
• Round Robin across multiple regions, AZs
– Latency based resolution optimizes inbound traffic
• Collector failover takes longer than it would with ELB
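The behavior described above — rotate across collectors, dropping any that fail their health check out of the rotation — reduces to a simple model. Endpoint names and the health-check mechanism here are illustrative, not the Route 53 API.

```python
from itertools import cycle


class HealthCheckedRoundRobin:
    """Rotate across endpoints, skipping any currently marked unhealthy."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.healthy = set(self.endpoints)
        self._rotation = cycle(self.endpoints)

    def mark_down(self, endpoint):
        self.healthy.discard(endpoint)

    def mark_up(self, endpoint):
        self.healthy.add(endpoint)

    def resolve(self):
        # Like Route 53 failover: an endpoint failing its health check
        # simply drops out of the rotation until it recovers.
        for _ in range(len(self.endpoints)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy collectors")


dns = HealthCheckedRoundRobin(["collector-a", "collector-b", "collector-c"])
dns.mark_down("collector-b")
# Successive resolutions now alternate between collector-a and collector-c.
```

The failover lag the slide mentions comes from the parts this sketch omits: DNS TTLs and client-side caching mean a downed collector can still receive traffic until cached answers expire, which is slower than an in-path load balancer's failover.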
40. Our First Plan for Log Events
• Cassandra
– Highly scalable key-value store
– Impressive write performance a good match for us
– Apache project plus commercial support with DataStax
• Use Cassandra for both our Event Queue and Persistent Store
– Our strategy was to get the raw event in to Cassandra
– …then perform workflow processing on events
41. Design meets Reality
• Cassandra not designed to be a message queue
• Hard to track Events received out-of-order
• Multi-tenancy requires handling data bursts
– Collectors still needed to be able to buffer to disk
– Added complexity and became a point of failure
44. Kafka enables Staging Architecture
• A Kafka producer doesn’t care if there are multiple consumers
• Staging system runs pre-production code
• Pub-sub allows us to randomly index a fraction of our production load
• A highly effective pre-production system
46. Big Wins
• Leveraging AWS services
– Multi-region, multi-AZ
– Provisioned IOPS for availability and scale
– Route 53 DNS support with latency resolution
– Easy to increase and decrease Storm resources
• Leveraging open-source infrastructure
– Apache Kafka
– Twitter Storm
– ElasticSearch
• Staging
47. Future of Loggly Infrastructure
• Better use of Amazon Auto Scaling
• Keep logs in region they came from
– We currently keep indexed events in a single region
• Integrate with more external data sources
48. Feedback
• Questions?
Send’em to us on twitter: @loggly
Kick the tires, sign up: loggly.com/ace
Jim Nisbet
CTO and VP of Engineering, Loggly
Philip O’Toole
Lead Developer, Infrastructure, Loggly