Thinking in Streaming - Twitter Streaming API

Streaming API clients perform many of the same operations as our Streaming API servers. We'll discuss our streaming server's internal architecture, stream processing algorithms and how they relate to a typical client implementation. The focus will be on techniques for sorting and de-duplicating infinite roughly-sorted at-least-once delivery streams, data loss prevention, scaling for the Firehose, and practical operational experience.

Thinking In Streaming
John Kalucki
@jkalucki
Infrastructure

Turtles All The Way Down
• Your client ≅ Our server

• Gather Events

• Parse JSON

• Match on Predicates

• Route to Consumers

Properties
• Offered

• At Least Once

• Roughly Sorted (K-Sorted)

• Desired

• Exactly Once

• Sorted

Plan
• Over-deliver
Ensure At Least Once

• De-duplicate
Unordered Exactly Once

• Sort
Ordered Exactly Once

Why At Least Once?
• Exactly Once impractical across streams

• Clients must handle reconnect over-delivery

• Reuse this capability to:

• Mask upstream failures

• Relax server restart issues

Startup
• Prefetch from peer to populate circular buffer

• Go multi-user

• Consume Kestrel backlog - duplicates between:

• Buffer and backlog

• Previous connection and backlog

• Steady State: Exactly Once Delivery

Upstream Failure
• Cascaded source fails

• Fail over to next peer

• Over-request to avoid loss

• Steady State: Exactly Once Delivery

Client Over-delivery
• Use Count Parameter after fast reconnect

• Deep backfill from REST API

• Client offline for a while

• User first issues new query

• Overlap connections slightly

Infinite Streams
• De-duplicating a randomly ordered infinite
stream requires infinite time and storage

• Sorting? Ditto

• I have neither infinite time nor storage

Roughly Sorted
• A sequence α = (a₁, …, aₙ) is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i < r - k implies aᵢ ≤ aᵣ

• Strictly sorted is 0-sorted.

• Transpose two adjacent values in a 0-sorted
sequence, becomes 1-sorted.

• K For the firehose?
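
A minimal Python sketch of measuring K for a finite sample of IDs. It follows the convention in the bullets above (sorted is 0-sorted, one adjacent swap gives 1-sorted), so K is the largest distance between any two items that are out of order. The function name `roughness` is hypothetical, not something from the talk.

```python
from bisect import bisect_right


def roughness(ids):
    """Return the smallest k for which `ids` is k-sorted.

    k is the largest distance r - i over all pairs i < r with ids[i] > ids[r].
    """
    prefix_max = []  # prefix_max[j] == max(ids[0..j]); non-decreasing
    k = 0
    for r, value in enumerate(ids):
        # First position whose running maximum exceeds `value`; the item at
        # that position is the earliest one that should have come after it.
        i = bisect_right(prefix_max, value)
        if i < r:
            k = max(k, r - i)
        prefix_max.append(value if not prefix_max else max(prefix_max[-1], value))
    return k


print(roughness([1, 2, 3, 4]))  # 0: already sorted
print(roughness([1, 3, 2, 4]))  # 1: one adjacent transposition
print(roughness([5, 1, 2, 3]))  # 3: the first item is three places early
```

The running-maximum list keeps the scan at O(n log n), which matters if the sample is hundreds of thousands of IDs.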

Firehose K
500k IDs    Sunday Night    Monday Peak
100.0%      3356            2507
99.99%      650             509
99.90%      232             271
99.00%      143             160
90.00%      36              45
50.00%      5               7
Average     14              17

Noisy

Pessimist’s K
• In theory could be hours & millions of events

• Practically, if current and stale queues exist:

• We’ll flush the stale queues before exposing

• You’ll never know this happened

• If all queues stale:

• We’ll deliver the backlog

• K remains reasonable

Unordered
De-duplication
• Create two HashSets, Primary and Secondary, each preallocated to size K

• New event is a duplicate if its ID exists in Primary

• Add new ID to both HashSets

• When Primary.size > K / 2: Primary.clear, then swap Primary & Secondary
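
A minimal Python sketch of the two-set scheme. The slide is terse about the rotation trigger, so this makes an assumption: rotation happens when the younger set (Secondary) reaches `window` entries, which keeps at least the last `window` distinct IDs in Primary at all times after warm-up. The class name, method name and `window` parameter are hypothetical.

```python
class RotatingDeduplicator:
    """Unordered de-duplication with bounded memory using two rotating sets."""

    def __init__(self, window):
        self.window = window    # assumed: number of recent IDs that must be remembered
        self.primary = set()    # older generation, used for membership checks
        self.secondary = set()  # younger generation, promoted on rotation

    def is_new(self, event_id):
        """Return True the first time an ID is seen within the remembered window."""
        if event_id in self.primary:
            return False
        self.primary.add(event_id)
        self.secondary.add(event_id)
        if len(self.secondary) >= self.window:
            # Forget the oldest IDs: drop Primary, promote Secondary.
            self.primary.clear()
            self.primary, self.secondary = self.secondary, self.primary
        return True


dedup = RotatingDeduplicator(window=3)
stream = [1, 2, 2, 3, 1, 4, 5, 3, 6]
print([i for i in stream if dedup.is_new(i)])  # [1, 2, 3, 4, 5, 6]
```

Primary never holds more than 2 × `window` IDs, so memory stays bounded. An ID that reappears after it has been rotated out is treated as new, which is why `window` has to cover the stream's K.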

Unordered
De-duplication
• Bounded memory consumption

• O(n) behavior

• Low latency

• Emit first tweet

• Discard subsequent duplicates

• Cheaper than de-duplication by sorting?
Probably depends on K

Ordered & De-duplicated
• Insertion sort and de-duplicate by ID into a
decreasing order list

• While length > K, remove sorted tail
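
A minimal Python sketch of the sort-and-de-duplicate window. One deviation from the slide, labeled as such: the buffer is kept in ascending order (so `bisect` can be used) and the smallest ID is released from the front, rather than keeping a decreasing list and trimming its tail. Dropping any ID at or below the last released one is also an assumption about how late arrivals are handled. All names are hypothetical.

```python
from bisect import bisect_left


class SortingDeduplicator:
    """Release IDs in increasing order, exactly once, using a window of K items."""

    def __init__(self, k):
        self.k = k
        self.window = []          # buffered IDs, ascending
        self.last_released = None

    def offer(self, event_id):
        """Buffer one arrival; return the IDs this arrival pushes out, in order."""
        if self.last_released is not None and event_id <= self.last_released:
            return []             # duplicate of a released ID, or more than K late
        pos = bisect_left(self.window, event_id)
        if pos < len(self.window) and self.window[pos] == event_id:
            return []             # duplicate of a buffered ID
        self.window.insert(pos, event_id)
        released = []
        while len(self.window) > self.k:
            self.last_released = self.window.pop(0)
            released.append(self.last_released)
        return released

    def flush(self):
        """Release everything still buffered, e.g. at shutdown."""
        released, self.window = self.window, []
        if released:
            self.last_released = released[-1]
        return released


sorter = SortingDeduplicator(k=2)
out = []
for event_id in [3, 1, 2, 2, 5, 4, 1, 6]:
    out.extend(sorter.offer(event_id))
out.extend(sorter.flush())
print(out)  # [1, 2, 3, 4, 5, 6]
```

Nothing is released until the buffer holds more than K items, which is where the latency of K comes from.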

Ordered & De-duplicated
• O(n) to O(n * K), depending on how unsorted the input is

• Bounded memory consumption

• Induces latency of K

• Assumes average items not very unsorted

• K is usually large to handle the outliers

Routing Events
• By Keyword or by UserId

• Add predicates to HashMap

• Apply events to Map

• Query holds private predicate set for later Map
removal

• O(n)
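
A minimal Python sketch of map-based routing. Predicates (keywords or user IDs) are keys in one map, and each query keeps its own predicate set so it can be removed from the map later. The class, method names and query IDs are hypothetical, and tokenizing an event into keywords is assumed to have happened already.

```python
from collections import defaultdict


class Router:
    """Route events to queries through a predicate -> queries map."""

    def __init__(self):
        self.by_predicate = defaultdict(set)  # predicate -> ids of interested queries
        self.by_query = {}                    # query id -> its private predicate set

    def add_query(self, query_id, predicates):
        preds = frozenset(predicates)
        self.by_query[query_id] = preds
        for predicate in preds:
            self.by_predicate[predicate].add(query_id)

    def remove_query(self, query_id):
        # The query's private set says exactly which map entries to clean up.
        for predicate in self.by_query.pop(query_id, ()):
            consumers = self.by_predicate.get(predicate)
            if consumers is None:
                continue
            consumers.discard(query_id)
            if not consumers:
                del self.by_predicate[predicate]

    def route(self, tokens):
        """Return the ids of all queries matching any token of one event."""
        matched = set()
        for token in tokens:
            matched |= self.by_predicate.get(token, set())
        return matched


router = Router()
router.add_query("q1", {"python", "rust"})
router.add_query("q2", {"rust"})
print(sorted(router.route(["i", "like", "rust"])))  # ['q1', 'q2']
```

Each token costs one hash lookup, so the work per event depends on the event's size rather than on the number of registered predicates.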

Reliability
• Decompose

• Decouple

• Monitor

Monitoring
• What to look at?

• Latency

• Throughput

• Errors

• Alerting

Horizontal Scale
• Firehose keeps Growing.

• Eventually Firehose stream will become
impractical.

• Partition the Firehose into N streams.
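
A minimal Python sketch of one way to partition. The slides do not say how the Firehose would be split, so partitioning by ID modulo N is purely an assumption for illustration, as are the function names.

```python
def partition_for(event_id, n):
    """Pick a partition for an event (assumed scheme: ID modulo N)."""
    return event_id % n


def split(ids, n):
    """Split one stream of IDs into n streams, preserving arrival order in each."""
    streams = [[] for _ in range(n)]
    for event_id in ids:
        streams[partition_for(event_id, n)].append(event_id)
    return streams


print(split([10, 11, 13, 12, 14, 15], 2))  # [[10, 12, 14], [11, 13, 15]]
```

Because the scheme is deterministic, every copy of a duplicated event lands in the same partition, so de-duplication can still be done per partition.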

Docker is all the rage these days. While one doesn't hear much about Solr on Docker, we're here to tell you not only that it can be done, but also share how it's done. We'll quickly go over the basic Docker ideas - containers are lighter than VMs, they solve "but it worked on my laptop" issues - so we can dive into the specifics of running Solr on Docker. We'll do a live demo showing you how to run Solr master - slave as well as SolrCloud using containers, how to manage CPU assignments, constraint memory and use Docker data volumes when running Solr in containers. We will also show you how to create your own containers with custom configurations. Finally, we'll address one of the core Solr questions - which deployment type should I use? We will demonstrate performance differences between the following deployment types: - Single Solr instance running on a bare metal machine - Multiple Solr instances running on a single bare metal machine - Solr running in containers - Solr running on virtual machine - Solr running on virtual machine using unikernel For each deployment type we'll address how it impacts performance, operational flexibility and all other key pros and cons you ought to keep in mind.

Reactive Summit 2017 Highlights!

Fabio Tiriticco

Running Kubernetes at scale is challenging and you can often end up in situations where you have to debug complex and unexpected issues. This requires understanding in detail how the different components work and interact with each other. Over the last 3 years, Datadog migrated most of its workloads to Kubernetes and now manages dozens of clusters consisting of thousands of nodes each. During this journey, engineers have debugged complex issues with root causes that were sometimes very surprising. In this talk Laurent and Tabitha will share some of these stories, including a favorite: how a complex interaction between familiar Kubernetes components allowed an OOM-killer invocation to trigger the deletion of a namespace.

Evolution of kube-proxy (Brussels, Fosdem 2020)

Laurent Bernaille

Kube-proxy enables access to Kubernetes services (virtual IPs backed by pods) by configuring client-side load-balancing on nodes. The first implementation relied on a userspace proxy which was not very performant. The second implementation used iptables and is still the one used in most Kubernetes clusters. Recently, the community introduced an alternative based on IPVS. This talk will start with a description of the different modes and how they work. It will then focus on the IPVS implementation, the improvements it brings, the issues we encountered and how we fixed them as well as the remaining challenges and how they could be addressed. Finally, the talk will present alternative solutions based on eBPF such as Cilium.

10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...

Laurent Bernaille

Kubernetes is a very powerful and complicated system, and many users don’t understand the underlying systems. Come learn how your users can abuse container runtimes, overwhelm your control plane, and cause outages - it’s actually quite easy! In the last year, we have containerized hundreds of applications and deployed them in large scale clusters (more than 1000 nodes). The journey was eventful and we learned a lot along the way. We’ll share stories of our ten favorite Kubernetes foot guns, including the dangers of cargo culting, rolling updates gone wrong, the pitfalls of initContainers, and nightmarish daemonset upgrades. The talk will present solutions we adopted to avoid or work around some these problems and will finally show several improvements we plan deploy in the future. Similar to the Kubecon talk with the same title with a few new incidents.

Self Created Load Balancer for MTA on AWSsharu1204

Ease of use in Apache Solr

Anshum Gupta

Realtime Statistics based on Apache Storm and RocketMQ

Xin Wang

Storm worker redesign

Roshan Naik

Making the most out of kubernetes audit logs

Laurent Bernaille

The Kubernetes audit logs are a rich source of information: all of the calls made to the API server are stored, along with additional metadata such as usernames, timings, and source IPs. They help to answer questions such as “What is overloading my control plane?” or “Which sequence of events led to this problematic situation?”. These questions are hard to answer otherwise—especially in large clusters. At Datadog, we have been running clusters with 1000+ nodes for more than a year and during that time, the audit logs have proved invaluable. In this presentation, we will first introduce the audit logs, explain how they are configured, and review the type of data they store. Finally, we will describe in detail several scenarios where they have helped us to diagnose complex problems.

How to tune Kafka® for production

confluent

Docker and Maestro for fun, development and profit

Maxime Petazzoni

SaltConf14 - Anita Kuno, HP & OpenStack - Using SaltStack for event-driven or...

SaltStack

This talk will highlight how the OpenStack Infrastructure team uses SaltStack for event-driven orchestration of its various cloud infrastructure components. The speakers will review the flexibility of Salt in a complex automation environment. Salt plays very well with other tools, including Puppet, which is especially critical in the OpenStack Infrastructure environment which requires the event-driven orchestration functions of Salt to synchronize workflow timing of OpenStack Infrastructure components and events. To learn when and where the next SaltConf will be, subscribe to our newsletter here: http://www.saltstack.com/salt-ink-newsletter or follow us on Twitter: http://www.twitter.com/saltstackinc

Service discovery in Docker environments

alexandru giurgiu

Swift container syncOpen Stack

Kubernetes DNS Horror Stories

Laurent Bernaille

DNS is one of the Kubernetes core systems and can quickly become a source of issues when you’re running clusters at scale. For over a year at Datadog, we’ve run Kubernetes clusters with thousands of nodes that host workloads generating tens of thousands of DNS queries per second. It wasn’t easy to build an architecture able to handle this load, and we’ve had our share of problems along the way. This talk starts with a presentation of how Kubernetes DNS works. It then dives into the challenges we’ve faced, which span a variety of topics related to load, connection tracking, upstream servers, rolling updates, resolver implementations, and performance. We then show how our DNS architecture evolved over time to address or mitigate these problems. Finally, we share our solutions for detecting these problems before they happen—and identifying misbehaving clients.

Spinnaker - Bay Area AWS Meetup - 20160726

Adam Jordens

Scaling an invoicing SaaS from zero to over 350k customers

Speck&Tech

ABSTRACT: Fatture in Cloud was born in late 2013 on a single-server machine and scaled from zero to 35k customers at the end of 2018. Then, we faced the mandatory electronic invoicing which came into effect in Italy on 1st January 2019, and we experienced a huge growth to 350k customers in few months. In these 5 years, I've learned a lot about cloud architecture, scalability, optimization, DevOps, and we eventually achieved a 99,99% uptime even in the huge growth period. BIO: Daniele Ratti is the Founder and CEO of Fatture in Cloud, which is currently the leader invoicing platform in Italy, counting more than 350k customers.

Integration testing for salt states using aws ec2 container service

SaltStack

A SaltConf16 use case talk by Steven Braverman of Dun & Bradstreet. Testing configuration changes for multiple server roles can be time consuming when real instances or legacy container systems are used. Applying configuration changes to each role in parallel can be difficult. So what's the best way to test configuration changes efficiently, quickly, and securely prior to applying them? See how an integrated test setup using AWS EC2 Container Service (ECS), AWS AutoScaling Group, and SaltStack simplifies the application of configuration changes and allows you to test configuration changes in parallel to reduce the time spent testing.

What's new in Ansible 2.0

Allan Denot

Kubernetes at Datadog Scale

Docker, Inc.

Ara Pulido, Datadog - Container technologies, although not new, have increased their popularity in the past few years, with container orchestrators allowing companies around the world to adopt these technologies to help them ship and scale microservices with precision and velocity. Kubernetes is currently the most popular container orchestration platform, and while many organizations are migrating their workloads to it, Kubernetes is still relatively immature. New corner cases, errors, and quirks are regularly discovered as users push the boundaries of size and scale. When Datadog adopted Kubernetes we discovered some of these boundaries the hard way, and we continuously challenge and modify our infrastructure decisions in order to fit our use case. Join me in this talk for our story on what we learned while we scaled our Kubernetes clusters, the contributions to Kubernetes we made along the way, and how you can apply those learnings when growing your Kubernetes clusters from a handful to hundreds or thousands of nodes.

Thoughts on consistency models

rogerbodamer

Call me maybe: Jepsen and flaky networks

Shalin Shekhar Mangar

In the big data world, our data stores communicate over an asynchronous, unreliable network to provide a facade of consistency. However, to really understand the guarantees of these systems, we must understand the realities of networks and test our data stores against them. Jepsen is a tool which simulates network partitions in data stores and helps us understand the guarantees of our systems and its failure modes. In this talk, I will help you understand why you should care about network partitions and how can we test datastores against partitions using Jepsen. I will explain what Jepsen is and how it works and the kind of tests it lets you create. We will try to understand the subtleties of distributed consensus, the CAP theorem and demonstrate how different data stores such as MongoDB, Cassandra, Elastic and Solr behave under network partitions. Finally, I will describe the results of the tests I wrote using Jepsen for Apache Solr and discuss the kinds of rare failures which were found by this excellent tool.

What's hot

Deploying Immutable infrastructures with RabbitMQ and Solr

Jordi Llonch

How the OOM Killer Deleted My Namespace

Laurent Bernaille

Evolution of kube-proxy (Brussels, Fosdem 2020)

Laurent Bernaille

10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...

Laurent Bernaille

Self Created Load Balancer for MTA on AWSsharu1204

Ease of use in Apache Solr

Anshum Gupta

Realtime Statistics based on Apache Storm and RocketMQ

Xin Wang

Storm worker redesign

Roshan Naik

Making the most out of kubernetes audit logs

Laurent Bernaille

How to tune Kafka® for production

confluent

Docker and Maestro for fun, development and profit

Maxime Petazzoni

SaltConf14 - Anita Kuno, HP & OpenStack - Using SaltStack for event-driven or...

SaltStack

Service discovery in Docker environments

alexandru giurgiu

Swift container syncOpen Stack

Kubernetes DNS Horror Stories

Laurent Bernaille

Spinnaker - Bay Area AWS Meetup - 20160726

Adam Jordens

Scaling an invoicing SaaS from zero to over 350k customers

Speck&Tech

Integration testing for salt states using aws ec2 container service

SaltStack

What's new in Ansible 2.0

Allan Denot

Kubernetes at Datadog Scale

Docker, Inc.

What's hot (20)

Deploying Immutable infrastructures with RabbitMQ and Solr

How the OOM Killer Deleted My Namespace

Evolution of kube-proxy (Brussels, Fosdem 2020)

10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...

Self Created Load Balancer for MTA on AWS

Ease of use in Apache Solr

Realtime Statistics based on Apache Storm and RocketMQ

Storm worker redesign

Making the most out of kubernetes audit logs

How to tune Kafka® for production

Docker and Maestro for fun, development and profit

SaltConf14 - Anita Kuno, HP & OpenStack - Using SaltStack for event-driven or...

Service discovery in Docker environments

Swift container sync

Kubernetes DNS Horror Stories

Spinnaker - Bay Area AWS Meetup - 20160726

Scaling an invoicing SaaS from zero to over 350k customers

Integration testing for salt states using aws ec2 container service

What's new in Ansible 2.0

Kubernetes at Datadog Scale

Similar to Thinking in Streaming - Twitter Streaming API

Thoughts on consistency models

rogerbodamer

Call me maybe: Jepsen and flaky networks

Shalin Shekhar Mangar

Ben Coverston - The Apache Cassandra Project

Morningstar Tech Talks

Abstract: Cassandra is a new kind of database: it is more than a single-machine system. It naturally runs in a High-Availability configuration. All nodes in the system are symmetric; there is no single point of failure. As you add machines, failure becomes routine, and Cassandra is built to tolerate that with no interruptions. Cassandra is linearly scalable with good performance characteristics for very small and very large data stores. Unlike earlier efforts, Cassandra is more than just a key-value store; it is a structured data store which can facilitate complex use cases and queries. Cassandra allows for random access to your data organized into rows and columns. Cassandra is different, and exciting. This presentation will discuss the pros and cons of using Cassandra, and why it has seen such amazing adoption in the past year. Bio: Ben Coverston is Director of Operations at DataStax (formerly knows as Riptano), a provider of software, support, services, training, resources and help for Cassandra. He has been involved in enterprise software his entire career. Working in the airline industry, he helped to build some of the highest volume online booking sites in the world. He saw first hand the consequences of trying to solve real world scalability problems at the limit of what traditional relational databases are capable of.

Dashboard Mania

Tim Lossen

The Data Mullet: From all SQL to No SQL back to Some SQLDatadog

Generators, Coroutines and Other Brain Unrolling Sweetness. Adi Shavit ➠ Cor...

corehard_by

C++20 brings us coroutines and with them the power to create generators, iterables and ranges. We'll see how coroutines allow for cleaner, more readable, code, easier abstraction and genericity, composition and avoiding callbacks and inversion of control. We'll discuss the pains of writing iterator types with distributed internal state and old-school co-routines. Then we'll look at C++20 coroutines and how easy they are to write clean linear code. Coroutines prevent inversion of control and reduce callback hell. We'll see how they compose and play with Ranges with examples from math, filtering, rasterization. The talk will focus more on co_yield and less on co_await and async related usages.

Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...

Ontico

Существует множество архитектур и способов масштабирования систем. Сегодня многие компании мигрируют в облачные сервисы или используют контейнеры. Но действительно ли это так необходимо и нужно ли следовать трендам? В данном докладе мне бы хотелось рассказать об архитектуре, которую я спланировал и внедрил в компании InnoGames. Архитектура, не требующая вмешательства администратора в случае лавинообразного увеличения нагрузки и, что ещё более важно, умеющая редуцироваться в случае отсутствия её для экономии затрат. Вы узнаете об опыте создания сервиса с очень непростыми критериями и поймёте, что не обязательно платить в 3 раза дороже за AWS или любую подобную систему. - Что такое CRM. Зачем нам этот сервис. - Инфраструктура. -- Graphite. Почему он должен быть надежным и быстрым. -- Puppet + gitlab. -- Балансировка нагрузки. -- Наше облако. Зачем нам openstack, когда есть serveradmin!? Как роль сервера определяется несколькими атрибутами в веб-интерфейсе. -- Nagios + аггрегаторы. Другой взгляд на то, как мониторить сервисы через Graphite. -- Мониторинг кластеров. Clusterhc и Grafsy. -- Brassmonkey. Как мы написали своего сисадмина на python. -- Бэкапы. - Архитектура CRM3. - Autoscaling или как проанализировать кучу данных и принять решения.

Andy Parsons Pivotal June 2011Andy Parsons

When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...

confluent

In the financial industry, losing data is unacceptable. Financial firms are adopting Kafka for their critical applications. Kafka provides the low latency, high throughput, high availability, and scale that these applications require. But can it also provide complete reliability? As a system architect, when asked “Can you guarantee that we will always get every transaction,” you want to be able to say “Yes” with total confidence. In this session, we will go over everything that happens to a message – from producer to consumer, and pinpoint all the places where data can be lost – if you are not careful. You will learn how developers and operation teams can work together to build a bulletproof data pipeline with Kafka. And if you need proof that you built a reliable system – we’ll show you how you can build the system to prove this too.

London devops loggingTomas Doran

FP Days: Down the Clojure Rabbit HoleChristophe Grand

Kubernetes Walk Through from Technical View

Lei (Harry) Zhang

Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...

ScyllaDB

Seek and Destroy Kafka Under Replication

HostedbyConfluent

"It's important that even under load, Apache Kafka ensures user topics are fully replicated in synch. Replication is essential to endure resilience to data loss, so both users and operators care about it. If a topic partition falls out of the ISR (In-Synch-replicas) set, a user experiences unavailability (when producing with the default acknowledgment setting). Users may use non-default acks mode to work around it, but the effect on a Kafka cluster is to make the under-replication worse. Even simple Under replication with no Under Min Isr is to be avoided as a cluster update may cause the dreaded Under Min ISR. There are a number of settings that can be used, from quotas to number of replication threads to more low-level settings. This session wants to show how we successfully measured and evolved our Kafkas configuration, with the goal of giving the best possible user experience (and resilience to their data). Hofstadter's Law applied! ""It always takes longer than you expect, even when you take into account Hofstadter's Law."""

Modern Cryptography

James McGivern

Lagom - Mircoservices "Just Right"

Markus Jura

We designed a new framework, made for Microservices. Making it easier for developers to build microservices-based systems – systems that communicate asynchronously, self-heal, scale elastically and remain responsive no matter what bad stuff is happening. And all this without the pain of selecting and mixing components, from a plethora of libraries that were originally built for other things. In this presentation, we reveal this new way for Java developers to not only understand and begin building microservices, but also to seamlessly push them into staging and production

Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...

Lucidworks

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2kvXlPd This CloudxLab Introduction to Apache ZooKeeper tutorial helps you to understand ZooKeeper in detail. Below are the topics covered in this tutorial: 1) Data Model 2) Znode Types 3) Persistent Znode 4) Sequential Znode 5) Architecture 6) Election & Majority Demo 7) Why Do We Need Majority? 8) Guarantees - Sequential consistency, Atomicity, Single system image, Durability, Timeliness 9) ZooKeeper APIs 10) Watches & Triggers 11) ACLs - Access Control Lists 12) Usecases 13) When Not to Use ZooKeeper

Consul - service discovery and others

Walter Liu

Elegant concurrency

Mosky Liu

Writing concurrent program is hard; maintaining concurrent program even is a nightmare. Actually, a pattern which helps us to write good concurrent code is available, that is, using “channels” to communicate. This talk will share the channel concept with common libraries, like threading and multiprocessing, to make concurrent code elegant. It's the talk at PyCon TW 2017 [1] and PyCon APAC/MY 2017 [2]. [1]: https://tw.pycon.org/2017 [2]: https://pycon.my/pycon-apac-2017-program-schedule/

Similar to Thinking in Streaming - Twitter Streaming API (20)

Thoughts on consistency models

Call me maybe: Jepsen and flaky networks

Ben Coverston - The Apache Cassandra Project

Dashboard Mania

The Data Mullet: From all SQL to No SQL back to Some SQL

Generators, Coroutines and Other Brain Unrolling Sweetness. Adi Shavit ➠ Cor...

Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...

Andy Parsons Pivotal June 2011

When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...

London devops logging

FP Days: Down the Clojure Rabbit Hole

Kubernetes Walk Through from Technical View

Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...

Seek and Destroy Kafka Under Replication

Modern Cryptography

Lagom - Mircoservices "Just Right"

Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...

Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab

Consul - service discovery and others

Elegant concurrency

Recently uploaded

20240609 QFM020 Irresponsible AI Reading List May 2024

Matthew Sinclair

20240605 QFM017 Machine Intelligence Reading List May 2024

Matthew Sinclair

GraphRAG is All You need? LLM & Knowledge Graph

Guy Korland

Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs. 1. Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/abs/2306.08302 2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs: https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

Elizabeth Buie - Older adults: Are we really designing for our future selves?

Nexer Digital

Communications Mining Series - Zero to Hero - Session 1

DianaGray10

This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered: • Communication Mining Overview • Why is it important? • How can it help today’s business and the benefits • Phases in Communication Mining • Demo on Platform overview • Q/A

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024

Neo4j

GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

Neo4j

Dr. Sean Tan, Head of Data Science, Changi Airport Group Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.

The Art of the Pitch: WordPress Relationships and Sales

Laura Byrne

Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes? All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.

A tale of scale & speed: How the US Navy is enabling software delivery from l...

sonjaschweigert1

Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved: - Reduction in onboarding time from 5 weeks to 1 day - Improved developer experience and productivity through actionable findings and reduction of false positives - Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO) Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production. We will cover: - How to remove silos in DevSecOps - How to build efficient development pipeline roles and component templates - How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence) - How to streamline operations with automated policy checks on container images

Microsoft - Power Platform_G.Aspiotis.pdf

Uni Systems S.M.S.A.

Climate Impact of Software Testing at Nordic Testing Days

Kari Kakkonen

Thinking in Streaming - Twitter Streaming API

2. Thinking In Streaming John Kalucki @jkalucki Infrastructure

3. Turtles All The Way Down • Your client ≅ Our server • Gather Events • Parse JSON • Match on Predicates • Route to Consumers

4. Properties • Offered • At Least Once • Roughly Sorted (K-Sorted) • Desired • Exactly Once • Sorted

5. Plan • Over-deliver Ensure At Least Once • De-duplicate Unordered Exactly Once • Sort Ordered Exactly Once

6. Why At Least Once? • Exactly Once impractical across streams • Clients must handle reconnect over-delivery • Reuse this capability • Mask upstream failures • Relax server restart issues

10. Startup • Prefetch from peer to populate circular buffer • Go multi-user • Consume Kestrel backlog - duplicates between: • Buffer and backlog • Previous connection and backlog • Steady State: Exactly Once Delivery

12. Upstream Failure • Cascaded source fails • Fail over to next peer • Over-request to avoid loss • Steady State: Exactly Once Delivery

13. Client Over-delivery • Use Count Parameter after fast reconnect • Deep backfill from REST API • Client offline for a while • User first issues new query • Overlap connections slightly

14. De-Duplication

15. Infinite Streams • De-duplicating a randomly ordered infinite stream requires infinite time and storage • Sorting? Ditto • I have neither infinite time nor storage

16. Roughly Sorted • A sequence α is k-sorted IFF ∀ i, r, 1 ≤ i ≤ r ≤ n, i ≤ r - k implies aᵢ ≤ aᵣ • Strictly sorted is 0-sorted. • Transpose two adjacent values in a 0-sorted sequence, becomes 1-sorted. • K For the firehose?

17. Firehose K (500k IDs) • Noisy

Percentile   Sunday Night   Monday Peak
100.0%       3356           2507
99.99%       650            509
99.90%       232            271
99.00%       143            160
90.00%       36             45
50.00%       5              7
Average      14             17

19. Pessimist’s K • In theory could be hours & millions of events • Practically, if current and stale queues exist: • We’ll flush the stale queues before exposing • You’ll never know this happened • If all queues stale: • We’ll deliver the backlog • K remains reasonable

20. Unordered De-duplication • Create two HashSets: Primary, Secondary, each preallocated to size K • New event is duplicate if ID exists in Primary • Add new ID to both HashSets • When Primary.size > K / 2 Primary.clear Swap Primary & Secondary

21. Unordered De-duplication • Bounded memory consumption • O(n) behavior • Low latency • Emit first tweet • Discard subsequent duplicates • Cheaper than de-duplication by sorting? Probably depends on K

22. Ordered & De-duplicated • Insertion sort and de-duplicate by ID into a decreasing order list • While length > K, remove sorted tail

23. Ordered & De-duplicated • O(n) --- O(n * K) • Bounded memory consumption • Induces latency of K • Assumes average items not very unsorted • K is usually large to handle the outliers

24. Routing Events • By Keyword or by UserId • Add predicates to HashMap • Apply events to Map • Query holds private predicate set for later Map removal • O(n)

25. Reliability • Decompose • Decouple • Monitor

26. Monitoring • What to look at? • Latency • Throughput • Errors • Alerting

27. Horizontal Scale • Firehose keeps Growing. • Eventually Firehose stream will become impractical. • Partition the Firehose into N streams.

Editor's Notes

There is a lot of symmetry in what the Streaming API servers do and what your streaming clients do. In both cases we’re gathering events, parsing them, and farming them out to various consumers. The issues are similar at all processing points in the stream.
We present a stream of events that is roughly sorted by created at time. This means that the events are mostly in created at time order, but not exactly so. We’ve designed our system to publish each event at least once -- which means none are lost, but there may, at times, be duplicates. I’ll discuss why our streams have these properties. Also, you’ll probably want to display or process tweets exactly once -- none missing and none duplicated. You might also want to present them sorted, or you might be OK with a rough sorting. I’ll go over two algorithms for converting what the API offers into the stream that you want.
The basic plan is to over-deliver events and then de-duplicate them to provide an exactly once quality of service. One technique is to just de-duplicate with set logic, the other is to sort and de-duplicate. There are trade-offs with each.
First, let’s see why the Streaming API offers events at least once. It would be nice if we could offer everything transactionally, that is, exactly once. But, it’s impractical to synchronize this state across client reconnections. For example, it’s unlikely that you’ll reconnect to the same server.
Also, event streams aren’t strictly ordered, so we wouldn’t know what to deliver. We’d have to coordinate a large vector of sent events between servers. And, clients would have to transactionally acknowledge all events received. This is quite impractical at scale unless we sorted streams, but this would introduce latency. We’ll see why sorting induces latency later.
Yet, first and foremost, we want a very low latency experience. And, we want a simple programming model for clients. So, we assume that clients can over-request when reconnecting, and post-process to get the required stream properties. Once we make this fundamental assumption, we can reuse it to handle the internal data loss risk as well.
Our Streaming API server is called Hosebird. Hosebird receives events from the rest of the Twitter system through Kestrel message queues. Two Hosebird processes in each cluster read transactionally from Kestrel. The rest of the servers in a cluster cascade via Streaming HTTP.
When a Hosebird server starts, it prefetches events from a peer to pre-populate its circular buffers. These buffers are used to support the count parameter, which allows some historical backfill on streaming queries. Count allows your stream to start back a few minutes, then catch up and transition to real time streaming.
This startup prefetching creates a window where you might see the same event twice, if you are unlucky enough to connect to a very recently restarted server. The backlog read from Kestrel will contain some of the same events that were prefetched into the buffer. The backlog may also have events that you read on your last connection. You might have to suffer through a minute or so of duplicates as the backlog is processed and displaces the prefetched events in the circular buffer. Outside of this restart case, during steady state processing, we deliver each event exactly once on fanout servers.
When a cascaded server has its source Hosebird restart, say during a deploy, the server needs to quickly fail over to another source. A gap in the stream would be introduced during the failure, detection and reconnection window. We cover this gap by requesting some back-fill from the new source. This causes a short period of duplicated events. During steady state processing, however, we deliver each event exactly once on cascaded servers.
Your client should use these same techniques on reconnect. Over-request with the count parameter if the connection was momentarily lost. If the client has been disconnected for an extended period, you’ll have to backfill from the REST API. When you need to make a predicate change, you can create a new connection, wait for the first event to arrive, then disconnect the old connection. This should generally produce an at least once stream.
Let’s talk about de-duplication on your end.
A finite stream looks a lot like a relational database table -- a finite relation. We’re used to thinking about finite relations. But a stream appears as an infinite relation: you can’t ever read to the end. Also, since we want very low latency, we can’t wait to read to the end. We have to present results immediately.
A roughly sorted sequence is mostly sorted, where no element is more than K positions away from its strictly sorted position. At Twitter, we talk about K sorted things all the time. K this, K that. Nothing is strictly ordered. We have relaxed various legs of the CAP theorem to make our distributed system feasible. We’ve never had strictly ordered event processing. Tweets are applied to your timelines in a rough ordering. On the REST API, we sort the vector before we present it to you, but it’s very loose behind the scenes. Likewise, events show up in the Streaming API roughly sorted by created at time.
Here are two samples from the status firehose. I took five hundred thousand status ids, and did an insertion sort into a reverse sorted list. The most recent id at the head, the oldest status at the tail. These distributions show the number of list elements traversed before finding the sorted insertion point. So, the average and median number of hops are pretty small. The hundred percent case, the worst case, shows a much larger K.
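The measurement described above can be approximated with a short script. This is a minimal sketch, not the code behind the slides: the names `insertion_hops` and `summarize` are hypothetical, and the list is kept in ascending order and scanned from the newest end, which counts the same hops as a reverse-sorted list while keeping Python list inserts cheap.

```python
import statistics


def insertion_hops(ids):
    """Return, for each id, how many elements were traversed to place it."""
    ordered = []  # ascending; the newest (largest) ids sit at the end
    hops = []
    for event_id in ids:
        i = len(ordered)
        # Walk back from the newest end until the insertion point is found.
        while i > 0 and ordered[i - 1] > event_id:
            i -= 1
        hops.append(len(ordered) - i)
        ordered.insert(i, event_id)
    return hops


def summarize(hops):
    """Summarize a hop distribution: worst case, median and average."""
    return {
        "max": max(hops),
        "median": statistics.median(hops),
        "average": statistics.mean(hops),
    }


if __name__ == "__main__":
    sample = [1, 3, 2, 5, 4]  # two adjacent transpositions
    print(summarize(insertion_hops(sample)))
```

The maximum hop count over a sample is an empirical estimate of K for that sample only; as the notes say, different samples give different shapes.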
Assuming about 600 events per second on this stream, back when I took this sample, we can see that events show up as much as 5 seconds out of order. Close comparison of the distributions shows that they’re very noisy. If you took many samples, they’d all have a different shape. Having an idea of K helps us tune our de-duplication algorithms.
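The conversion from K to seconds is just K divided by the event rate. A small sketch of that arithmetic, using the worst-case K values from the slide and the 600 events per second quoted above (the function name `disorder_seconds` is hypothetical):

```python
def disorder_seconds(k, events_per_second):
    """Convert a K measured in events into approximate seconds of disorder."""
    return k / events_per_second


if __name__ == "__main__":
    for label, k in [("Sunday Night", 3356), ("Monday Peak", 2507)]:
        print(label, round(disorder_seconds(k, 600), 1), "seconds")
```

At that rate the two worst cases come to roughly 5.6 and 4.2 seconds, in line with the "as much as 5 seconds" figure.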
Daily operational issues cause K to grow beyond 5 seconds now and then. It’s hard to say what a good upper bound for a display client should be. Something around a few minutes would cover most issues we’ve had over the last six months. A long-term storage client might want to assume a K of a few hours or a day or so. In the unlikely event that something goes really wrong with the system, we’ll make a judgement call on recovery. We’ll probably bias towards delivering the backlog, but, if there’s a partial failure, we’ll keep your K in mind.
Now that we have a handle on K we can think about de-duplication. An infinite, but roughly sorted, stream can be de-duplicated with some set logic. The key is efficiently aging out irrelevant set members. One way is to keep two hashes, and alternately clear them. You don’t have to do any fancy tracking of items, and off the shelf HashSets will work just fine. The union of the two sets contains at least K items and allows de-duplication of a K sorted sequence.
Given the Firehose K, you don’t even need all that much space to de-duplicate. Please don’t resort to using MySQL primary keys to de-duplicate streams. It’s unnecessary. The nice thing here is that we can emit events as they arrive and throw away late arriving dups. We don’t need to add any latency.
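The two-hash idea can be sketched as follows. This is one possible reading, not Hosebird's code: the class name `RollingDeduper` is hypothetical, and the rotation is triggered when the younger set reaches K ids (the slide triggers on the size of Primary), an assumption chosen so that the set consulted for membership always covers at least the last K distinct ids.

```python
class RollingDeduper:
    """Unordered de-duplication of a K-sorted stream in bounded memory."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.primary = set()    # older generation, consulted for membership
        self.secondary = set()  # younger generation, promoted at rotation

    def is_new(self, event_id):
        """Return True the first time an id is seen within the window."""
        if event_id in self.primary:
            return False  # late-arriving duplicate: discard
        self.primary.add(event_id)
        self.secondary.add(event_id)
        if len(self.secondary) >= self.k:
            # Age out old ids: the younger set becomes the membership set.
            self.primary = self.secondary
            self.secondary = set()
        return True


if __name__ == "__main__":
    deduper = RollingDeduper(k=4)
    stream = [1, 2, 3, 2, 4, 1, 5, 6, 5]
    print([i for i in stream if deduper.is_new(i)])
```

Events are emitted as soon as they arrive, so no latency is added. A duplicate that shows up more than K distinct ids after its original may slip through, which is why K should be sized for the outliers.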
On the other hand, if we want a sorted and de-duplicated stream, we have to do a little more work. Given the Firehose K distribution, doing an insertion sort isn’t the worst thing. Most events don’t need to traverse too deeply into the list. Elements dequeued from the tail of the list are sorted and de-duplicated.
This algorithm does, however, introduce a latency of K. We can’t emit a sorted event unless we have at least K elements to examine. Still, this is quite practical to do in memory. You can plow through a lot of ids per second even in a scripting language like Ruby.
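A minimal sketch of the insertion-sort approach, assuming ids increase roughly with creation time (as the status ids in the sample above do). The class name `SortingDeduper` and its methods are hypothetical, not part of any Twitter library.

```python
class SortingDeduper:
    """Sort and de-duplicate a K-sorted stream, at the cost of K events of latency."""

    def __init__(self, k):
        self.k = k
        self.window = []  # decreasing order: newest id at the head

    def push(self, event_id):
        """Insert an id; return the ids released from the tail, oldest first."""
        i = 0
        while i < len(self.window) and self.window[i] > event_id:
            i += 1
        if i < len(self.window) and self.window[i] == event_id:
            return []  # duplicate still inside the window: discard
        self.window.insert(i, event_id)
        released = []
        while len(self.window) > self.k:
            released.append(self.window.pop())  # the tail is the oldest id
        return released

    def flush(self):
        """Release everything still buffered, oldest first."""
        released = self.window[::-1]
        self.window = []
        return released


if __name__ == "__main__":
    sorter = SortingDeduper(k=3)
    out = []
    for event_id in [1, 3, 2, 3, 5, 4, 6, 7]:
        out.extend(sorter.push(event_id))
    out.extend(sorter.flush())
    print(out)
```

An id that arrives more than K positions late, or a duplicate of an id that has already left the window, would be released out of order or twice. The sketch relies on K being large enough to cover the outliers.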
Now that we have a de-duplicated stream, we need to route it to consumers. This can be done very cheaply by registering every consumer’s predicates in a HashMap. If, say, you are displaying columns of search results, like TweetDeck, you can have each column register its keywords in the HashMap. Each new event is applied to the HashMap, and routed to all consumers easily. Duplicates can arise, as a given column may have several OR predicates that match. Hosebird uses a generational de-duplication scheme to solve this. This scheme is the degenerate case of the sorted algorithm above. Each client stream maintains just the primary key of the last event. If the same id is presented twice in a row, it can be discarded.
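The routing step might look like the sketch below. It is an illustration under stated assumptions, not Hosebird's implementation: the `Router` class, the consumer names, and the lower-cased whitespace tokenization are all hypothetical. Each consumer keeps its own predicate set for later removal, and remembers only the last id it received so that an event matching several of its keywords is delivered once.

```python
from collections import defaultdict


class Router:
    """Route events to consumers by keyword through a single hash map."""

    def __init__(self):
        self.by_keyword = defaultdict(set)  # keyword -> consumer names
        self.predicates = defaultdict(set)  # consumer -> its own keywords
        self.last_id = {}                   # consumer -> last id delivered

    def register(self, consumer, keywords):
        for keyword in keywords:
            keyword = keyword.lower()
            self.by_keyword[keyword].add(consumer)
            self.predicates[consumer].add(keyword)

    def unregister(self, consumer):
        # The private predicate set tells us exactly which entries to remove.
        for keyword in self.predicates.pop(consumer, set()):
            self.by_keyword[keyword].discard(consumer)
            if not self.by_keyword[keyword]:
                del self.by_keyword[keyword]
        self.last_id.pop(consumer, None)

    def route(self, event_id, text):
        """Return the consumers this event is delivered to, each at most once."""
        delivered = []
        for word in text.lower().split():
            for consumer in self.by_keyword.get(word, ()):
                if self.last_id.get(consumer) == event_id:
                    continue  # same id twice in a row: discard
                self.last_id[consumer] = event_id
                delivered.append(consumer)
        return delivered


if __name__ == "__main__":
    router = Router()
    router.register("column-1", ["python", "ruby"])
    router.register("column-2", ["ruby"])
    print(sorted(router.route(1, "Ruby and Python")))
```

Work per event is proportional to the number of tokens in the event plus the number of matching consumers, not to the total number of registered predicates.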
Break things up into components. Host components in separate processes. Measure what happens between components. Use (reliable) queues between components.
