Apache Kafka is an open-source message broker developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Kafka Connect is a framework that connects Kafka with external systems, moving data into and out of Kafka. It makes it simple to reuse existing connector configurations for common source and sink connectors.
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo | Joe Stein
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a de-coupled, real-time distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result will be inline with the initiating call. The architecture gains are immense. They allow for the requesting system to receive a response without the need for direct integration with the data pipeline(s) that messages must go through. By utilizing Apache Kafka and Apache Accumulo, these gains sustain at scale and allow for complex operations of different messages to be applied to each response in real-time.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay | Altinity Ltd
LIVE WEBINAR: October 21, 2021 | 10 am PT
SPEAKERS: Jun Li, Principal Architect, eBay & Robert Hodges, CEO, Altinity
eBay depends on Kafka to solve the impedance mismatch between rapidly arriving messages in event streams and efficient block inserts into ClickHouse clusters. Naïve loading procedures from Kafka to ClickHouse generate non-deterministic blocks, which can lead to data loss and incorrect results in applications. The eBay team solved this problem with a block aggregator that uses Kafka to store message processing metadata, together with ClickHouse deduplication, to ensure that blocks are loaded into ClickHouse exactly once. The block aggregator allows eBay to support a sharded ClickHouse architecture across multiple data centers that can tolerate failures in any individual part of the system. Join us to learn how eBay developed this unique architecture and how they use it to deliver low-latency analytics to users.
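The idea described above, deterministic blocks plus deduplication on the receiving side, can be illustrated with a small in-memory sketch. This is a toy model under stated assumptions, not eBay's block aggregator and not the ClickHouse API: `form_block`, `DedupSink`, and the sample log are hypothetical names invented for illustration.

```python
import hashlib
from typing import List, Sequence, Set, Tuple


def form_block(log: Sequence[str], start: int, end: int) -> Tuple[str, ...]:
    """Form a block from a fixed offset range [start, end) of the log."""
    return tuple(log[start:end])


class DedupSink:
    """In-memory stand-in for a store that drops blocks it has already seen."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.rows: List[str] = []

    def insert(self, block: Tuple[str, ...]) -> bool:
        # Identical blocks hash to the same digest, so a replay is detected.
        digest = hashlib.sha256("\n".join(block).encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False  # duplicate block: ignored
        self.seen.add(digest)
        self.rows.extend(block)
        return True


log = ["m0", "m1", "m2", "m3", "m4"]
sink = DedupSink()

# The offset range is recorded as metadata before the block is loaded.
planned_range = (0, 3)
first = sink.insert(form_block(log, *planned_range))

# Simulated crash before the load is acknowledged: on restart the same
# recorded range is replayed, which produces an identical block.
retry = sink.insert(form_block(log, *planned_range))

print(first, retry, sink.rows)  # True False ['m0', 'm1', 'm2']
```

Because the replayed block is formed from the same recorded offset range, it is byte-for-byte identical to the first attempt, and the sink can discard it safely. A block cut at a different boundary would not be recognised as a duplicate, which is why non-deterministic blocks are a problem.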
Deploying Apache Flume to enable low-latency analytics | DataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, "how can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combination streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Real-time streaming and data pipelines with Apache Kafka | Joe Stein
Get up and running quickly with Apache Kafka http://kafka.apache.org/
* Fast * A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
* Scalable * Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
* Durable * Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
* Distributed by Design * Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
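As a rough illustration of how a data stream can be partitioned by key, here is a minimal sketch. It assumes a CRC32 hash purely for illustration (the Kafka Java client's default partitioner uses a murmur2 hash of the key), and `partition_for` and the sample keys are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition; CRC32 is a stand-in hash."""
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 6
keys = [b"user-1", b"user-2", b"user-3", b"user-1", b"user-2"]

# Group messages by the partition their key maps to.
by_partition: Dict[int, List[bytes]] = defaultdict(list)
for key in keys:
    by_partition[partition_for(key, NUM_PARTITIONS)].append(key)

for partition in sorted(by_partition):
    print(partition, by_partition[partition])
```

Messages with the same key always land in the same partition, so per-key ordering is preserved while different partitions can live on different machines and be read by different consumers.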
Kafka Needs no Keeper (Jason Gustafson & Colin McCabe, Confluent) Kafka Summi... | confluent
We have been served well by Zookeeper over the years, but it is time for Kafka to stand on its own. This is a talk on the ongoing effort to replace the use of Zookeeper in Kafka: why we want to do it and how it will work. We will discuss the limitations we have found and how Kafka benefits both in terms of stability and scalability by bringing consensus in house. This effort will not be completed overnight, but we will discuss our progress, what work is remaining, and how contributors can help. (Note that I am proposing this as a joint talk with Colin McCabe, who is also a committer on the Apache Kafka project.)
Large scale near real-time log indexing with Flume and SolrCloud | DataWorks Summit
Apache Flume’s extensible architecture allows Cisco to stream system and application logs from worldwide production data centers to a central Hadoop cluster and Solr. This architecture enables a new level of scalable indexing so that a larger volume of logs is searchable within seconds. Using Solr 4.0’s near real time features together with Hadoop, we can execute mission critical tasks much quicker, improving our ability to meet tight SLAs. At the same time, using the same infrastructure, we can perform large-scale historical analysis and pattern extraction to help further improve our services. This talk will explore our infrastructure and decisions we’ve made to meet key requirements, i.e. high indexing load, high availability and disaster recovery. We will further explore other uses of Flume and SolrCloud within Cisco including dynamic event routing, parsing and multi-tenancy.
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ... | StreamNative
We will introduce HerdDB, a distributed database written in Java.
We will see how a distributed database can be built using Apache BookKeeper as a write-ahead commit log.
Large scale log pipeline using Apache Pulsar_Nozomi | StreamNative
Yahoo Japan Corporation has been using Apache Pulsar as a centralized pub-sub messaging platform for more than 3 years.
We adopted Pulsar because of its great performance, scalability and multi-tenancy capability.
It plays an important role in providing our 100+ services in areas such as e-commerce, media, advertising and more.
Recently, we set out to address a new use case: a large-scale log pipeline.
In our production environment, we are starting to run a lot of our services on container environments.
Our goal is to send all logs and metrics from application containers to various monitoring or analyzing platforms.
We expect Pulsar to keep its performance even in tremendously high traffic volume situations (i.e. in tens of Gbps).
In this presentation, we will talk about our architecture design, the producer/consumer side implementation and the results of our performance tests.
We will also share our experience and knowledge from our production environment operations for more than 3 years.
Takeaway:
- Practical use case of Apache Pulsar in production
- Knowledge of operating Apache Pulsar for large-scale data streams
Kafka Multi-Tenancy—160 Billion Daily Messages on One Shared Cluster at LINE confluent
(Yuto Kawamura, LINE Corporation) Kafka Summit SF 2018
LINE is a messaging service with 160+ million active users. Last year I talked about how we operate our Kafka cluster, which receives more than 160 billion messages daily, and how we dealt with performance problems to meet our tight requirements. Since then we have deployed three more clusters, each for a different purpose, such as one in a different datacenter and one for security-sensitive usage, while keeping the fundamental concept: one cluster for everyone to use. Letting many projects share a few multi-tenant clusters greatly reduces our operational cost and lets us concentrate our engineering resources on maximizing their reliability, but hosting topics with many different kinds of workload has also brought a lot of challenges.
In this talk I will introduce how we operate Kafka clusters shared among different services and how we solved the troubles we met in order to maximize their reliability. One of the most critical issues we solved, delayed consumer Fetch requests blocking a broker's network threads, should be especially interesting because it can degrade the overall performance of brokers in a very common situation. We solved it using advanced techniques such as dynamic tracing and a tricky patch that controls in-kernel behavior from Java code.
Developing Realtime Data Pipelines With Apache Kafka | Joe Stein
Developing Realtime Data Pipelines With Apache Kafka. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Streaming in Practice - Putting Apache Kafka in Production | confluent
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages | LINE Corporation
Yuto Kawamura
LINE / Z Part Team
At LINE we've been operating Apache Kafka as a company-wide shared data pipeline that services use for storing and distributing data.
Kafka underlies many of our services in some way: not only the messaging service but also AD, Blockchain, Pay, Timeline, Cryptocurrency trading and more.
Many services feed large amounts of data into our cluster, amounting to over 250 billion messages daily and 3.5 GB of incoming data per second, one of the largest scales in the world.
At the same time, it is required to be stable and performant all the time because many important services use it as a backend.
In this talk I will give an overview of Kafka usage at LINE and how we operate it.
I'm also going to talk about some of the engineering work we did to maximize its performance and to solve troubles caused particularly by hosting huge amounts of data from many services, leveraging advanced techniques like kernel-level dynamic tracing.
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop | Databricks
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bring execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
Presentation from the Kafka meetup on 13-SEP-2013, including some notes to clarify some slides. Enjoy.
Avi Levi
123avi@gmail.com
https://www.linkedin.com/in/leviavi/
Big Data Streams Architectures. Why? What? How? | Anton Nazaruk
With the current zoo of technologies and the many ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency Big Data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are attracting more and more attention compared with the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to work in synergy in the Big Data world.
Real time Analytics with Apache Kafka and Apache Spark | Rahul Jain
A presentation and workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, Zookeeper and Spark with a web clickstream example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra | DataStax Academy
A few years ago, processing large volumes of data was an exclusive problem of big companies. Nowadays, technological advancement allows people to be connected with each other all the time, generating and consuming large amounts of data.
In the challenge to follow Movile's exponential growth and increasing volume of information, we soon realized that traditional relational database and data analysis solutions were no longer a good fit to solve new order issues. Therefore, we present Movile's 'Change Of Seasons', a use case on adopting Apache Cassandra as a solution for critical high-performance distributed systems.
Cassandra Summit 2015 - A Change of Seasons | Eiti Kimura
A CHANGE OF SEASONS: A big move to Apache Cassandra!
This is an extended version of the material presented at Cassandra Summit 2015 - Santa Clara - California - USA.
In this presentation I will show you 3 moves, use cases, that constitute our Big Move to Apache Cassandra @Movile.
Walking through relational model to NoSQL solution, hybrid platforms and a staggering cost reduction and throughput increase.
Apache Kafka - Scalable Message-Processing and more ! | Guido Schmutz
Independent of the source of data, the integration of event streams into an Enterprise Architecture is becoming more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data / information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle Stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Stream Processing with Apache Kafka and .NET | confluent
Presentation from South Bay.NET meetup on 3/30.
Speaker: Matt Howlett, Software Engineer at Confluent
Apache Kafka is a scalable streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix, Walmart, Airbnb, Goldman Sachs and LinkedIn. In this talk Matt will give a technical overview of Kafka, discuss some typical use cases (from surge pricing to fraud detection to web analytics) and show you how to use Kafka from within your C#/.NET applications.
Neuro-symbolic is not enough, we need neuro-*semantic* | Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality | Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... | James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... | Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Key Trends Shaping the Future of Infrastructure.pdf | Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... | UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithy | g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Elevating Tactical DDD Patterns Through Object Calisthenics | Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
2. Speaker introduction
— Yuto Kawamura
— Senior Software Engineer at LINE
— Leading project to redesign microservices architecture w/ Kafka
— Apache Kafka Contributor
— Speaker at Kafka Summit SF 2017
— Also at Kafka Meetup #3
3. Outline
— Kafka at LINE as of today (2018.04)
— Challenges on multitenancy
— Engineering for achieving multitenancy
5. We have more clusters
— Added more clusters since last year to support:
— Different DCs
— Security-sensitive data w/ SASL+TLS
— They are separated by "purpose" but not by "user"; this is our multitenancy strategy
— Fewer clusters allow us to concentrate our engineering resources on maximizing their performance
— They're conceptually the "Data Hub" too
6. One cluster has many users
— Topics: 100 ~ 400+ per cluster
— Users: a few ~ tens per cluster
— Messages: 150 billion messages / day in the largest cluster
— 3+ million / sec at peak
— No message is supposed to be lost, because every use case is related to a service in some way
8. To do multitenancy, we have to ensure:
— A certain level of isolation among client workloads
— The cluster is proof against abusive clients
— We can track which client is sending a particular request
— We have to be confident enough in what we do to say "don't worry" to people saying "we want a dedicated cluster only for us!"
10. Request Quota
— It's more important to manage the number of requests than the incoming/outgoing byte rate
— Kafka is amazingly strong at handling large data if it is well-batched
— => For consumers, responses are naturally batched
— => The main danger lies in producers configured with linger.ms=0
— Starting from 0.11.0.0, KIP-124 lets us configure a request rate quota [2]
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
11. Request Quota
— Manage the master copy of cluster config as YAML inside an Ansible repository
— Apply it all at once during cluster provisioning with the kafka_config Ansible module (developed internally)
— Can tell the latest config on a cluster w/o querying the cluster, and can keep change history in git
---
kafka_cluster_configs:
  - entity_type: clients
    configs:
      request_percentage: 40
      producer_byte_rate: 1073741824
  - entity_type: clients
    entity_name: foobar-producer
    configs:
      request_percentage: 200
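The internal kafka_config module is not shown in the slides. As a rough illustration of what applying one entry amounts to, the sketch below maps an entry onto arguments for the stock kafka-configs.sh tool. The class name QuotaCommand, the method toArgs, and the ZooKeeper address are hypothetical. It also assumes 0.11-era tooling, where quotas are altered through --zookeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: NOT the internal kafka_config module, only an illustration
// of how one YAML entry could translate into a kafka-configs.sh invocation.
class QuotaCommand {
    // entityName == null means "default for every client id" (--entity-default).
    static List<String> toArgs(String zookeeper, String entityType,
                               String entityName, Map<String, Long> configs) {
        List<String> args = new ArrayList<>();
        args.add("kafka-configs.sh");
        args.add("--zookeeper");
        args.add(zookeeper);
        args.add("--alter");
        args.add("--entity-type");
        args.add(entityType);
        if (entityName == null) {
            args.add("--entity-default");
        } else {
            args.add("--entity-name");
            args.add(entityName);
        }
        // Sort the keys so the generated command is deterministic.
        StringBuilder joined = new StringBuilder();
        for (Map.Entry<String, Long> e : new TreeMap<>(configs).entrySet()) {
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(e.getKey()).append('=').append(e.getValue());
        }
        args.add("--add-config");
        args.add(joined.toString());
        return args;
    }

    public static void main(String[] unused) {
        Map<String, Long> override = new TreeMap<>();
        override.put("request_percentage", 200L);
        // zk.example:2181 is a placeholder address.
        System.out.println(String.join(" ",
                toArgs("zk.example:2181", "clients", "foobar-producer", override)));
    }
}
```

The first YAML entry has no entity_name, so in this sketch it would be passed with a null name and become the cluster-wide default for clients.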
12. Slowlog
— Log requests which took longer than a certain threshold to process
— Kafka has "request logging" but it produces far too many lines
— Inspired by HBase's
# RequestChannel.scala#updateRequestMetrics
+ slowLogThresholdMap.get(metricNames.head).filter(_ >= 0).filter { v =>
+   val targetTime = requestId match {
+     case ApiKeys.FETCH.id => totalTime - apiRemoteTime
+     case _ => totalTime
+   }
+
+   targetTime >= v
+ }.foreach { _ =>
+   requestLogger.warn("Slow response:%s from connection %s;totalTime:%d...
+     .format(requestDesc(true), connectionId, totalTime, requestQueueTime...
+ }
[2016-12-26 16:04:20,135] WARN Slow response:Name: FetchRequest;
Version: 2 ... ;totalTime:1817;localTime: ...
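The threshold check in the patch above can be modelled in isolation. The following is a simplified standalone sketch, not the broker code: SlowLogFilter and its method names are made up for illustration, and API names are plain strings instead of ApiKeys ids. It keeps the two rules visible in the diff. A negative threshold disables logging, and for Fetch the remote time is subtracted before comparing, presumably because a Fetch can legitimately wait on the broker for new data.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, standalone model of the slow-log check; names are illustrative,
// not the broker's actual classes.
class SlowLogFilter {
    private final Map<String, Long> thresholdMsByApi = new HashMap<>();

    void setThreshold(String api, long thresholdMs) {
        thresholdMsByApi.put(api, thresholdMs);
    }

    // Returns true when the request should be written to the slow log.
    boolean isSlow(String api, long totalTimeMs, long remoteTimeMs) {
        Long threshold = thresholdMsByApi.get(api);
        if (threshold == null || threshold < 0) {
            return false; // no threshold configured, or logging disabled for this API
        }
        // For Fetch, time spent waiting remotely is excluded before comparing.
        long targetTime = "Fetch".equals(api) ? totalTimeMs - remoteTimeMs : totalTimeMs;
        return targetTime >= threshold;
    }
}
```

With a Fetch threshold of 500 ms, a request with totalTime 600 and remote time 500 is not logged, while one with totalTime 1817 and the same remote time is.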
14. The disk read by delayed consumer problem
— Detection: 99th %ile Produce response time became 50x ~ 100x slower
— A certain amount of disk read was occurring
— Network threads' utilization was very high
15. Suspecting sendfile is taking long...
— Because: 1. disk read was occurring at that time, 2. network threads' utilization was high
$ stap -e '(script counting sendfile(2) duration histogram)'
value |---------------------------------------- count
0 | 0
1 | 71
2 |@@@ 6171
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472
32 |@@@ 3418
2048 | 0
...
8192 | 3
— Normal: 2 ~ 32us
— Outliers: 8ms or longer
— (About SystemTap, see my previous presentation [3])
[3] https://www.slideshare.net/kawamuray/kafka-meetup-jp-3-engineering-apache-kafka-at-line
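The histogram above groups durations into power-of-two buckets. The SystemTap script itself is not in the slides, so the sketch below only shows the bucketing idea in plain Java, assuming each bucket is labelled by its lower bound (so 20us lands in bucket 16). Log2Histogram is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Power-of-two histogram: each value is counted under the largest power of two
// that does not exceed it. An illustration only, not the SystemTap script.
class Log2Histogram {
    private final TreeMap<Long, Long> counts = new TreeMap<>();

    void record(long micros) {
        long bucket = micros <= 0 ? 0 : Long.highestOneBit(micros);
        counts.merge(bucket, 1L, Long::sum);
    }

    long count(long bucket) {
        return counts.getOrDefault(bucket, 0L);
    }

    // One line per non-empty bucket, smallest bucket first.
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            sb.append(String.format("%6d | %d%n", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }
}
```

Under this bucketing, a typical 2 ~ 32us call and an 8ms outlier end up twelve or more buckets apart, which is why the outliers stand out even at a count of 3.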
16. Kafka broker's thread model
— Network threads (controlled by num.network.threads) handle reading requests and writing responses
— Network threads hold established connections exclusively for event-driven IO
— Request handler threads (controlled by num.io.threads) handle request processing and IO against block devices, except for sendfile(2) for Fetch requests
17. When a Fetch request arrives for data that is not present in the page cache...
18. Problem definition
— Network threads contain potentially blocking operations even though they are supposed to work as an event loop
— and we have no way to know whether an upcoming sendfile(2) will block waiting for a disk read
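Why one blocking call hurts every client on the same network thread can be shown with a little arithmetic. The sketch below is a deliberately simplified model, not broker code: a single thread sends queued responses one after another, and each response finishes only after all earlier ones. EventLoopModel is a hypothetical name. The 16us and 8192us figures come from the histogram buckets shown earlier.

```java
// Toy model of one network thread draining its response queue sequentially.
class EventLoopModel {
    // sendMicros[i]: time the thread spends sending the i-th queued response.
    // Returns the time at which each response completes, counted from the
    // moment the thread starts on the first one.
    static long[] completionTimes(long[] sendMicros) {
        long[] done = new long[sendMicros.length];
        long clock = 0;
        for (int i = 0; i < sendMicros.length; i++) {
            clock += sendMicros[i];
            done[i] = clock;
        }
        return done;
    }

    public static void main(String[] unused) {
        // Second response hits a cold page and blocks for about 8ms.
        long[] done = completionTimes(new long[] {16, 8192, 16, 16});
        for (long t : done) {
            System.out.println(t + "us");
        }
    }
}
```

In this model the responses queued behind the blocked one complete at 8224us and 8240us instead of 48us and 64us, even though their own data was already in memory.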
19. It was one of the worst issues we had because it:
— Completely breaks resource isolation among all clients, including producers
— Occurs naturally when any one consumer slows down
— Requires us to contact users every time to ask for a fix
— Occurs 100% of the time when a broker restores log data from the leader
20. Solution candidates
— A: Separate network threads among clients
— => Possible, but a lot of changes required
— => Not an essential fix, because network threads should be purely computation-bound
— B: Balance connections among network threads
— => Possible, but again a lot of changes
— => Other connections will still be affected at first
— C: Make sure that the data is ready in memory before the response is passed to the network thread
21. To make sure sendfile(2) is non-blocking in network threads...
— The target data must be available in the page cache
22. How?
NAME
    sendfile - transfer data between file descriptors
SYNOPSIS
    #include <sys/sendfile.h>
    ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
— sendfile(2) on Linux doesn't accept flags for controlling its behavior
— Interestingly, FreeBSD has such flags, contributed by nginx and Netflix [1]
[1] https://www.nginx.com/blog/nginx-and-netflix-contribute-new-sendfile2-to-freebsd/
23. So we have to:
1. Pre-read data that is not in the page cache from disk,
2. and confirm the pages are present before passing the response to network threads
24. sendfile(2) to the dest /dev/null
— Calling channel.transferTo("/dev/null") (== sendfile(/dev/null)) in the request handler thread might populate the page cache?
— Tested it, and found no noticeable performance impact
25. How could it be that harmless?
— The Linux kernel internally uses splice to implement sendfile(2)
— splice asks struct file_operations to handle the splice
— struct file_operations null_fops just iterates over the list of page pointers, not over each byte
— => Iteration count is SIZE / PAGE_SIZE(4k)
# ./drivers/char/mem.c
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
                        struct splice_desc *sd)
{
    return sd->len;
}
static ssize_t splice_write_null(struct pipe_inode_info *pipe, struct file *out,
                                 loff_t *ppos, size_t len, unsigned int flags)
{
    return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_null);
}
26. Patching broker to call sendfile(/dev/null) in request handler threads
# FileRecords.java
@SuppressWarnings("UnnecessaryFullyQualifiedName")
private static final java.nio.file.Path DEVNULL_PATH = new File("/dev/null").toPath();
public void prepareForRead() throws IOException {
    long size = Math.min(channel.size(), end) - start;
    try (FileChannel devnullChannel = FileChannel.open(DEVNULL_PATH,
            java.nio.file.StandardOpenOption.WRITE)) {
        channel.transferTo(start, size, devnullChannel);
    }
}
— Still not fully portable because it relies on an implementation detail of the underlying kernel (so we haven't contributed it...)
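The same trick can be tried outside the broker. The sketch below is a standalone variant of the patch above, under a few assumptions: PageCacheWarmup and warmUp are hypothetical names, /dev/null must exist (Linux or another Unix), and which syscall the JVM uses for transferTo depends on the JDK version and platform, so it may not be sendfile(2) everywhere. Unlike the patch, it loops, because transferTo is allowed to move fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Standalone sketch of warming the page cache by transferring a file range
// to /dev/null. Names are illustrative; this is not the broker patch itself.
class PageCacheWarmup {
    private static final Path DEVNULL = Paths.get("/dev/null");

    // Transfers bytes [start, end) of the file to /dev/null and returns how
    // many bytes were moved. The range is clipped to the current file size.
    static long warmUp(Path file, long start, long end) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(DEVNULL, StandardOpenOption.WRITE)) {
            long limit = Math.min(in.size(), end);
            long position = start;
            while (position < limit) {
                // transferTo may move fewer bytes than requested, so keep going.
                long moved = in.transferTo(position, limit - position, out);
                if (moved <= 0) {
                    break;
                }
                position += moved;
            }
            return Math.max(0, position - start);
        }
    }
}
```

Whether the pages actually stay resident afterwards is up to the kernel, so this is best treated as a hint rather than a guarantee.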
27. ... and more to minimize the impact of the extra syscalls...
# Log.scala#read
@@ -585,6 +586,17 @@ class Log(@volatile var dir: File,
       if(fetchInfo == null) {
         entry = segments.higherEntry(entry.getKey)
       } else {
+        // For last entries we assume that it is hot enough to still have all data in page cache.
+        // Most of fetch requests are fetching from the tail of the log, so this optimization
+        // should save call of readahead() + mmap() + mincore() * N significantly.
+        if (!isLastEntry && fetchInfo.records.isInstanceOf[FileRecords]) {
+          try {
+            info("Prepare Read for " + fetchInfo.records.asInstanceOf[FileRecords].file().getPath)
+            fetchInfo.records.asInstanceOf[FileRecords].prepareForRead()
+          } catch {
+            case e: Throwable => warn("failed to prepare cache for read", e)
+          }
+        }
         return fetchInfo
       }
— Perform cache warmup only if the segment being read IS NOT the latest
— => can save unnecessary syscalls for 99% of Fetch requests
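The "skip the latest segment" rule can be sketched separately from the broker. In the hypothetical model below, a sorted map from base offset to segment name stands in for the broker's segment map. WarmupPolicy and needsWarmup are illustrative names, and the diff does not show how isLastEntry is actually computed.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Decides whether a fetch at the given offset should trigger a cache warmup.
// Illustrative model only: segments maps base offset -> segment file name.
class WarmupPolicy {
    static boolean needsWarmup(ConcurrentSkipListMap<Long, String> segments, long fetchOffset) {
        // The segment holding the offset is the one with the greatest base offset <= fetchOffset.
        Long base = segments.floorKey(fetchOffset);
        if (base == null) {
            return false; // no segments, or the offset is below the first segment
        }
        // The latest (active) segment is assumed to be hot in the page cache already.
        return !base.equals(segments.lastKey());
    }
}
```

With segments starting at offsets 0, 1000, and 2000, a fetch at offset 1999 triggers a warmup and a fetch at 2000 does not.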
29. Conclusion
— Having fewer clusters enables us to concentrate on reliability engineering and essential troubleshooting/fixes
— Preventive engineering enables us to keep operating Kafka clusters at the highest reliability even under high and inexplicable load
— We've had some failures in the development cluster, but never in the production cluster
— The important thing in operating on-premise multitenancy: it is not necessary to prevent 100% of failures, but never let the same hole be punched again