Transaction processing systems are generally considered easier to scale than data warehouses. Relational databases were designed for this type of workload, and there are no esoteric hardware requirements. Mostly, it is just a matter of normalizing to the right degree and getting the indexes right. The major challenge in these systems is their extreme concurrency, which means that small temporary slowdowns can escalate into major issues very quickly.
In this presentation, Gwen Shapira will explain how application developers and DBAs can work together to build a scalable and stable OLTP system, using application queues, connection pools, and strategic use of caches at different layers of the system.
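The connection-pool idea mentioned in the abstract can be sketched in a few lines (an illustration of the general technique, not code from the talk; the pool size and factory are made up):

```python
import queue

class ConnectionPool:
    """Hand a fixed set of connections out to many threads.

    Capping the pool size turns a flood of concurrent requests into
    bounded waiting, so a brief slowdown queues up instead of cascading.
    """

    def __init__(self, connect, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=2.0):
        # Block briefly; failing fast beats piling up waiting threads.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Usage with a stand-in "connection" factory:
pool = ConnectionPool(lambda: object(), size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()  # reuses the released connection
```

The short `acquire` timeout is the point: under a temporary slowdown, callers get a quick error they can handle instead of silently joining an ever-growing backlog.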
As the shift to cloud native spreads, the microservices architecture, which breaks an application into minimal, mutually independent components, has been gaining attention.
MSA makes applications easy to scale and shortens the time to market for new features,
but as the application grows and multiple instances of the same service run concurrently, communication between services becomes complex.
A service mesh is a technology born to address these MSA traffic problems:
a networking model focused on managing network traffic between services.
By recording how smoothly different applications interact, it can optimize communication and prevent downtime as applications scale.
This talk introduces the background and features of service meshes, along with the service mesh solutions currently available as open source.
Step 1. Cloud Native Trail Map
Step 2. Service Proxy, Discovery, & Mesh
Step 3. Service Mesh solutions
Step 4. Service Mesh in action - Istio / linkerd
Step 5. Multi-cluster (linkerd)
[AWS re:Invent 2022 re:Cap for Financial Customers] 3. AWS re:Invent 2022 Technical Highlights (AWS Korea Financial Services Team)
AWS re:Invent 2022 Technical Highlights: the innovation continues.
This session summarizes the major services announced at AWS re:Invent 2022 that are especially useful in the financial sector. In a rapidly changing market, continuous innovation is more important than ever. The session covers the technical details of the IT innovation that AWS is driving.
송규호, Solutions Architect, AWS
From cache to in-memory data grid. Introduction to Hazelcast. (Taras Matyashovsky)
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or a promotion of Hazelcast as the best solution
Efficient data access is key to a high-performing application. Amazon Web Services provides several database options to support modern data-driven apps, along with software frameworks that make developing against them easy. We look at the design of a modern serverless web app using Amazon DynamoDB, the DynamoDB Mapper, AWS Lambda, Amazon API Gateway, and the SDKs, and tackle the move from relational to NoSQL data models.
Speaker: Clayton Brown, Solutions Architect, Amazon Web Services
Batch computing is a common way to run a series of programs, called batch jobs, on a large pool of shared compute resources, such as servers, virtual machines, and containers. But running batch workloads at scale is a challenging task: configuring and scaling a cluster of virtual machines to process complex batch jobs is difficult and resource intensive. In this session, we’ll discuss options and best practices for running batch jobs on AWS, including AWS Batch, a fully managed batch-processing service, and building batch-processing architectures with Amazon EC2 Container Service. We’ll also discuss best practices for ensuring efficient and opportunistic scheduling, fine-grained monitoring, compute resource auto-scaling, and security for batch jobs.
DDD SoCal: Decompose your monolith: Ten principles for refactoring a monolith (Chris Richardson)
This is a talk I gave at DDD SoCal.
1. Make the most of your monolith
2. Adopt microservices for the right reasons
3. It’s not just architecture
4. Get the support of the business
5. Migrate incrementally
6. Know your starting point
7. Begin with the end in mind
8. Migrate high-value modules first
9. Success is improved velocity and reliability
10. If it hurts, don’t do it
Featuring a brief overview of fault-tolerance mechanisms across various Big Data systems such as the Google File System (GFS), Amazon Dynamo, Bigtable, Hadoop MapReduce, and Facebook's Cassandra, along with a description of an existing fault-tolerant model.
Integration architecture for the hybrid and multi-cloud enterprise
It is a given that most enterprises are now spread between on-premises and cloud, resulting in a need to perform integration across this hybrid architecture. Furthermore, most customers are seeing, or at least predicting, a multi-cloud architecture: multiple clouds from multiple vendors, providing a variety of different platforms, which brings a whole new set of integration challenges.
We will look at how integration architecture has evolved from service-oriented architecture to take advantage of cloud-native technologies and microservices principles. We will also discuss how integration is affected by multi-cloud issues and what the typical resolutions are. Also available as a webinar: http://ibm.biz/MultiCloudIntegrationArchitectureWebinar
OLTPBenchmark is a multi-threaded load generator. The framework is designed to produce variable-rate, variable-mixture load against any JDBC-enabled relational database. It also provides data collection features, e.g., per-transaction-type latency and throughput logs.
Together with the framework we provide the following OLTP/Web benchmarks:
TPC-C
Wikipedia
Synthetic Resource Stresser
Twitter
Epinions.com
TATP
AuctionMark
SEATS
YCSB
JPAB (Hibernate)
CH-benCHmark
Voter (Japanese "American Idol")
SIBench (Snapshot Isolation)
SmallBank
LinkBench
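The variable-rate idea behind such a load generator can be sketched as follows (a hedged illustration of the general technique, not OLTPBenchmark's actual code; the stand-in "transaction" is just a callable):

```python
import threading
import time

def run_load(txn, rate_per_sec, duration_sec, workers=4):
    """Fire `txn` at roughly `rate_per_sec` across several worker threads.

    Each worker sleeps workers/rate between calls, so the combined rate
    approximates the target; per-call latencies are collected, mirroring
    the per-transaction-type logs a benchmark framework would keep.
    """
    latencies = []
    lock = threading.Lock()
    interval = workers / rate_per_sec
    deadline = time.monotonic() + duration_sec

    def worker():
        while time.monotonic() < deadline:
            start = time.monotonic()
            txn()
            with lock:
                latencies.append(time.monotonic() - start)
            time.sleep(interval)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

# Usage: a no-op "transaction" at ~50 tx/s for half a second.
lats = run_load(lambda: None, rate_per_sec=50, duration_sec=0.5, workers=2)
```

Changing `rate_per_sec` mid-run (or drawing `txn` from a weighted mix of transaction types) is what turns this skeleton into a variable-rate, variable-mixture generator.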
Building Real-time Push APIs Using Kafka as the Customer Facing Interface... (HostedbyConfluent)
At Mercedes-Benz, we offer APIs for business customers that provide them with real-time diagnostics and status data from their Mercedes-Benz vehicles. The talk covers how we have successfully used Kafka as the customer-facing interface for our Push API, earning the Best Automotive API 2022 award (API:World), and our lessons learned after two years in production.
Moreover, we will discuss how we used OAuth authentication in Kafka to offer the same authentication mechanisms for our Push APIs as for our ordinary REST APIs.
We will also look at our unified API documentation using OpenAPI and AsyncAPI, and how we documented the Kafka servers, topics, and payloads using AsyncAPI.
Finally, we will discuss some of the challenges and decisions involved in keeping our Kafka-based Push API running at scale, e.g. topic architecture (number of topics, partitioning, etc.) and allowed consumer groups.
This session introduces Aurora, the Amazon managed database service (RDS) that guarantees scalability at five times the performance and one tenth the price of existing commercial databases, along with how to use it and how to migrate from an existing database. As of re:Invent in October, Aurora is available in the Tokyo region.
CSI – IT2020, IIT Mumbai, October 6th 2017
Computer Society of India, Mumbai Chapter
The presentation focuses on microservices architecture and compares microservices with standard monolithic apps and SOA-based apps. It also gives a quick outline of Domain-Driven Design, Event Sourcing and CQRS, and Functional Reactive Programming, and compares the saga pattern with two-phase commit.
http://www.csimumbai.org/it2020-17/index.html
CMP315: Optimizing Network Performance for Amazon EC2 Instances (Amazon Web Services)
Many customers are using Amazon EC2 instances to run applications with high performance networking requirements. In this session, we provide an overview of Amazon EC2 network performance features—such as enhanced networking, ENA, and placement groups—and discuss how we are innovating on behalf of our customers to improve networking performance in a scalable and cost-effective manner. We share best practices and performance tips for getting the best networking performance out of your Amazon EC2 instances.
Multi-Tenancy Development Challenges and Solutions (using ASP.NET Core, EF Core, and other Microsoft technologies). Based on experience developing the aspnetboilerplate.com framework.
Event-Driven Microservices architecture has gained a lot of attention recently. The trend in the industry is to move away from Monolithic applications to Microservices to innovate faster. While Microservices have their benefits, implementing them is hard. This talk focuses on the challenges faced and how to solve them.
It covers topics like using Domain Driven Design to break functionality into small parts. Various communication patterns among Microservices are also discussed.
One major drawback is the problem of distributed data management, as each microservice has its own database. Event-Driven Architecture enables microservices to work together, and the talk shows how to use architectural patterns like Event Sourcing and CQRS to implement them.
Another implementation challenge is managing transactions that update entities owned by multiple services in an eventually consistent fashion. This challenge is solved using sagas, which can be thought of as long-running transactions that use compensating actions to handle failures.
The objective of the talk is to show how to implement highly distributed Event Driven Microservices architecture that are scalable and easy to maintain.
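The compensating-action mechanism behind sagas can be sketched in a few lines (a minimal illustration; the order/payment step names are hypothetical, not from the talk):

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps.

    Actions run in order. If one raises, the compensations of the steps
    that already succeeded run in reverse order, restoring an eventually
    consistent state without a global distributed transaction.
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return False
    return True

def fail_payment():
    raise RuntimeError("payment declined")

# Usage with hypothetical order/payment steps; the payment step fails,
# so the already-created order is compensated (cancelled).
log = []
ok = run_saga([
    (lambda: log.append("create order"), lambda: log.append("cancel order")),
    (fail_payment, lambda: log.append("refund")),
])
```

In a real system each action and compensation would be a call to a different service (often triggered by events), but the bookkeeping is the same: remember what succeeded and undo it in reverse on failure.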
For a long time, in order to achieve mutual TLS between Kafka brokers and their clients, we had to use long-lived certificates, which are a nightmare to manage at large scale. At TransferWise, we have around 300 microservices, and most of them use Kafka for async communication, stream processing, event sourcing, etc. We wanted to implement Kafka security in a way that reduced the maintenance burden on platform teams while making migration of diverse clients as simple as possible. In this talk, we will describe how we achieved that goal using SPIFFE with SPIRE and Envoy, requiring zero code changes on the client side.
Reactive programming is an asynchronous programming paradigm concerned with streams of information and the propagation of changes. It differs from imperative programming, which uses statements to change a program's state. Reactive architecture is nothing more than the combination of reactive programming and software architecture. Also known as reactive systems, the goal is to make the system responsive, resilient, elastic, and message-driven.
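The "propagation of changes" idea can be sketched with a tiny observable value (an illustration of the paradigm, not any particular reactive library's API):

```python
class Observable:
    """A value that pushes changes to subscribers instead of being polled."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)
        callback(self._value)  # emit the current value on subscription

    def set(self, value):
        self._value = value
        for callback in self._subscribers:
            callback(value)  # propagate the change downstream

# Usage: a derived value stays in sync without explicit re-reads.
celsius = Observable(0)
seen = []
celsius.subscribe(lambda c: seen.append(c * 9 / 5 + 32))
celsius.set(100)
# seen collects the derived Fahrenheit values as celsius changes
```

The imperative alternative would recompute the derived value at every site that reads it; here the change flows to dependents automatically, which is the essence of the reactive style.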
Operating a large-scale microservices infrastructure on AWS with Terraform - 이용욱, Samsung Electronics :: AWS Summit Seoul (Amazon Web Services Korea)
Operating a large-scale microservices infrastructure on AWS with Terraform
이용욱, Samsung Electronics
We share how to use Terraform modules to build a large-scale service infrastructure composed of dozens of microservices that run across diverse compute environments such as EC2 and ECS/EKS and use a variety of AWS services, how to organize the Terraform code needed to operate and manage it smoothly, and our experience testing Terraform code with Kitchen.
Enhanced Dynamic Web Caching: For Scalability & Metadata Management (Deepak Bagga)
Abstract: These days, web caching suffers from many problems, such as scalability, robustness, and metadata management. These problems degrade the performance of the network and can create frustrating situations for clients. This paper discusses several web caching schemes, such as Distributed Web Caching (DWC), Distributed Web Caching with Clustering (DWCC), Robust Distributed Web Caching (RDWC), and Distributed Web Caching for Robustness, Low Latency & Disconnection Handling (DWCRLD). Clustering improves retrieval latency and also helps provide load balancing in a distributed environment, but it cannot resolve the scalability issues, handle frequent disconnections of proxy servers easily, or manage the cluster's metadata in the network. This paper presents a strategy that enhances the clustering scheme to provide scalability even as the cluster grows, easy handling of frequent proxy-server disconnections, and a structure for proper management of the cluster's metadata. A comparative table then shows how it compares with these schemes.
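One widely used technique for keeping a distributed cache scalable as proxy servers join and leave is consistent hashing, sketched below (a general technique offered for context, not the paper's specific scheme; the proxy names are made up):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map cache keys to proxy nodes so that adding or removing a node
    only remaps a small fraction of keys, instead of reshuffling all."""

    def __init__(self, nodes, replicas=100):
        # Each node gets many virtual points on the ring to spread load.
        self._ring = []  # sorted (hash, node) points
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

# Usage: the same URL always lands on the same proxy.
ring = ConsistentHashRing(["proxy-a", "proxy-b", "proxy-c"])
owner = ring.node_for("/index.html")
```

When a proxy disconnects, only the keys that hashed to its ring segments move to neighbors, which is why this family of techniques is popular for the disconnection-handling problem the paper targets.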
Starting Your DevOps Journey – Practical Tips for Ops (Dynatrace)
To watch, please see:
https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html
Starting Your DevOps Journey: Practical Tips for Ops
In this webinar, Andreas Grabner, Chief DevOps Activist at Dynatrace, shares practical tips that all IT groups from Dev to Ops can use to start their DevOps journey quickly. With experience from hundreds of DevOps deployments, Andi provides insights it would take your team months or years to learn firsthand.
- Learn how everyone on your Ops team can use APM to better understand and monitor SLAs, Performance and End User Impact of their applications.
- Foster better collaboration between Ops and architects by extending basic system monitoring to monolith and microservices architectures.
- Shift-left your testing and QA by working with metrics that you and the architects agreed on up front, resulting in early relevant feedback and faster code deployments.
- Hear why changing the cultural mindset from “fear of change” to “Continuous Innovation and Optimization” is critical for success.
Andi is joined by guest speaker Brian Chandler, Systems Engineer at Raymond James, who shares commonly used Ops dashboards that increase collaboration across IT teams and proactively break down silos.
It’s impossible to overlook system design when it comes to tech interviews. In this article, we've covered the most frequently asked system design interview questions at almost every IT giant.
Exploring the problem of Microservices communication and how both Kafka and Service Mesh solutions address it. We then look at some approaches for combining both.
Presentation for Papers We Love at QCON NYC 17. I didn't write the paper, good people at Facebook did. But I sure enjoyed reading it and presenting it.
Streaming Data Integration - For Women in Big Data Meetup (Gwen "Chen" Shapira)
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk, we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
Modern data systems don't just process massive amounts of data, they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework and even talk about deployment options.
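The kind of real-time fraud check described above can be sketched as a sliding-window velocity counter (an illustration of the general idea, not the session's actual code; the threshold and field names are made up):

```python
from collections import deque

class VelocityCheck:
    """Flag a card that makes too many transactions within a time window."""

    def __init__(self, max_events, window_sec):
        self.max_events = max_events
        self.window_sec = window_sec
        self._events = {}  # card_id -> deque of recent timestamps

    def is_suspicious(self, card_id, ts):
        q = self._events.setdefault(card_id, deque())
        # Drop events that have slid out of the window.
        while q and ts - q[0] > self.window_sec:
            q.popleft()
        q.append(ts)
        return len(q) > self.max_events

# Usage: more than 3 transactions in 60 seconds is flagged.
check = VelocityCheck(max_events=3, window_sec=60)
flags = [check.is_suspicious("card-1", t) for t in (0, 10, 20, 30, 200)]
```

In a Kafka deployment this per-key windowed state would live in a stream processor consuming the transaction topic, but the core logic is exactly this: keep recent events per key, expire old ones, compare against a threshold.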
This session will go into best practices and detail on how to architect a near real-time application on Hadoop using an end-to-end fraud detection case study as an example. It will discuss various options available for ingest, schema design, processing frameworks, storage handlers and others, available for architecting this fraud detection application and walk through each of the architectural decisions among those choices.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best-practices guide outlines steps users can take to better protect their personal devices and information.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the use of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. The webinar also delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Queues, Pools and Caches -
Everything a DBA Should Know About Scaling Modern OLTP
Gwen (Chen) Shapira, Senior Consultant
The Pythian Group
cshapi@gmail.com
Scalability Problems in Highly Concurrent Systems
When we drive through a particularly painful traffic jam, we tend to assume that the jam has a cause:
road maintenance or an accident blocked traffic and created the slowdown. However, we often
reach the end of the traffic jam without seeing any visible cause.
Traffic researcher Prof. Sugiyama and his team showed that with sufficient traffic density, traffic jams
will occur with no discernible root cause. Traffic jams will form even when cars drive at a constant speed
on a circular one-lane track.[i]
“When a large number of vehicles, beyond the road capacity, are successively injected into the
road, the density exceeds the critical value and the free flow state becomes unstable.”[ii]
OLTP systems are built to handle a large number of small transactions. In these systems the main
requirement is servicing a large number of concurrent requests with low and predictable latency. Good
scalability for an OLTP system can be defined as “Achieving maximum useful concurrency from a shared
system”.[iii]
OLTP systems often behave exactly like the traffic in Prof. Sugiyama’s experiments: more and more
traffic is loaded onto the database until, inevitably, a traffic jam occurs, and we may not be able to find
any visible root cause for it. In a wonderful video, Andrew Holdsworth of Oracle’s Real World
Performance group shows how increasing traffic on a database server can dramatically increase latency
without any improvement in throughput, and how reducing the number of connections to the
database can improve performance.[iv]
In this presentation, I’ll discuss several design patterns and frameworks that are used to improve
scalability by controlling concurrency in modern OLTP systems and web-based architectures.
All the patterns and frameworks I’ll discuss are considered part of the software architecture. DBAs often
take little interest in the design and architecture of the applications that use the database. But
databases never operate in a vacuum: DBAs who understand application design can have a better dialog
with the software team when it comes to scalability, and progress beyond finger pointing and “The
database is slow” blaming. These frameworks require sizing, capacity planning and monitoring, tasks
that DBAs are better qualified for than software developers. I’ll go into detail on how DBAs can help
size and monitor these systems with database performance in mind.
Connection Pools
The Problem:
Scaling application servers is a well understood problem. Through use of horizontal scaling and stateless
interactions it is relatively easy to deploy enough application capacity to support even thousands of
simultaneous user requests. This scalability, however, does not extend to the database layer.
Opening and closing a database connection is a high-latency operation, due to the network protocol
used between the application server and the database and the significant database resources involved
in establishing a session. Web applications and OLTP systems can't afford this latency on every user request.
The Solution:
Instead of opening a new connection for each application request, the application engine prepares a
certain number of open database connections and caches them in a connection pool.
In Java, the DataSource class is a factory for creating database connections and the preferred way of getting
a connection. Java defines a generic DataSource interface, and many vendors provide
their own DataSource implementations. Many, but not all, of the implementations also include connection
pooling.[v]
Using the generic DataSource interface, developers call getConnection(), and the DataSource class
provides the connection. Since developers write the same code regardless of whether the
DataSource class they are using implements pooling or not, asking a developer whether he is using
connection pooling is not a reliable way to determine if connection pooling is used.
To make things more complicated, the developer is often unaware of which DataSource class he is using.
The DataSource implementation will be registered with the Java Naming and Directory Interface (JNDI)
and can be deployed and managed separately from the application that is using it. Finding out which
DataSource is used and how the connection pool is configured can take some digging and creativity.
Most application servers contain a configuration file called "server.xml" or "context.xml" that holds the
various resource descriptions. Searching for a resource with type "javax.sql.DataSource" will find the
configuration of the DataSource class and the connection pool minimum and maximum sizes.
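For illustration, a minimal sketch of that search, assuming a Tomcat-style context.xml with DBCP attribute names; the resource name, URL and pool sizes below are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FindDataSources {
    // Hypothetical context.xml fragment; attribute names follow Tomcat's DBCP pool.
    static final String CONTEXT_XML =
        "<Context>" +
        "  <Resource name=\"jdbc/orders\" type=\"javax.sql.DataSource\"" +
        "            initialSize=\"10\" maxActive=\"40\"" +
        "            url=\"jdbc:oracle:thin:@dbhost:1521:ORCL\"/>" +
        "</Context>";

    /** Returns "name maxActive" for the first javax.sql.DataSource resource, or null. */
    static String findPooledDataSource(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList resources = doc.getElementsByTagName("Resource");
            for (int i = 0; i < resources.getLength(); i++) {
                Element r = (Element) resources.item(i);
                if ("javax.sql.DataSource".equals(r.getAttribute("type"))) {
                    return r.getAttribute("name") + " " + r.getAttribute("maxActive");
                }
            }
            return null;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findPooledDataSource(CONTEXT_XML)); // jdbc/orders 40
    }
}
```

In a real deployment the same search is usually done with grep over the deployed configuration files rather than code, but the type attribute is what identifies the pool either way.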
The Architecture:
[Diagram: the application business layer calls the application data layer, which looks up the DataSource
interface through JNDI; the DataSource implementation manages a connection pool on top of the JDBC driver.]
New problems:
1. When connection pools are used, all users share the same schema and the same sessions, so tracing
can be difficult. We advise developers to use DBMS_APPLICATION_INFO to set extra information
such as username (typically in the client_info field), module and action to assist in future
troubleshooting.
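A sketch of how that tagging might look from JDBC. The DBMS_APPLICATION_INFO.SET_MODULE and SET_CLIENT_INFO procedures are standard Oracle packages; the module and action values are hypothetical:

```java
public class SessionTagging {
    /** PL/SQL block that tags the pooled session for later tracing. */
    static String tagSessionSql() {
        return "BEGIN "
             + "DBMS_APPLICATION_INFO.SET_MODULE(?, ?); "
             + "DBMS_APPLICATION_INFO.SET_CLIENT_INFO(?); "
             + "END;";
    }
    // Usage sketch - run once after borrowing a connection from the pool:
    // try (CallableStatement cs = conn.prepareCall(tagSessionSql())) {
    //     cs.setString(1, "orders");    // module (hypothetical value)
    //     cs.setString(2, "checkout");  // action (hypothetical value)
    //     cs.setString(3, appUser);     // end-user name: all sessions share one schema
    //     cs.execute();
    // }
}
```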
2. Deciding on the size of a connection pool is the biggest challenge in using connection pools to
increase scalability. As always, the thing that gets us into trouble is the thing we don’t know
that we don’t know.
Most developers are well aware that if the connection pool is too small, the database will sit idle
while users are either waiting for connections or are being turned away. Since the scalability
limitation of small connection pools are known, developers tend to avoid them by creating large
connection pools, and increasing their size at the first hint of performance problems.
However, an oversized connection pool is a much greater risk to application scalability. Here is
what the scalability of an OLTP system typically looks like:[vi]
Amdahl’s Law says that the scalability of the system is constrained by its serial component, as the
users wait for shared resources such as IO and CPU (this is the contention delay). But
according to the Universal Scalability Law there is a second delay, the “coherency delay”:
the cost of maintaining data consistency in the system, which models waits on latches and
mutexes. After a certain point, adding more users to the system will decrease throughput.
Even while throughput is still growing, once it stops growing linearly,
requests start to queue and response times suffer proportionally:
If you check the wait events for a system that is past the point of saturation, you will see very
high CPU utilization, a high “log file sync” event as a result of the CPU contention, and high waits
for concurrency events such as “buffer busy waits” and “library cache latch”.
3. Even when the negative effects of too many concurrent users on the system are made clear,
developers still argue for oversized connection pools with the excuse that most of the
connections will be idle most of the time. There are two significant problems with this approach:
a. While we believe that most of the connections will be idle most of the time, we can’t be
certain that this will be the case. In fact, the worst performance issues I’ve seen were
caused by the application actually using the entire connection pool allocated.
This often happens when response times at the database already suffer for some
reason, and the application does not receive a response in a timely manner. At this point
the application or users rerun the operation, using another connection to run the exact
same query. Soon there are hundreds of connections to the database, all attempting to
run the same queries and waiting for the same latches.
b. Oversized connection pools have to be re-established during failover events or
database restarts. The larger the connection pool, the longer the application will take
to recover from a failover event, decreasing the availability of the application.
4. Connection pools typically allow setting minimum and maximum sizes for the pool. When the
application starts it will open connections until the minimum number of connections is met.
Whenever it runs out of connections, it will open new connections until it reaches the maximum
level. If connections are idle for too long, they will be closed, but never below the minimum
level. This sounds fairly reasonable, until you ask yourself - if we set the minimum to the
number of connections usually needed, when will the pool run out of connections?
A connection pool can be seen as a queue. Users arrive and are serviced by the database while
holding a connection. According to Little’s Law, the average number of connections in use
is (avg. DB response time) × (avg. user arrival rate). It is easy to see that you will run out of
connections if the rate at which users use your site increases, or if database performance
degrades and response times increase.
If your connection pool can grow at these times, it means that it will open new connections, a
resource-intensive operation as we previously noted, to a database that is already abnormally
busy. This will further slow things down, which can lead to a vicious cycle known as a "connection
storm". It is much safer to configure the connection pool to a specific size: the
maximum number of concurrent users that can run queries on the database with acceptable
performance. We’ll discuss later how to determine this size. This will ensure that during peak
times you will have enough connections to maximize throughput at acceptable latency, and no
more.
5. Unfortunately, even if you decide on a proper number of database connections, there is the
problem of multiple application servers. In most web architectures there are multiple web
servers, each with a separate connection pool, all connecting to the same database server. In
this case, it seems appropriate to divide the number of connections the DB will sustain by the
number of servers and size the individual pools by that number. The problem with this approach
is that load balancing is never perfect, so it is expected that some app servers will run out of
connections while others still have spare connections. In some cases the number of application
servers is so large that dividing the number of connections leaves less than one connection per
server.
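The Little's Law arithmetic from point 4 can be sketched directly; the traffic numbers below are invented for illustration:

```java
public class PoolSizing {
    /** Little's Law: average connections in use = arrival rate * DB response time. */
    static double avgConnectionsInUse(double arrivalsPerSec, double dbResponseSec) {
        return arrivalsPerSec * dbResponseSec;
    }

    public static void main(String[] args) {
        // 200 requests/sec at 50 ms average DB time: about 10 connections busy.
        System.out.printf("%.1f%n", avgConnectionsInUse(200, 0.050));
        // A slowdown that triples response time triples connection demand too,
        // which is exactly when a growable pool opens a flood of new connections.
        System.out.printf("%.1f%n", avgConnectionsInUse(200, 0.150));
    }
}
```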
Solutions to new problems:
As we discussed in the previous section, the key to scaling OLTP systems is limiting the number of
concurrent connections to a number that the database can reasonably support even when they are all
active. The challenge is in determining this number.
Keeping in mind that OLTP workloads are typically CPU-bound, the number of concurrent users the
system can support is limited by the number of cores on the database server. A database with 12 cores
can typically only run 12 concurrent CPU-bound sessions.
The best way to size the connection pool is by simulating the load generated by the application.
Running a load test on the database is a great way of figuring out the maximum number of concurrent
active sessions that can be sustained by the database. This should usually be done with assistance from
the QA department, as they probably already determined the mix of various transactions that simulates
the normal operations load.
It is important to test the number of concurrently active connections the database can support at its
peak. Therefore, while testing, it is critical to make sure that the database is indeed at full capacity and is
the bottleneck at the point when we decide the number of connections is maximal. This can be
reasonably validated by checking the CPU and IO queues on the database server and correlating with the
response times of the virtual users.
In usual performance tests, you try to decide on the maximum number of users the application can
support. Therefore, you run the test with an increasing number of virtual users until the response times
become unacceptable. However, when attempting to determine the maximum number of connections
in the pool, you should run the test with a fixed number of users and keep increasing the number of
connections in the connection pool until the database CPU utilization goes above 60%, the wait events
go from “CPU” to concurrency events, and response times become unacceptable. Typically all three of
these symptoms will start occurring at approximately the same time.
If a QA department and load testing tools are not available, it is possible to use the methodology
described by James Morle in his paper "Brewing Benchmarks" and generate load testing scripts from
trace files, which can later be replayed by SwingBench.
When running a load test is impractical, you will need to estimate the number of connections based on
available data. The factors to consider are:
1. How many cores are available on the database server?
2. How many concurrent users or threads does the application need to support?
3. When an application thread takes a connection from the pool, how much of the time is spent
holding the connection without actually running database queries? The more time the
application spends “just holding” the connection, the larger the pool will need to be to support
the application workload.
4. How much of the database workload is IO-bound? You can check IOWAIT on the database server
to determine this. The more IO-bound your workload is, the more concurrent users you can run
without running into concurrency contention (You will see a lot of IO contention though).
Four times the number of cores is a good starting point for the connection pool size: less if the
connections are heavily utilized by the application and there is little IO activity, more if the opposite is true.
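A sketch that codifies this rule of thumb. The 4x factor comes from the text; the exact adjustment weights below are arbitrary illustrations, not measured constants, and any real pool size should be validated with a load test:

```java
public class PoolHeuristic {
    /**
     * Starting point for pool size: cores x 4, shrunk when connections are busy
     * most of the time they are held, grown when the workload is IO-bound.
     * busyFraction and ioWaitFraction are both in [0, 1]; the 0.5 weight is
     * an assumed illustration, not a tuned value.
     */
    static int startingPoolSize(int cores, double busyFraction, double ioWaitFraction) {
        double size = cores * 4.0;
        size *= 1.0 - 0.5 * busyFraction;  // heavily used connections: fewer needed
        size *= 1.0 + ioWaitFraction;      // IO-bound sessions tolerate more concurrency
        return Math.max(cores, (int) Math.round(size));
    }

    public static void main(String[] args) {
        System.out.println(startingPoolSize(12, 0.0, 0.0)); // 48, the plain cores x 4
    }
}
```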
The remaining problem is what to do if the number of application servers is large and it is inefficient to
divide the connection pool limit among the application servers. Well-architected systems usually have a
separate data layer that can be deployed on separate set of servers. This data layer should be the only
component of the application allowed to open connections to the database, and it provides data objects
to the various application server components. In this architecture, the connections are divided between
the data-layer servers, of which there are typically much fewer.
This design has three great advantages: First, the data layer usually grows much slower than the
application and rarely requires new servers to be added, which means that pools rarely require resizing.
Second, application requests can be balanced between the data servers based on the remaining pool
capacity and third, if there is a need to add application-side caching to the system (such as Memcached),
only the data layer needs modification.
Application Message Queues
The Problem:
By limiting the number of connections from the application servers to the database, we are preventing a
large number of queries from queuing at the database server. If the total number of connections
allowed from application servers to the database is limited to 400, the run queue on the database will
not exceed 400 (at least not by much).
We discussed in the previous section why preventing excessive concurrency in the database layer is
critical for database scalability and latency. However, we still need to discuss how the application can
deal with the user requests that arrive when there is no free database connection to handle them.
Let’s assume that we limited the connection pool to 50 connections, and due to a slow-down in the
database, all 50 connections are currently busy servicing user requests. However, new user requests are
still arriving into the system at their usual rate. What shall we do with these requests?
1. Throw away the database request and return an error or static content to the user.
Some requests have to be serviced immediately. If the front page of your website can't load
within a few seconds, it is not worth servicing at all. Hopefully, the database is not a critical
component in displaying these pages (we'll discuss the options when we discuss caches). If it
does depend on the database and your connection pool is currently busy, you will want to
display a static page and hope the customer will try again later.
2. Place the request in queue for later processing.
Some requests can be put aside for later processing, giving the user the impression of
immediate return. For example, if your system allows the user to request reports by email, the
request can certainly be acknowledged and queued for off-line processing. This option can be
mixed with the first option – limit the size of the queue to N requests and display error
messages for the rest.
3. Give the request extra-high priority. The application can recognize that the request arrived from
the CIO and make sure it gets to the database ahead of any other user, perhaps cancelling
several user requests to get this done.
4. Give the request extra-low priority. Some requests are so non-critical that there is no reason to
even attempt serving them with low latency. If a user uses your application to send a message
to another user, and there is no guarantee on how soon the message will arrive, it makes sense
to tell the user the message was sent while in effect waiting until a connection in the pool is idle
before attempting to serve the message. Recurring events are almost always lower priority than
one-time events: a user signing up for the service is a one-time event and, if lost, will have
immediate business impact. Auditing user activity, on the other hand, is a recurring event, and a
delay will have lower business impact.
5. Some requests are actually a mix of requests from different sources, such as a dashboard. In
these cases it is best to display the different dashboard components as the data arrives, with
some components taking longer than others to show up.
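Options 1 and 2 above can be combined in a few lines; the queue bound and return values here are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RequestTriage {
    // Hypothetical bound: at most this many deferred requests before we degrade.
    static final int MAX_QUEUED = 1000;
    static final BlockingQueue<String> deferred = new ArrayBlockingQueue<>(MAX_QUEUED);

    /** Queue the request for later processing, or fall back to static content. */
    static String handle(String request) {
        if (deferred.offer(request)) {  // non-blocking; returns false when full
            return "ACCEPTED";          // user sees an immediate acknowledgement
        }
        return "STATIC_FALLBACK";       // queue full: serve a static page instead
    }
}
```

Background consumer threads would drain the deferred queue as database connections become available.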
In all those cases, the application is able to prioritise requests and decide on a course of action based on
information that the database did not have at the time. It makes sense to shift the queuing to the
application when the database is highly loaded, because the application is better able to deal with
the excess load.
Databases are not the only constrained resource; application servers have their own limitations
when dealing with excess load. Typically, application servers have a limited number of threads. This is
done for the same reason we limit the number of connections to the database servers: the server only
has a limited number of cores, and an excessive number of threads will overload the server without
improving throughput. Since database requests are usually the highest-latency action performed by an
application thread, when the database is slow to respond, all the application server threads can be busy
waiting for the database. The CPU on the application server will be idle while the application cannot
respond to additional user requests.
All this leads to the conclusion that from both the database perspective and the application perspective,
it is preferable to decouple the application requests from the database requests. This allows the
application to prioritise requests, hide latency and keep the application server and database server busy
but not overloaded.
The Solution:
Message queues provide an asynchronous communications protocol, meaning that the sender and
receiver of the message do not need to interact with the message queue at the same time. They can be
used by web applications and OLTP systems as a way to hide latency or variance in latency.
Java defines a common messaging API, JMS. There are multiple implementations of this API, both open
source and commercial. Oracle Advanced Queuing is bundled with Oracle RDBMS, both SE and EE, at no
extra cost. These implementations differ in their feature set, supported operations, reliability and
stability. The API supports queues for point-to-point messaging with a single publisher and a single
consumer. It also supports topics for the publish-subscribe model, where multiple consumers can
subscribe to various topics and receive the messages broadcast to the topic.
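The queue/topic distinction can be sketched with plain java.util.concurrent primitives rather than a real JMS provider; this toy topic ignores persistence, acknowledgements and failure handling:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class TinyTopic {
    // Each subscriber gets its own queue; publishing broadcasts to all of them.
    private final List<BlockingQueue<String>> subscribers = new CopyOnWriteArrayList<>();

    BlockingQueue<String> subscribe() {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        subscribers.add(q);
        return q;
    }

    void publish(String message) {
        for (BlockingQueue<String> q : subscribers) {
            q.offer(message);  // every subscriber receives its own copy
        }
    }
}
```

A point-to-point queue is the degenerate case: a single shared BlockingQueue that competing consumers poll, so each message is processed exactly once.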
Message queues are typically installed by system administrators as a separate server or component, just
like databases are installed and maintained. The message queue server is called "Broker", and is usually
backed by a database to ensure that messages are persistent even when the broker fails. The application
server then connects to the broker by a URL, and can publish and consume from queues by the queue
name.
The Architecture:
[Diagram: as before, with a message queue between the application business layer and the application
data layer; the data layer still reaches the database through the JNDI-registered DataSource interface,
the connection pool and the JDBC driver.]
New Problems:
There are some common myths related to queue management, which may make developers
reluctant to use queues when necessary:[vii]
1. It is impossible to reliably monitor queues
2. Queues are not necessary if you do proper capacity planning
3. Message queues are unnecessarily complicated. There must be a simpler way to achieve the
same goals.
Solutions to New Problems:
While queues are undeniably useful to improve throughput both at the database and application server
layers, they do complicate the architecture. Let’s tackle the myths one by one:
1. If it were indeed impossible to monitor queues, you would not be able to monitor the CPU, load
average, average active sessions, blocking sessions, disk IO waits or latches, because
all systems have many queues. The only question is where each queue is managed and how
easy it will be to monitor.
If you use Oracle Advanced Queues, V$AQ will show you the number of messages in the queue
and the average wait for messages in the queue, which is usually all you need to determine the
status of the queue. For the more paranoid, I'd recommend adding a heartbeat monitor: insert
a monitoring message into the queue at regular intervals, and check that your process can read it
from the queue and how long it took to arrive.
The more interesting question is what do you do with the monitoring information - at what
point will you send an alert to the on-call SA and what will you want her to do when she receives
the alert?
Any queuing system will have high variance in service times and arrival rates of work. If the
service time and arrival rates were constant, there would be no need for queues. The high variance
is expected to lead to spikes in system utilization, which can cause false alarms: the system is
behaving as it should, but messages are accumulating in the queue. Our goal is to give as early
notice as possible that there is a genuine issue with the system that should be resolved, and not to
send warnings when the system is behaving as expected.
To this end, I recommend monitoring the following parameters:
• Service time - this will be monitored at the consumer thread. The thread should track
(i.e. instrument) and log at regular intervals the average time it took to process a
message from the queue. If service time increases significantly (compared to a known
baseline, taking into account the known variance in response times), it can indicate a
slowdown in processing and should be investigated.
• Arrival rate should be monitored at the processes that are writing to the queue. How
many messages are inserted to the queue every second? This should be tracked for long
term capacity planning and to determine peak usage periods.
• Queue size - the number of messages in the queue. Using Little's Law we can measure
the amount of time a message spends in the queue (wait time) instead.
If queue size or wait time increase significantly, this can indicate a "business issue", i.e. an
impending breach of SLA. If the wait time frequently climbs to the point where SLAs are
breached, it indicates that the system does not have enough capacity to serve the
current workloads. In this case either service times should be reduced (i.e. tuning), or
more processing servers should be added. Note that queue size can and should go up
for short periods of time, and recovering from bursts can take a while (depending on the
service utilization), so this is only an issue if the queue size is high and does not start
declining within a few minutes, which would indicate that the system is not recovering.
• Service utilization - what percent of the time the consumer threads are busy. This can be
calculated as (arrival rate × service time) / (number of consumers).
The more the service is utilized, the higher the probability that when a new message
arrives, it will have other messages ahead of it in the queue, and since R = S + W, the
response times will suffer. Since we already measure the queue size directly, the main use
of service utilization is capacity planning, and in particular detection of over-provisioned
systems. For a known utilization and fixed service times, if we know the arrival rates will
grow by 50% tomorrow, you can calculate the expected effect on response times:[viii]
Note that by replacing many small queues on the database server with one (or a few)
centralized queues in the application, you are in a much better position to calculate
utilization and predict the effect on response times.
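As an illustration of that calculation, a sketch using the single-server M/M/1 approximation R = S/(1 − ρ). The real multi-consumer formula (Erlang C) is more involved, and the rates below are invented:

```java
public class QueueMath {
    /** Utilization: (arrival rate x service time) / number of consumers. */
    static double utilization(double arrivalRate, double serviceTime, int consumers) {
        return arrivalRate * serviceTime / consumers;
    }

    /** M/M/1 approximation of response time, R = S / (1 - rho); illustration only. */
    static double responseTime(double serviceTime, double rho) {
        return serviceTime / (1.0 - rho);
    }

    public static void main(String[] args) {
        double rho = utilization(40, 0.1, 8);    // 40 msg/s, 100 ms each, 8 consumers
        double grown = utilization(60, 0.1, 8);  // arrival rate up 50%
        // At rho = 0.5 response time is about 2x service time; at rho = 0.75, about 4x.
        System.out.printf("%.2f -> %.2f%n", responseTime(0.1, rho), responseTime(0.1, grown));
    }
}
```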
2. Queues are inevitable. Capacity planning or not, the fact that arrival rates and service times are
random will ensure that there will be times when requests will be queued, unless you plan to
turn away a large percentage of your business.
I suspect that what is really meant by "capacity planning will eliminate need for queues" is that
it is possible to over-provision a system in a way that the queue servers (consumers) will have
very low utilization. In this case queues will be exceedingly rare so it may make sense to throw
the queue away and have the application threads communicate with the consumers directly.
The application will then have to throw away any request that arrives when the consumers are
busy, but in this system it will almost never happen. This is “capacity planning by
overprovisioning”. I've worked on many databases that rarely exceeded 5% CPU. You'll still need
to closely monitor the service utilization to make sure you increase your capacity to keep
utilization low. I would not call this type of capacity planning "proper", though.
On the other hand, the introduction of a few well-defined and well-understood queues will help
capacity planning. If we assume fixed server utilization, the size of the queue is proportional to
the number of servers, so on some systems it is possible to do capacity planning just by
examining the queue sizes.
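The relationship between queue size, arrival rate and time in the system is Little's law, which can be sketched in a few lines (illustrative numbers only, not measurements from this paper):

```python
# Little's law: L = lambda * W (the number of items in the system equals
# the arrival rate times the average time each item spends in the system).

def items_in_system(arrival_rate_per_s, time_in_system_s):
    return arrival_rate_per_s * time_in_system_s

# One consumer absorbing 100 requests/s, each spending 50 ms in the system:
print(items_in_system(100, 0.05))    # 5.0

# Four consumers at the same per-server utilization absorb 4x the arrival
# rate, so at an unchanged time-in-system the item count scales with the
# number of servers, as the text argues:
print(items_in_system(400, 0.05))    # 20.0
```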
3. Message queues are indeed complicated and not always stable beasts. Queues are a simple
concept. How did we get to a point where we need all those servers, protocols and applications
simply to create a queue?
Depending on your problem definition, it is possible that message queues are excessive
overhead. Sometimes all you need is a memory structure and a few pointers. My colleague Marc
Fielding created a high-performance queue system with a database table and two jobs. Some
developers consider even the database excessive overhead and prefer to implement their queues with
a file, split and xargs. If this satisfies your requirements, then by all means use those solutions.
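A database-table queue of this kind can be sketched as follows. This is a hypothetical minimal version using SQLite, not Marc Fielding's actual implementation (which the paper does not show): producers INSERT rows, and a consumer job claims the oldest pending row and marks it done.

```python
import sqlite3

# A minimal database-table queue: one table, a 'pending'/'done' status column.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT,
    status TEXT DEFAULT 'pending')""")

def enqueue(payload):
    conn.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
    conn.commit()

def dequeue():
    # Claim the oldest pending message. In a real multi-consumer system this
    # SELECT-then-UPDATE must be made atomic (e.g. SELECT ... FOR UPDATE
    # SKIP LOCKED on databases that support it).
    row = conn.execute(
        "SELECT id, payload FROM queue WHERE status = 'pending' "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE queue SET status = 'done' WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

enqueue("msg-1")
enqueue("msg-2")
print(dequeue())  # msg-1
print(dequeue())  # msg-2
print(dequeue())  # None
```

Even this toy version hints at the requirements creep the text describes next: crash recovery, multiple consumers and acknowledgements all complicate it quickly.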
In other cases, I've attempted to implement a simple queueing solution, but the requirements
kept piling up: What if we want to add more consumers? What if the consumer crashed and
only processed some of the messages it retrieved? By the time I finished tweaking my system to
address all the new requirements, it would have been far easier to use an existing solution. So I advise
using home-grown solutions only if you are reasonably certain the requirements will remain simple. If
you suspect that you'll have to start dealing with multiple subscribers, which may or may not
need to retrieve the same message multiple times, may or may not want to acknowledge messages,
and may or may not want to filter specific message types, then I recommend using an
existing solution.
ActiveMQ and RabbitMQ (acquired by SpringSource) are popular open source implementations, and
Oracle Advanced Queuing is free if you already have an Oracle RDBMS license. When choosing an
off-the-shelf message queue, it is important to understand how the system can be monitored and
to make sure that queue size, wait times and availability of the queue can be tracked by your
favorite monitoring tool. If high availability is a requirement, this should also be taken into
account when choosing a message queue provider, since different queue systems support
different HA options.
Application Caching:
The Problem:
The database is a sophisticated and well-optimized caching machine, but as we saw when we discussed
connection pools, it has its limitations when it comes to scaling. One of those limitations is that a single
database machine is limited in the amount of RAM it has, so if your data working set is larger than the
amount of memory available, your application will have to access the disk occasionally. Disk access is
roughly 10,000 times slower than memory access. Even a slight increase in the amount of disk access your
queries have to perform, the kind that happens naturally as your system grows, can have a devastating
impact on database performance.
With Oracle RAC, more cache memory is available by pooling memory from multiple machines
into a global cache. However, the performance improvement from the additional servers is not
proportional to what you'd see if you added more memory to the same machine. Oracle has to
maintain cache consistency between the servers, and this introduces significant overhead. RAC can
scale, but not in every case, and it requires careful application design to make this happen.
The Solution:
Memcached is a distributed, memory-only, key-value store. It can be used by the application server to
cache results of database queries that are used multiple times. The great benefit of Memcached is
that it is distributed and can use free memory on any server, allowing caching to be done outside of
Oracle's scarce buffer cache. If you have 5 application servers and allocate 1G of RAM to Memcached
on each server, you have 5G of additional cache.
Memcached's cache is an LRU, just like the buffer cache. If the application tries to store a new key and
there is no free memory, the oldest item in the cache is evicted and its memory used for the new
key.
According to the documentation, Memcached scales very well when adding servers because
the servers do not communicate with each other at all. Each client has a list of available servers and a
hash function that tells it which server holds the value for which key. When the
application requests data from the cache, it connects to a single server and accesses exactly one key. When
a single cache node crashes, there will be more cache misses and therefore more database requests,
but the rest of the nodes will continue operating as usual.
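The key-to-server mapping can be sketched like this (a simplified illustration of client-side hashing, not the code of any particular Memcached client; real clients typically use consistent hashing so that adding a node remaps only a fraction of the keys):

```python
import hashlib

# Each client hashes the key and picks one server from its own list;
# the servers themselves never talk to each other.
servers = ["cache1:11211", "cache2:11211", "cache3:11211"]

def server_for(key, servers):
    """Map a key to exactly one server via a hash of the key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

for key in ["username:42", "username:43", "order:7"]:
    print(key, "->", server_for(key, servers))
```

Because every client applies the same hash to the same server list, all clients agree on which node holds each key without any coordination.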
I was unable to find any published benchmarks that confirm this claim, so I ran my own unofficial
benchmark, using Amazon's ElastiCache, a service which allows one to create a Memcached cluster and
add nodes to it.
A few comments regarding the use of Amazon's ElastiCache and how I ran the tests:
1. Amazon's ElastiCache is only usable from servers on Amazon's EC2 cloud. To run the test, I
created an ElastiCache cluster with two small servers (1.3G RAM, 1 virtual core), and one EC2
micro node (613 MB, up to two virtual cores for short bursts) running Amazon's Linux
distribution.
2. I ran the test using Brutis [ix], a Memcached load test framework written in PHP. The test is fairly
configurable, and I ran it as follows:
• 7 gets to 3 sets read/write mix; all reads and writes were random. Values were limited
to 256 bits.
• The first test ran with a key space of 10K keys, which fits easily in the memory of one
Memcached node. The node was pre-warmed with the keys.
• The second test ran with the same key space and two nodes, both pre-warmed.
• The third test was one node again, with 1M keys, which do not fit in the memory of one or two
nodes, and no pre-warming of the cache.
• The fourth test used two nodes and 1M keys. The second node was added after the first node
was already active.
• The first 3 tests ran for 5 minutes each; the fourth ran for 15 minutes.
• The single-node tests ran with 2 threads, and the two-node tests ran with four.
3. Amazon's cloud monitoring framework was used to monitor Memcached's statistics. It had two
annoying properties: it did not automatically refresh, and the values it showed were always 5
minutes old. In the future, it would be worth the time to install my own monitoring software on an
EC2 node to track Memcached performance.
Here is a chart of the total number of gets we could run on each node:
[chart: total gets per node]
Number of hits and misses per node:
[chart: hits and misses per node]
A few conclusions from the tests I ran:
1. In the tests I ran, get latency was 2ms on the AWS cluster and 0.0068ms on my desktop. It appears
that essentially the only latency you'll experience with Memcached is the network latency.
2. The ratio of hits to misses did not affect the total throughput of the cluster. The throughput was
somewhat better with a larger key space, possibly due to fewer get collisions.
3. Throughput dropped when I added the second server, and total throughput never exceeded 60K
gets per minute. It is likely that in the configuration I ran, the client could not sustain more than
60K gets per minute.
4. 60K random reads per minute at 2ms latency is pretty impressive for two very small servers,
rented at 20 cents an hour. You would need a fairly high-end configuration to get the same
performance from your database.
By using Memcached (or other application-side caching), the load on the database is reduced, since
there are fewer connections and fewer reads. Database slowdowns will have less impact on
application responsiveness: since on many pages most of the data arrives from the cache, the page can
display gradually without users feeling that they are waiting forever for results. Even better, if the
database is unavailable, you can still maintain partial availability of the application by displaying cached
results; in the best case, only write operations will be unavailable while the database is down.
The Architecture:
[Diagram: the Application Business Layer communicates with the Application Data Layer through a
Message Queue; the Data Layer consults Memcached, and reaches the database through the
DataSource Interface (a DataSource registered in JNDI), the Connection Pool and the JDBC Driver.]
New Problems:
Unlike Oracle's buffer cache, which is used by queries automatically, use of the application cache does
not happen automatically and requires code changes to the application. In this sense it is somewhat
similar to Oracle's result cache: it stores results on request, rather than caching data blocks
automatically. The changes required to use Memcached are usually made in the data layer: the code
that queries the database is replaced by code that queries the database only if the result was not found
in the cache first.
This places the burden of using the cache properly on the developers. It is said that the only difficult
problems in computer science are naming things and cache invalidation. The purpose of this paper is not
to solve the most difficult problems in computer science, but we will offer some advice on proper use of
Memcached.
In addition, Memcached presents the usual operational questions: how big should it be, and how can it
be monitored? We will discuss capacity planning and monitoring of Memcached as well.
Solutions to new problems:
The first step in integrating Memcached into your application is to rewrite the functions in your data
layer so that they look for data in the cache before querying the database.
For example, the following:
function get_username(int userid) {
    username = db_select("SELECT username FROM users WHERE userid = ?",
                         userid);
    return username;
}
Will be replaced by:
function get_username(int userid) {
    /* first try the cache */
    username = memcached_fetch("username:" + userid);
    if (!username) {
        /* not found: query the database */
        username = db_select("SELECT username FROM users WHERE userid = ?",
                             userid);
        /* then store in cache until the next get */
        memcached_add("username:" + userid, username);
    }
    return username;
}
We will also need to change the code that updates the database so it updates the cache as well;
otherwise we risk serving stale data:
function update_username(int userid, string username) {
    /* first update the database */
    result = db_execute("UPDATE users SET username = ? WHERE userid = ?",
                        username, userid);
    if (result) {
        /* database update successful: update cache */
        memcached_set("username:" + userid, username);
    }
}
Of course, not every function should be cached. The cache has limited size, and there is overhead in
checking the cache for data that is not actually there. The main benefit comes from caching the results
of expensive or highly redundant queries.
To use the cache effectively without risking data corruption, keep the following in mind:
1. Use ASH data to find the queries that consume the most database time. Queries that take a
significant amount of time to execute, and short queries that execute very often, are good candidates
for caching. Of course, many of these queries use bind variables and return different results for each
user. As we showed in the example, the bind variables can be used as part of the cache key to
store and retrieve results for each combination of binds separately. Due to the LRU nature of the
cache, commonly used combinations will remain in the cache and get reused, while infrequently used
combinations will be evicted.
2. Memcached takes large amounts of memory (the more the merrier!), but there is evidence [x] that
it does not scale well across a large number of cores. This makes Memcached a good candidate to
share a server with an application that makes intensive use of the CPU and doesn't require much
memory. Another option is to create multiple virtual machines on a single multi-core
server and install Memcached on each of them. However, this configuration means
that you will lose most of your caching capacity if that single physical server crashes.
3. Memcached is not durable. If you can't afford to lose specific information, store it in the
database before you store it in Memcached. This seems to imply that you can't use Memcached
to scale a system that primarily performs a large number of writes. In fact, it depends on the
exact bottlenecks: if your top wait event is "log file sync", you can use Memcached to reduce
the total amount of work the database does, reduce the CPU load, and therefore potentially
reduce the "log file sync" waits.
4. Some data should eventually be stored but can be lost without critical impact on the system.
Instrumentation and logging information is definitely in this category. Such information can be
kept in Memcached and written to the database infrequently, in batches.
5. Consider pre-populating the cache: if you rely on Memcached to keep your performance
predictable, the crash of a Memcached server will send significant amounts of traffic to the
database, and the effect on performance will be noticeable. When the server comes back, it can
take a while until all the data is loaded into the cache again, prolonging the period of reduced
performance. To shorten the slow period after a restart, consider a script that
pre-loads data into the cache when the Memcached server starts.
6. Consider very carefully what to do when the data is updated:
Sometimes it is easy to update the cache simultaneously: if a user changes their address and the
address is stored in the cache, update the cache immediately after updating the database. This
is the best-case scenario, as the cache remains useful through the update. The Memcached API contains
functions that allow changing data atomically and avoiding race conditions.
When the data in the cache is aggregated data, it may not be possible to update it in place, but
it will be possible to evict the current information as stale and reload it into the cache when it
is next needed. This can make the cache useless when the data is updated and reloaded very
frequently.
Sometimes it isn't even possible to figure out which keys should be evicted from the cache when a
specific field is updated, especially if the cache contains results of complex queries. This
situation is best avoided, but it can be dealt with by setting an expiration time on the data and
preparing to serve possibly-stale data for that period of time.
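The expiration-time fallback can be illustrated with a minimal in-process TTL cache. This is a sketch of the idea only, not Memcached itself (Memcached accepts an expiration time as a parameter of its set operation):

```python
import time

# Minimal TTL cache: entries expire after ttl_seconds, so possibly-stale
# data is served for at most that long after an update we failed to track.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}   # key -> (value, expires_at)

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.store[key]   # expired: force a reload from the database
            return None
        return value

cache = TTLCache(ttl_seconds=30)
cache.set("report:totals", {"orders": 1200})
print(cache.get("report:totals"))  # {'orders': 1200}
```

The TTL bounds the staleness window: after an untracked update, readers see the old value for at most ttl_seconds before a miss forces a fresh database read.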
How big should the cache be?
• It is better to have many servers with less memory each than a few servers with a lot of memory.
This minimizes the impact of one crashed Memcached server. Remember that there is no
performance penalty for a large number of nodes.
• Losing a Memcached instance will always send additional traffic to the database. You need
enough Memcached servers to make sure the extra traffic will not cause unacceptable
latency in the application.
• There is no downside to a cache that is too large, so in general allocate to Memcached all the
memory you can afford.
• If the average number of gets per item is very low, you can safely reduce the amount of memory
allocated.
• There is no "cache size advisor" for Memcached, and it is impossible to predict the effect of
growing or shrinking the cache based on the monitoring data available from Memcached alone.
SimCache is a tool that, based on detailed hit/miss logs from the existing Memcached, can simulate
an LRU cache and predict the hit/miss ratio at various cache sizes. In many environments
keeping such a detailed log is impractical, but tracking a sample of the requests may be possible
and can still be used to predict cache effects.
• Knowing the average latency of database reads under various loads, and the latency of
Memcached reads, should allow you to predict changes in response time as the Memcached size and
its hit ratio change. For example: suppose SimCache shows that with a cache size of 10G you will have
a hit ratio of 95% in Memcached, and that Memcached has a latency of 1ms in your system. With 5%
of the queries hitting the database, you expect the database CPU utilization to be around 20%, almost
100% of the DB Time on CPU, and almost no wait time on the queue between the business and data
layers (you tested this separately when sizing your connection pool). In this case the database
latency will be 5ms, so the expected average latency for the data layer is
0.95*1 + 0.05*5 = 1.2ms.
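The SimCache-style estimation and the latency arithmetic above can be sketched together: replay a request trace through a toy LRU cache of a candidate size to estimate the hit ratio, then blend the cache and database latencies. This is a toy model with made-up trace data, not SimCache itself:

```python
from collections import OrderedDict

def simulate_lru(trace, cache_size):
    """Replay a trace of keys through an LRU cache; return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict the least recently used
    return hits / len(trace)

def data_layer_latency(hit_ratio, cache_ms, db_ms):
    """Expected latency = hit_ratio * cache + (1 - hit_ratio) * database."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * db_ms

# The example from the text: 95% hits at 1ms, misses served by the DB at 5ms.
print(round(data_layer_latency(0.95, 1, 5), 2))   # 1.2

# A toy trace with a few hot keys; compare two candidate cache sizes.
trace = ["a", "b", "a", "c", "a", "b", "d", "a"]
for size in (2, 4):
    hr = simulate_lru(trace, size)
    print(size, hr, round(data_layer_latency(hr, 1, 5), 2))
```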
How do I monitor Memcached?
• Monitor the number of items, gets, sets and misses. An increase in the number of cache misses
means that the database load is increasing at the same time, and can indicate that more
memory is needed. Make sure that the number of gets is higher than the number of sets: if
you are setting more than getting, the cache is a waste of space. If the number of gets per item
is very low, the cache may be oversized. There is no downside to an oversized cache, but you
may want to use the memory for another purpose.
• Monitor the number of evictions. Data is evicted when the application attempts to store a new
item and there is no memory left. An increase in the number of evictions can also indicate that
more memory is needed. The eviction time shows the time between the last get of an item and its
eviction; if this period is short, it is a good indication that a memory shortage is making the cache
less effective.
• It is important to note that a low hit rate and a high number of evictions do not immediately mean
you should buy more memory. It is possible that your application is misusing the cache:
o Maybe the application sets large numbers of keys, most of which are never read again.
In this case you should reconsider the way you use the cache.
o Maybe the TTL for the keys is too short. In this case you will see a low hit rate but not
many evictions.
o Maybe the application frequently tries to get items that don't exist, perhaps due to data
purging of some sort. Consider setting the key with a "null" value, to make sure the
invalid searches do not hit the database over and over.
• Monitor for swapping. Memcached is intended to speed up performance by caching data in
memory; if that data spills to disk, it is doing more harm than good.
• Monitor the average response time. You should see very few requests that take over 1-2ms;
longer wait times can indicate that you are hitting the maximum connection limit for the server,
or that CPU utilization on the server is too high.
• Monitor that the number of connections to the server does not approach Memcached's
(configurable) max connections setting.
• Do not monitor "stats sizes" for statistics about the size of items in the cache; this command
locks up the entire cache.
All the values mentioned above can be read from Memcached using the stats command in its protocol.
You can run this command and get the results directly by connecting to port 11211 with telnet. Many
monitoring systems, including Cacti and Ganglia, include monitoring templates for Memcached.
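A monitoring script can send the plain-text stats command to port 11211 and parse the lines of the form "STAT <name> <value>" that come back. Here is a sketch of the parsing step, run against captured example output rather than a live server (the counter values below are made up for illustration):

```python
# Parse memcached `stats` output: each line is "STAT <name> <value>",
# terminated by "END". Sample output stands in for a live connection.
sample = """STAT curr_items 10432
STAT cmd_get 502311
STAT cmd_set 120473
STAT get_hits 451208
STAT get_misses 51103
STAT evictions 1842
END"""

def parse_stats(text):
    """Turn `stats` output into a dict of counter name -> integer value."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = int(parts[2])
    return stats

stats = parse_stats(sample)
hit_ratio = stats["get_hits"] / (stats["get_hits"] + stats["get_misses"])
print(f"hit ratio: {hit_ratio:.1%}")
```

From these counters you can derive the metrics discussed above: the hit ratio, the gets-to-sets ratio (cmd_get vs cmd_set), and the eviction rate between samples.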
i. Yuki Sugiyama, Minoru Fukui, Macoto Kikuchi, Katsuya Hasebe, Akihiro Nakayama, Katsuhiro Nishinari, Shin-ichi Tadaki, Satoshi Yukawa, "Traffic jams without bottlenecks - experimental evidence for the physical mechanism of the formation of a jam", New Journal of Physics, Vol. 10 (2008), 033001
ii. http://www.telegraph.co.uk/science/science-news/3334754/Too-many-cars-cause-traffic-jams.html
iii. James Morle, Scaling Oracle8i™: Building Highly Scalable OLTP System Architectures
iv. http://www.youtube.com/watch?v=xNDnVOCdvQ0
v. http://docs.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/datasource.html
vi. http://www.perfdynamics.com/Manifesto/USLscalability.html
vii. http://teddziuba.com/2011/02/the-case-against-queues.html
viii. http://www.cmg.org/measureit/issues/mit62/m_62_15.html
ix. http://code.google.com/p/brutis/
x. http://assets.en.oreilly.com/1/event/44/Hidden%20Scalability%20Gotchas%20in%20Memcached%20and%20Friends%20Presentation.pdf