- Understanding Time Series
- What's the Fundamental Problem
- Prometheus Solution (v1.x)
- New Design of Prometheus (v2.x)
- Data Compression Algorithm
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang, Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk dives into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
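The two requirements above can be illustrated with a small toy in Python (this is an illustrative sketch only; Spark's actual Data Source API is a Scala/Java interface with different names). Generality comes from a common reader interface; flexibility comes from letting capable sources accept pushed-down filters.

```python
from abc import ABC, abstractmethod

# Toy sketch (not Spark's real API): a data source abstraction where the
# engine can push a row-level filter down to sources that support it.
class DataSourceReader(ABC):
    @abstractmethod
    def read(self):
        """Yield rows as dicts."""

class FilterPushdownMixin:
    """Sources mixing this in apply filters during the scan itself."""
    def push_filter(self, predicate):
        self._predicate = predicate

class InMemorySource(DataSourceReader, FilterPushdownMixin):
    def __init__(self, rows):
        self._rows = rows
        self._predicate = None

    def read(self):
        for row in self._rows:
            if self._predicate is None or self._predicate(row):
                yield row

def scan(source, predicate=None):
    # Generality: any DataSourceReader works. Flexibility: capable sources
    # get the filter pushed down; others are filtered engine-side.
    if predicate is not None and isinstance(source, FilterPushdownMixin):
        source.push_filter(predicate)
        return list(source.read())
    return [r for r in source.read() if predicate is None or predicate(r)]
```

The real API generalizes the same idea to column pruning, partitioning, and transactional writes.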
This is a talk on how you can monitor your microservices architecture using Prometheus and Grafana. It includes easy-to-follow steps for getting a monitoring stack running on your local machine using Docker.
Apache Iceberg: An Architectural Look Under the Covers, ScyllaDB
Data lakes have been built with a desire to democratize data – to allow more and more people, tools, and applications to make use of it. A key capability needed to achieve this is hiding the complexity of the underlying data structures and physical storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
The Apache Iceberg table format is now in use at, and contributed to by, many leading tech companies such as Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
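The core structural change can be sketched in a few lines of Python (a toy model with made-up class names, not Iceberg's actual implementation): a table is a pointer to an immutable snapshot, each snapshot lists its data files, and every commit produces a new snapshot instead of mutating state in place.

```python
import itertools

# Toy model of snapshot-based table metadata. Each commit creates a new
# immutable snapshot; the table just moves a pointer to the latest one.
class Snapshot:
    _ids = itertools.count(1)

    def __init__(self, data_files, parent=None):
        self.snapshot_id = next(Snapshot._ids)
        self.data_files = tuple(data_files)  # immutable list of files
        self.parent = parent

class Table:
    def __init__(self):
        self.current = Snapshot([])
        self.history = [self.current]

    def append(self, new_files):
        snap = Snapshot(self.current.data_files + tuple(new_files),
                        parent=self.current)
        self.current = snap
        self.history.append(snap)

    def delete_file(self, path):
        remaining = [f for f in self.current.data_files if f != path]
        snap = Snapshot(remaining, parent=self.current)
        self.current = snap
        self.history.append(snap)
```

A query plans against exactly one snapshot, so readers never observe a half-finished commit, and older snapshots stay queryable (time travel).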
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
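The relationship between topics, partitions, and offsets in the outline above can be sketched as a toy commit log in Python (names here are illustrative, not the real client API): keyed records hash to a stable partition, and each partition is an append-only log addressed by offset.

```python
from hashlib import md5

# Toy sketch of a topic: a fixed set of partitions, each an append-only
# log. Consumers track their position per partition via offsets.
class TopicLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash: the same key always lands on the same partition,
        # which preserves per-key ordering.
        digest = int(md5(key.encode()).hexdigest(), 16)
        return digest % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        return self.partitions[partition][offset]
```

This is why throughput scales with partitions: each partition is an independent sequential log that can live on a different broker.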
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB), which means our performance and reliability depend heavily on RocksDB. Beyond MyRocks, we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
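The key structural difference from InnoDB is that RocksDB is an LSM-tree rather than an update-in-place B-tree. A minimal sketch of the idea (a toy, not RocksDB's implementation): writes land in an in-memory memtable that is periodically flushed to immutable sorted files, and reads check the newest data first.

```python
# Toy LSM-tree: memtable absorbs writes; full memtables are flushed to
# immutable sorted runs ("SSTables"); reads search newest-first so later
# writes shadow earlier ones.
class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []               # newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: memtable becomes an immutable sorted run on "disk".
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:      # newer runs shadow older ones
            if key in table:
                return table[key]
        return None
```

Real engines add compaction to merge runs and bloom filters to skip them, but the write-optimized shape (sequential flushes instead of random page updates) is the essential contrast with InnoDB.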
Mario Molina, Software Engineer
CDC systems are typically used to identify changes in data sources, then capture and replicate those changes to other systems. Companies use CDC to sync data across systems, for cloud migration, or even to apply stream processing, among other uses.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
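The essence of the CDC pattern can be sketched in a few lines (field names like "op", "before", and "after" are assumed here for illustration, loosely following the Debezium-style event shape): each change event carries before/after row images, and a consumer replays them against a downstream copy.

```python
# Toy CDC consumer: replay change events against a downstream replica
# (a dict keyed by primary key) to keep it in sync with the source.
def apply_change(replica, event):
    op = event["op"]
    if op in ("c", "u"):                 # create / update: upsert after-image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":                      # delete: drop by before-image key
        replica.pop(event["before"]["id"], None)
    return replica
```

In a real deployment the events would arrive on a Kafka topic (keyed by primary key so updates to one row stay ordered) rather than being applied directly.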
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta, Databricks
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train it.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
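The time-travel idea described above can be sketched with a toy versioned store (illustrative only; Delta implements this with a transaction log over Parquet files): every commit produces a new immutable version, and reads can target any past version number.

```python
# Toy versioned dataset: each commit appends an immutable snapshot, so a
# training run can later be reproduced "as of" the version it first saw.
class VersionedDataset:
    def __init__(self):
        self._versions = []              # version i -> immutable snapshot

    def commit(self, rows):
        self._versions.append(tuple(rows))
        return len(self._versions) - 1   # version number of this commit

    def read(self, version=None):
        # Default: latest version; pass a number to time-travel.
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])
```

Recording the version number alongside a trained model is what makes the experiment reproducible later, even after new feature data has been appended.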
Apache Camel v3, Camel K and Camel Quarkus, Claus Ibsen
In this session, we will explore key challenges with function interactions and coordination, addressing these problems using Enterprise Integration Patterns (EIP) and modern approaches with the latest innovations from the Apache Camel community:
Apache Camel is the Swiss army knife of integration, and the most powerful integration framework. In this session you will hear about the latest features in the brand new 3rd generation.
Camel K is a lightweight integration platform that enables Enterprise Integration Patterns to be used natively on any Kubernetes cluster. When combined with Knative, a framework that adds serverless building blocks to Kubernetes, and the subatomic execution environment of Quarkus, Camel K can mix serverless features such as auto-scaling, scale-to-zero, and event-based communication with the outstanding integration capabilities of Apache Camel.
- Apache Camel 3
- Camel K
- Camel Quarkus
We will show how Camel K works. We’ll also use examples to demonstrate how Camel K makes it easier to connect to cloud services or enterprise applications using some of the 300 components that Camel provides.
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also cover best practices for running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos user authentication, and pluggable authorization. Kafka now supports authenticating users and controlling who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
An introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from the perspective of MySQL/PHP users. Given for 2nd-year students of the professional bachelor in ICT at Kaho St. Lieven, Gent.
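The usual way memcached sits in front of MySQL is the cache-aside pattern, sketched here in Python (the dict-based class below stands in for a real memcached client; names and TTLs are illustrative): check the cache first, fall back to the database on a miss, then populate the cache with a TTL.

```python
import time

# Stand-in for a memcached client: get/set with per-key expiry.
class FakeMemcached:
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if time.time() < expires else None

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)

def get_user(cache, db, user_id):
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:                 # cache miss: read from the DB
        user = db[user_id]
        cache.set(key, user, ttl=60)
    return user
```

The trade-off this exposes is exactly the one the talk discusses for PHP/MySQL stacks: reads get cheap, but cached data can be stale until the TTL expires or the entry is invalidated on write.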
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level topics, and can be used as an introduction to Apache Spark.
In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular.
- Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
- Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism.
- One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.
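The compaction feature in the last bullet can be sketched in a few lines (a simplification of what Kafka's log cleaner does in the background): for a topic of keyed updates, compaction retains at least the latest record per key, so the log converges toward a snapshot of current state instead of growing without bound.

```python
# Toy log compaction: keep only the latest record per key, preserving
# the surviving records' original log order.
def compact(log):
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)        # later records win
    survivors = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in survivors]
```

This is why a compacted topic works as a changelog for updatable database records: replaying it from the beginning rebuilds the latest value for every key.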
Data Quality With or Without Apache Spark and Its Ecosystem, Databricks
A few solutions exist in the open-source community, either as libraries or as complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking one of the available options: Apache Griffin, Deequ, DDQ and Great Expectations. In this presentation we compare these open-source products across dimensions such as maturity, documentation, extensibility, and features like data profiling and anomaly detection.
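All of these tools share the same core shape, which can be sketched in plain Python (function and constraint names below are made up for illustration; Deequ and Great Expectations expose much richer APIs): declare expectations as predicates, run them over the data, and collect violations.

```python
# Minimal expectation-style data quality check: each expectation is a
# named predicate over a row; the checker reports how many rows violate
# each one.
def check_expectations(rows, expectations):
    failures = []
    for name, predicate in expectations.items():
        bad = [r for r in rows if not predicate(r)]
        if bad:
            failures.append((name, len(bad)))
    return failures

# Example constraints for an "age" column: completeness and range.
expectations = {
    "age_not_null": lambda r: r.get("age") is not None,
    "age_in_range": lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
}
```

The products compared in the talk differ mainly in what surrounds this loop: distributed execution on Spark, profiling to suggest constraints, and anomaly detection on metrics over time.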
Invited Netflix talk: JVM issues in the age of scale! We take an under-the-hood look at Java locking, the memory model, overheads, serialization, UUIDs, GC tuning, CMS, and ParallelGC.
Troubleshooting Complex Performance Issues – Oracle SEG$ contention, Tanel Poder
From Tanel Poder's Troubleshooting Complex Performance Issues series - an example of Oracle SEG$ internal segment contention due to some direct path insert activity.
Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
Operating and Supporting Delta Lake in Production, Databricks
Delta Lake is widely adopted, and there are things to be aware of when dealing with petabytes of data in it. Smart decisions can deliver the best efficiency and increase the adoption of Delta. Best practices like OPTIMIZE and ZORDER have to be chosen wisely. We have support stories where we successfully resolved performance issues by applying the right performance strategy. There is a set of common issues and repeated questions that our strategic customers face when using Delta, and in this session we cover them and how to address them.
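The intuition behind ZORDER can be shown with a small sketch (this is the underlying bit-interleaving idea, not Delta's implementation): interleaving the bits of several column values into one sort key keeps rows that are close in the multi-dimensional space close in file order, which lets data skipping prune more files.

```python
# Z-order (Morton) key for two non-negative integer column values:
# interleave their bits, x on even positions and y on odd positions.
def z_order_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bits from x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bits from y
    return key
```

Sorting and file-splitting by this key is why a filter on either column (not just the leading one, as with a plain sort) can skip most files.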
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ..., Amazon Web Services
Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and Accelerated Computing (GPU and FPGA) instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
The Post-Release Technologies of Crysis 3 (Slides Only) – Stewart Needham
For AAA games there is now a consumer expectation that the developer has a post-release strategy. This strategy goes beyond just DLC content: users expect to receive bug fixes, balancing updates, game-mode variations and constant tuning of the game experience. So how can you architect your game technology to facilitate all of this? Stewart explains the unique patching system developed for Crysis 3 multiplayer, which allowed the team to hot-patch pretty much any asset or data used by the game. He also details the supporting telemetry, server and testing infrastructure required to support this, along with some interesting lessons learned.
OSBConf 2015 | Using AWS Virtual Tape Library as storage for Bacula/Bareos by..., NETWAYS
How to set up the Amazon Web Services Virtual Tape Library Storage Gateway on-premises to cache and buffer Bacula backups to S3 and Glacier.
The VTL service behaves like a tape library connected via iSCSI, and we can set up a Bacula Storage Daemon to write our backups seamlessly to S3 backed virtual tapes.
(from the article at CAPSiDE Labs:)
http://capside.com/labs/using-aws-virtual-tape-library-vtl-storage-bacula-amazon-web-services-howto/
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day, Phil Estes
A talk given at Open Container Day at O'Reilly's OSCON convention in Austin, Texas on May 9th, 2017. This talk describes an open source project, bucketbench, which can be used to compare performance, stability, and throughput of various container engines. Bucketbench currently supports docker, containerd, and runc, but can be extended to support any container runtime. This work was done in response to performance investigations by the Apache OpenWhisk team in using containers as the execution vehicle for functions in their "Functions-as-a-Service" runtime. Find out more about bucketbench here: https://github.com/estesp/bucketbench
Petabyte search at scale: understand how DataStax Enterprise search enables complex real-time multi-dimensional queries on massive datasets. This talk will cover when and why to use DSE search, best practices, data modeling and performance tuning/optimization. Also covered will be a deep dive into how DSE Search operates, and the fundamentals of bitmap indexing.
(DAT402) Amazon RDS PostgreSQL: Lessons Learned & New Features, Amazon Web Services
Learn the specifics of Amazon RDS for PostgreSQL's capabilities and the extensions that make it powerful. This session begins with a brief overview of the RDS PostgreSQL service and how it provides high availability and durability, then deep dives into the new features released since re:Invent 2014, including major version upgrades and newly added PostgreSQL extensions for RDS PostgreSQL. During the session, we will also discuss lessons learned from running a large fleet of PostgreSQL instances, including specific recommendations. In addition, we will present benchmarking results looking at differences between the 9.3, 9.4 and 9.5 releases.
The Internet of Things (IoT) is a revolutionary concept that connects everyday objects and devices to the internet, enabling them to communicate, collect, and exchange data. Imagine a world where your refrigerator notifies you when you’re running low on groceries, or streetlights adjust their brightness based on traffic patterns – that’s the power of IoT. In essence, IoT transforms ordinary objects into smart, interconnected devices, creating a network of endless possibilities.
Here is a blog on the role of electrical and electronics engineers in IoT. Let's dig in!
For more such content visit: https://nttftrg.com/
6th International Conference on Machine Learning & Applications (CMLA 2024), ClaraZara1
The 6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of machine learning.
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS..., ssuser7dcef0
Power plants release a large amount of water vapor into the atmosphere through the stack. The flue gas can be a potential source for obtaining much-needed cooling water for a power plant. If a power plant could recover and reuse a portion of this moisture, it could reduce its total cooling water intake requirement. One of the most practical ways to recover water from flue gas is to use a condensing heat exchanger. The power plant could also recover latent heat due to condensation, as well as sensible heat due to lowering the flue gas exit temperature. Additionally, harmful acids released from the stack can be reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated phenomenon, since heat and mass transfer of water vapor and various acids occur simultaneously in the presence of noncondensable gases such as nitrogen and oxygen. The design of a condenser depends on knowledge and understanding of the heat and mass transfer processes. A computer program for numerical simulations of water (H2O) and sulfuric acid (H2SO4) condensation in a flue gas condensing heat exchanger was developed using MATLAB. Governing equations based on mass and energy balances for the system were derived to predict variables such as flue gas exit temperature, cooling water outlet temperature, and mole fractions and condensation rates of water and sulfuric acid vapors. The equations were solved using an iterative solution technique with calculations of heat and mass transfer coefficients and physical properties.
Water billing management system project report.pdf, Kamal Acharya
Our project, entitled “Water Billing Management System”, aims to generate water bills with all the charges and penalties. The manual system currently employed is extremely laborious and quite inadequate; it only makes the process more difficult.
The aim of our project is to develop a system that partially computerizes the work performed in the Water Board, such as generating the monthly water bill, recording the units of water consumed, and storing records of customers and previously unpaid bills.
We used HTML/PHP for the front end and MySQL for the back end. HTML is primarily a visual design environment: we create the application by designing the forms that make up the user interface, then add code to the forms and to objects on them such as buttons and text boxes, along with any required support code in additional modules.
MySQL is a free, open-source database that facilitates effective management of databases by connecting them to the software. It is a stable, reliable and powerful solution with advanced features and advantages, such as data security.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL), MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL), a government-owned company of Bangladesh Chemical Industries Corporation under the Ministry of Industries.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
CW RADAR, FMCW RADAR, FMCW ALTIMETER, AND THEIR PARAMETERS, veerababupersonal22
It covers CW radar, FMCW radar, range measurement, the IF amplifier, and the FMCW altimeter. The CW radar operates using continuous-wave transmission, while the FMCW radar employs frequency-modulated continuous-wave technology. Range measurement is a crucial aspect of radar systems, providing information about the distance to a target. The IF amplifier plays a key role in signal processing, amplifying intermediate-frequency signals for further analysis. The FMCW altimeter uses frequency-modulated continuous-wave technology to accurately measure altitude above a reference point.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
2. MegaEase
Self Introduction
l 20+ years of working experience in large-scale distributed system architecture and development. Familiar with Cloud Native computing and high-concurrency / high-availability architecture solutions.
l Working Experiences
l MegaEase – Cloud Native Software products as Founder
l Alibaba – AliCloud, Tmall as principal software engineer.
l Amazon – Amazon.com as senior software manager.
l Thomson Reuters – Real-time system software development Manager.
l IBM Platform – Distributed computing system as software engineer.
Weibo: @左耳朵耗子
Twitter: @haoel
Blog: http://coolshell.cn/
4. MegaEase
Understanding Time Series Data
l Data scheme
l identifier -> (t0, v0), (t1, v1), (t2, v2), (t3, v3), ....
l Prometheus Data Model
l <metric name>{<label name>=<label value>, ...}
l Typical set of series identifiers
l {__name__=“requests_total”, path=“/status”, method=“GET”, instance=”10.0.0.1:80”} @1434317560938 94355
l {__name__=“requests_total”, path=“/status”, method=“POST”, instance=”10.0.0.3:80”} @1434317561287 94934
l {__name__=“requests_total”, path=“/”, method=“GET”, instance=”10.0.0.2:80”} @1434317562344 96483
l Query
l __name__=“requests_total” - selects all series belonging to the requests_total metric.
l method=~“PUT|POST” - selects all series whose method is PUT or POST (regex match)
Metric Name Labels Timestamp Sample Value
Key - Series Value - Sample
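The key/value model above can be sketched as a simple in-memory structure. This is a hypothetical illustration (the `seriesKey` helper and `sample` type are not Prometheus's actual implementation); the point is that a series is identified by its sorted label set, with the metric name stored as the reserved `__name__` label:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identifier from a label set by sorting the
// label names; the metric name is just the reserved label __name__.
func seriesKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", n, labels[n]))
	}
	return "{" + strings.Join(parts, ", ") + "}"
}

// A sample is a (timestamp, value) pair appended to a series.
type sample struct {
	t int64
	v float64
}

func main() {
	db := map[string][]sample{}
	key := seriesKey(map[string]string{
		"__name__": "requests_total", "path": "/status", "method": "GET",
	})
	db[key] = append(db[key], sample{1434317560938, 94355})
	fmt.Println(key, db[key])
}
```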
6. MegaEase
The Fundamental Problem
l Storage problem
l HDD/IDE – physically spinning, so random reads are slow
l SSD – write amplification
l Queries are much more complicated than writes
l A time-series query can cause random reads
l Ideal Write
l Sequential writes
l Batched writes
l Ideal Read
l The same time series should be read sequentially
8. MegaEase
Prometheus Solution (v1.x “V2”)
l One file per time series
l Batch up 1KiB chunks in memory
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series A
└──────────┴─────────┴─────────┴─────────┴─────────┘
┌──────────┬─────────┬─────────┬─────────┬─────────┐ series B
└──────────┴─────────┴─────────┴─────────┴─────────┘
. . .
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┐ series XYZ
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
chunk 1 chunk 2 chunk 3 ...
l Dark Sides
l Chunks are held in memory; they can be lost if the application or node crashes
l With several million files, inodes run out
l When several thousand chunks need to be persisted, disk I/O becomes very busy
l Keeping so many files open for I/O causes very high latency
l Old data needs to be cleaned up, which causes SSD write amplification
l Very high CPU/memory/disk resource consumption
9. MegaEase
Series Churn
l Definition
l Some time series become INACTIVE
l Some time series become ACTIVE
l Reasons
l Rolling updates of a number of microservices
l Kubernetes scaling the services
series
^
│ . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . . .
│ . . . . . .
│ . . . . . .
│ . . . . .
│ . . . . .
│ . . . . .
v
<-------------------- time --------------------->
11. MegaEase
Fundamental Design – V3
l Storage Layout
l 01XXXXXXX- is a data block
l ULID - like UUID but lexicographically sortable and encoding the creation time
l chunks directory
l contains the raw chunks of data points for various series (like “V2”)
l No longer a single file per series
l index – index of the data
l Lots of black magic to find data by labels
l meta.json – human-readable metadata
l the state of our storage and the data it contains
l tombstones
l Deleted data is recorded in this file instead of being removed from the chunk files
l wal – Write-Ahead Log
l The WAL segments are truncated into a “checkpoint.X” directory
l chunks_head – in-memory data
l Notes
l Data is persisted to disk every 2 hours
l The WAL is used for data recovery
l 2-hour blocks make time-range queries efficient
$ tree ./data
./data
├── 01BKGV7JBM69T2G1BGBGM6KB12
│ ├── chunks
│ │ ├── 000001
│ │ ├── 000002
│ │ └── 000003
│ ├── index
│ └── meta.json
├── 01BKGTZQ1SYQJTR4PB43C8PD98
│ ├── chunks
│ │ └── 000001
│ ├── index
│ └── meta.json
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K
│ ├── chunks
│ │ └── 000001
│ ├── index
│ ├── tombstones
│ └── meta.json
├── chunks_head
│ └── 000001
└── wal
├── 000000003
└── checkpoint.00000002
├── 00000000
└── 00000001
https://github.com/prometheus/prometheus/blob/release-2.25/tsdb/docs/format/README.md
File Format
12. MegaEase
Blocks – Little Database
l Partition the data into non-overlapping blocks
l Each block acts as a fully independent database
l containing all time series data for its time window
l it has its own index and set of chunk files
l Every block of data is immutable
l Only the current block can be appended to
l All new data is written to an in-memory database
l To prevent data loss, a temporary WAL is also written
t0 t1 t2 t3 now
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌────────────┐
│ │ │ │ │ │ │ │ ┌────────────┐
│ block │ │ block │ │ block │ │ chunk_head │ <─── write ────┤ Prometheus │
│ │ │ │ │ │ │ │ └────────────┘
└───────────┘ └───────────┘ └───────────┘ └────────────┘ ^
└──────────────┴───────┬──────┴──────────────┘ │
│ query
│ │
merge ─────────────────────────────────────────────────┘
14. MegaEase
New Design’s Benefits
l Good for querying a time range
l we can easily ignore all data blocks outside of this range.
l It trivially addresses the problem of series churn by reducing the set of inspected data to begin with
l Good for disk writes
l When completing a block, we can persist the data from our in-memory database by sequentially writing just
a handful of larger files.
l Keeps the good property of V2 that recent chunks, which are queried most, are always hot in memory
l Flexible for chunk size
l We can pick any size that makes the most sense for the individual data points and chosen compression
format.
l Deleting old data becomes extremely cheap and instantaneous.
l We merely have to delete a single directory. Remember, in the old storage we had to analyze and re-write
up to hundreds of millions of files, which could take hours to converge.
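The "ignore blocks outside the range" property can be sketched as follows. The `block` type here is hypothetical; real Prometheus reads each block's minTime/maxTime from its meta.json, but the selection logic is the same idea:

```go
package main

import "fmt"

// block covers the half-open time range [minT, maxT).
type block struct {
	minT, maxT int64
}

// overlapping returns the blocks whose time range intersects [start, end).
// Blocks outside the query window are skipped entirely, which is what makes
// range queries cheap and bounds the impact of series churn.
func overlapping(blocks []block, start, end int64) []block {
	var out []block
	for _, b := range blocks {
		if b.minT < end && b.maxT > start {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	blocks := []block{{0, 2}, {2, 4}, {4, 6}, {6, 8}}
	fmt.Println(overlapping(blocks, 3, 5)) // blocks {2,4} and {4,6}
}
```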
15. MegaEase
Chunk-head
l A chunk is cut when it
l fills up to 120 samples, or
l spans 2 hours (by default)
l Since Prometheus v2.19
l not all chunks are stored in memory
l when a chunk is cut, it is flushed to disk and memory-mapped (mmap)
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
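The cut condition can be sketched with the default thresholds from the slide. The `shouldCut` helper is a hypothetical simplification, not Prometheus's actual code:

```go
package main

import "fmt"

const (
	maxSamplesPerChunk = 120
	chunkRangeMs       = 2 * 60 * 60 * 1000 // 2 hours in milliseconds
)

// shouldCut reports whether the head chunk must be cut before appending a
// sample at time t: either the sample limit is hit or the chunk's time
// range is full.
func shouldCut(numSamples int, chunkStart, t int64) bool {
	return numSamples >= maxSamplesPerChunk || t-chunkStart >= chunkRangeMs
}

func main() {
	fmt.Println(shouldCut(120, 0, 1000))          // true: sample limit hit
	fmt.Println(shouldCut(10, 0, chunkRangeMs+1)) // true: range exceeded
	fmt.Println(shouldCut(10, 0, 1000))           // false
}
```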
16. MegaEase
Chunk head → Block
https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block/
l After some time, the chunks in the head reach the threshold
l When the chunks span 3 hours
l the first 2 hours of chunks (1, 2, 3, 4) are compacted into a block
l Meanwhile
l the WAL is truncated at this point
l and a “checkpoint” is created
17. MegaEase
Large file with “mmap”
l mmap stands for memory-mapped files. It is a way to read and write files without invoking system calls.
l It is great when multiple processes access data from the same file in a read-only fashion
l It allows all those processes to share the same physical memory pages, saving a lot of memory
l It also allows the operating system to optimize paging operations
[Diagram: a user process accessing file data through the kernel page cache via mmap, via read/write system calls, or bypassing the page cache with Direct I/O]
Why mmap is faster than system calls
https://sasha-f.medium.com/why-mmap-is-faster-than-system-calls-24718e75ab37
18. MegaEase
Write-Ahead Log(WAL)
l Widely used in relational databases to provide durability (the D in ACID)
l Persists every state change as a command appended to an append-only log
https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html
l Store each state change as a command
l A single log is appended sequentially
l Each log entry is given a unique identifier
l Roll the logs as Segmented Log
l Clean the log with Low-Water Mark
l Snapshot based (Zookeeper & ETCD)
l Time based (Kafka)
l Support Singular Update Queue
l A work queue
l A single thread
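The pattern above can be sketched as a minimal in-memory WAL: append-only entries with monotonically increasing sequence numbers, truncated at a low-water mark. This is a hypothetical illustration of the general pattern, not Prometheus's actual on-disk WAL format:

```go
package main

import "fmt"

type entry struct {
	seq  uint64
	data []byte
}

type wal struct {
	nextSeq uint64
	entries []entry
}

// Append persists a state change as a command; entries get monotonically
// increasing sequence numbers and are only ever appended.
func (w *wal) Append(data []byte) uint64 {
	w.nextSeq++
	w.entries = append(w.entries, entry{w.nextSeq, data})
	return w.nextSeq
}

// TruncateBefore drops entries below the low-water mark, e.g. after a
// snapshot/checkpoint has made them redundant.
func (w *wal) TruncateBefore(lowWaterMark uint64) {
	i := 0
	for i < len(w.entries) && w.entries[i].seq < lowWaterMark {
		i++
	}
	w.entries = w.entries[i:]
}

func main() {
	w := &wal{}
	w.Append([]byte("series A"))
	w.Append([]byte("sample A t=1 v=2"))
	w.Append([]byte("sample A t=2 v=3"))
	w.TruncateBefore(3)        // a checkpoint covered seq 1-2
	fmt.Println(len(w.entries)) // 1
}
```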
19. MegaEase
Prometheus WAL & Checkpoint
l WAL Records - includes the Series and their corresponding Samples.
l The Series record is written only once when we see it for the first time
l The Samples record is written for all write requests that contain a sample.
l WAL Truncation - Checkpoints
l Drops all the series records for series which are no longer in the Head
l Drops all the samples which are before time T
l Drops all the tombstone records for time ranges before T
l Retains the remaining series, samples, and tombstone records in the same order as they appear in the WAL
l WAL Replay
l Replaying the “checkpoint.X”
l Replaying the WAL X+1, X+2,… X+N
l WAL Compression
l WAL records are compressed with Snappy – lightweight rather than heavy compression
l Snappy was developed by Google based on LZ77
l It aims for very high speeds and reasonable compression, not maximum compression or compatibility
l It is widely used by many databases – Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, InfluxDB, …
Source Code : https://github.com/prometheus/prometheus/tree/master/tsdb/wal
data
└── wal
├── 000000
├── 000001
├── 000002
├── 000003
├── 000004
└── 000005
data
└── wal
├── checkpoint.000003
| ├── 000000
| └── 000001
├── 000004
└── 000005
20. MegaEase
Block Compaction
l Problem
l When querying multiple blocks, we have to merge their results into an overall result.
l A week-long query has to merge 80+ partial results from 2-hour blocks.
l Compaction
t0 t1 t2 t3 t4 now
┌────────────┐ ┌──────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 │ │ 3 │ │ 4 │ │ 5 mutable │ before
└────────────┘ └──────────┘ └───────────┘ └───────────┘ └───────────┘
┌─────────────────────────────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 compacted │ │ 4 │ │ 5 mutable │ after (option A)
└─────────────────────────────────────────┘ └───────────┘ └───────────┘
┌──────────────────────────┐ ┌──────────────────────────┐ ┌───────────┐
│ 1 compacted │ │ 3 compacted │ │ 5 mutable │ after (option B)
└──────────────────────────┘ └──────────────────────────┘ └───────────┘
21. MegaEase
Retention
l Example
l Block 1 can be deleted safely; block 2 has to be kept until it is fully behind the boundary.
l Block Compaction impacts
l Block compaction could make a block too large to delete in time.
l We need to limit the block size:
Maximum block size = 10% * retention window
|
┌────────────┐ ┌────┼─────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│ 1 │ │ 2 | │ │ 3 │ │ 4 │ │ 5 │ . . .
└────────────┘ └────┼─────┘ └───────────┘ └───────────┘ └───────────┘
|
|
retention boundary
26. MegaEase
Index
l Using an inverted index for the label index
l Allocate a unique ID for every series
l Look up a series by this ID in O(1) time
l This ID is the forward index
l Construct the labels’ index
l If the series with IDs {2, 5, 10, 29} contain app=“nginx”
l then the list {2, 5, 10, 29} is the inverted index for the label app=“nginx”
l In Short
l The number of labels is significantly smaller than the number of series
l Walking through all of the labels is not a problem
{
__name__=”requests_total”,
pod=”nginx-34534242-abc723”,
job=”nginx”,
path=”/api/v1/status”,
status=”200”,
method=”GET”,
}
status=”200”: 1 2 5 ...
method=”GET”: 2 3 4 5 6 9 ...
ID : 5
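The forward/inverted index pair can be sketched as two maps. The `index` type below is a hypothetical simplification (Prometheus's real index is an on-disk format with postings lists), but it shows the structure:

```go
package main

import "fmt"

// index holds a forward index (series ID -> label set) and an inverted
// index ("name=value" -> sorted list of series IDs, a.k.a. postings).
type index struct {
	forward  map[uint64]map[string]string
	inverted map[string][]uint64
}

func newIndex() *index {
	return &index{
		forward:  map[uint64]map[string]string{},
		inverted: map[string][]uint64{},
	}
}

// add registers a series; IDs are assumed to be assigned in increasing
// order, so each postings list stays sorted by construction.
func (ix *index) add(id uint64, labels map[string]string) {
	ix.forward[id] = labels
	for n, v := range labels {
		key := n + "=" + v
		ix.inverted[key] = append(ix.inverted[key], id)
	}
}

func main() {
	ix := newIndex()
	ix.add(2, map[string]string{"app": "nginx", "method": "GET"})
	ix.add(5, map[string]string{"app": "nginx", "method": "POST"})
	ix.add(7, map[string]string{"app": "foo", "method": "GET"})
	fmt.Println(ix.inverted["app=nginx"])  // [2 5]
	fmt.Println(ix.inverted["method=GET"]) // [2 7]
}
```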
27. MegaEase
Sets Operation
l Consider the following query:
l app=“foo” AND __name__=“requests_total”
l How do we intersect two inverted index lists?
l A classic algorithm interview question
l Given two integer arrays, return their intersection.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l returns { 2, 9 } as their intersection
l Given two integer arrays, return their union.
l A[] = { 4, 1, 6, 7, 3, 2, 9 }
l B[] = { 11, 30, 2, 70, 9 }
l returns { 4, 1, 6, 7, 3, 2, 9, 11, 30, 70 } as their union
l Time: O(m*n) with no extra space
28. MegaEase
Sort The Array
l If we sort the arrays
__name__="requests_total" -> [ 999, 1000, 1001, 2000000, 2000001, 2000002, 2000003 ]
app="foo" -> [ 1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002 ]
intersection => [ 1000, 1001 ]
l We get an efficient algorithm
l O(m+n): two pointers, one per array
// two-pointer intersection of two sorted ID lists a and b
while (idx1 < len1 && idx2 < len2) {
if (a[idx1] > b[idx2]) {
idx2++
} else if (a[idx1] < b[idx2]) {
idx1++
} else {
c = append(c, a[idx1])
idx1++
idx2++
}
}
return c
l Series IDs must be easy to sort; using MD5 or UUID is not a good idea (V2 used hash IDs)
l Deleting data could force an index rebuild
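The two-pointer loop on this slide can be written as a complete Go function, using the sorted postings lists from the earlier example:

```go
package main

import "fmt"

// intersect returns the IDs present in both sorted posting lists.
// Because both inputs are sorted, one pass with two cursors suffices: O(m+n).
func intersect(a, b []uint64) []uint64 {
	var c []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] > b[j]:
			j++
		case a[i] < b[j]:
			i++
		default:
			c = append(c, a[i])
			i++
			j++
		}
	}
	return c
}

func main() {
	name := []uint64{999, 1000, 1001, 2000000, 2000001, 2000002, 2000003}
	app := []uint64{1, 3, 10, 11, 12, 100, 311, 320, 1000, 1001, 10002}
	fmt.Println(intersect(name, app)) // [1000 1001]
}
```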
31. MegaEase
Benchmark – CPU
l CPU usage in cores/second
l Prometheus 2.0 needs 3-10 times fewer CPU resources.
32. MegaEase
Benchmark – Disk Writes
l Disk writes in MB/second
l Prometheus 2.0 saves 97-99% of disk writes
l Prometheus 1.5 is prone to wearing out SSDs
33. MegaEase
Benchmark – Query Latency
l Query P99 latency in seconds
l With Prometheus 1.5, query latency increases over time as more series are stored.
35. MegaEase
Gorilla Requirements
l 2 billion unique time series identified by a string key.
l 700 million data points (time stamp and value) added per minute.
l Store data for 26 hours.
l More than 40,000 queries per second at peak.
l Reads succeed in under one millisecond.
l Support time series with 15 second granularity (4 points per minute per time series).
l Two in-memory, not co-located replicas (for disaster recovery capacity).
l Always serve reads even when a single server crashes.
l Ability to quickly scan over all in memory data.
l Support at least 2x growth per year.
85% of queries are for the latest 26 hours of data
36. MegaEase
Key Technology
l Simple Data Model – (string key, int64 timestamp, double value)
l In memory – low latency
l High Data Compression Ratio – saves 90% of space
l Cache first, then disk – accepts possible data loss
l Stateless – easy to scale
l Hash(key) → shard → node
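The routing step can be sketched as a pure function of the series key. This is a hypothetical stateless router using FNV hashing for illustration; the Gorilla paper describes the sharding scheme but not this exact code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a series key to one of n shards. Because the mapping is a
// pure function of the key, any stateless front-end can route a request
// to the right node without coordination.
func shardFor(key string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	for _, key := range []string{"requests_total", "cpu_usage", "mem_free"} {
		fmt.Println(key, "-> shard", shardFor(key, 8))
	}
}
```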
37. MegaEase
Fundamental
l Delta Encoding (aka Delta Compression)
l https://en.wikipedia.org/wiki/Delta_encoding
l Examples
l HTTP RFC 3229 “Delta encoding in HTTP”
l rsync - Delta file copying
l Online backup
l Version Control
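The fundamental idea can be shown with a minimal delta encoder/decoder (an illustrative sketch, not any of the implementations listed above):

```go
package main

import "fmt"

// deltaEncode stores the first value as-is and every later value as the
// difference from its predecessor; for slowly changing series the deltas
// are small numbers that compress well.
func deltaEncode(xs []int64) []int64 {
	out := make([]int64, len(xs))
	for i, x := range xs {
		if i == 0 {
			out[i] = x
		} else {
			out[i] = x - xs[i-1]
		}
	}
	return out
}

// deltaDecode reverses the encoding by accumulating the deltas.
func deltaDecode(ds []int64) []int64 {
	out := make([]int64, len(ds))
	var acc int64
	for i, d := range ds {
		acc += d
		out[i] = acc
	}
	return out
}

func main() {
	ts := []int64{1000, 1060, 1120, 1181}
	fmt.Println(deltaEncode(ts))              // [1000 60 60 61]
	fmt.Println(deltaDecode(deltaEncode(ts))) // [1000 1060 1120 1181]
}
```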
39. MegaEase
Compression Algorithm
Compress Timestamp
D = (t(n) − t(n−1)) − (t(n−1) − t(n−2))
l D = 0, then store a single ‘0’ bit
l D = [-63, 64], ‘10’ : value (7 bits)
l D = [-255, 256], ‘110’ : value (9 bits)
l D = [-2047, 2048], ‘1110’ : value (12 bits)
l Otherwise store ‘1111’ : D (32 bits)
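The bucket table above can be sketched as a small selector function (computing only the control-bit prefix and payload width, not a full bit-stream writer):

```go
package main

import "fmt"

// ddBits returns the control-bit prefix and payload width used to store a
// delta-of-delta D, following the bucket table from the Gorilla paper.
func ddBits(d int64) (prefix string, payloadBits int) {
	switch {
	case d == 0:
		return "0", 0
	case d >= -63 && d <= 64:
		return "10", 7
	case d >= -255 && d <= 256:
		return "110", 9
	case d >= -2047 && d <= 2048:
		return "1110", 12
	default:
		return "1111", 32
	}
}

func main() {
	// Regular 15s scrapes: t(n) - t(n-1) is constant, so D is usually 0
	// and each timestamp costs a single bit.
	t := []int64{1000, 1015, 1030, 1046}
	for i := 2; i < len(t); i++ {
		d := (t[i] - t[i-1]) - (t[i-1] - t[i-2])
		p, n := ddBits(d)
		fmt.Println(d, p, n)
	}
}
```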
Compress Values (Double float)
X = 𝑽𝒊 ^ 𝑽𝒊"𝟏
l X = 0, then store a single ‘0’ bit
l X != 0,
l First, count the leading zeros and trailing zeros of the XOR value. Store ‘1’ as the first control bit; the second control bit is:
l ‘0’ if the counts of leading and trailing zeros are the same as for the previous XOR value; then store the meaningful XOR bits (the XOR value with leading and trailing zeros stripped) directly after it.
l ‘1’ if the counts differ from the previous XOR value; then store 5 bits for the number of leading zeros, 6 bits for the length of the meaningful XOR bits, and finally the meaningful XOR bits themselves (this case costs at least 13 bits of overhead).
41. MegaEase
Open Source Implementation
l Golang
l https://github.com/dgryski/go-tsz
l Java
l https://github.com/burmanm/gorilla-tsc
l https://github.com/milpol/gorilla4j
l Rust
l https://github.com/jeromefroe/tsz-rs
l https://github.com/mheffner/rust-gorilla-tsdb
43. MegaEase
Reference
l Writing a Time Series Database from Scratch by Fabian Reinartz
https://fabxc.org/tsdb/
l Gorilla: A Fast, Scalable, In-Memory Time Series Database
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
l TSDB format
https://github.com/prometheus-junkyard/tsdb/blob/master/docs/format/README.md
l PromCon 2017: Storing 16 Bytes at Scale - Fabian Reinartz
l video: https://www.youtube.com/watch?v=b_pEevMAC3I
l slides: https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
l Ganesh Vernekar Blog - Prometheus TSDB
l (Part 1): The Head Block https://ganeshvernekar.com/blog/prometheus-tsdb-the-head-block
l (Part 2): WAL and Checkpoint https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint
l (Part 3): Memory Mapping of Head Chunks from Disk https://ganeshvernekar.com/blog/prometheus-tsdb-mmapping-head-chunks-from-disk
l (Part 4): Persistent Block and its Index https://ganeshvernekar.com/blog/prometheus-tsdb-persistent-block-and-its-index
l (Part 5): Queries https://ganeshvernekar.com/blog/prometheus-tsdb-queries
l Time-series compression algorithms, explained
l https://blog.timescale.com/blog/time-series-compression-algorithms-explained/