Link to the full talk - https://youtu.be/2Rf5t2Eh6IQ
https://go.dok.community/slack
https://dok.community
ABSTRACT OF THE TALK
This talk will provide a high-level overview of Kubernetes, Helm charts, and how they can be used to deploy Apache Druid clusters of any size.
We'll review how Kubernetes functionality enables resilience and self-healing, how node group affinity supports historical tiers, how Kubernetes autoscaling of Middle Managers optimizes ingestion capacity, and some of the gotchas along the way.
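To make the node-group-affinity idea concrete, here is a minimal sketch (not from the talk itself) of pinning a Druid Historical tier to a dedicated node group; the tier label, image, and names are illustrative assumptions:

```yaml
# Sketch: a Historical "hot" tier scheduled only onto nodes labeled for it.
# Label the nodes first, e.g.: kubectl label node <node> druid/tier=hot
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: druid-historical-hot
spec:
  serviceName: druid-historical-hot
  replicas: 3
  selector:
    matchLabels:
      app: druid-historical-hot
  template:
    metadata:
      labels:
        app: druid-historical-hot
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: druid/tier        # hypothetical node-group label
                    operator: In
                    values: ["hot"]
      containers:
        - name: historical
          image: apache/druid:27.0.0       # any recent Druid image
          args: ["historical"]
```

A "cold" tier would use the same template with a different label value and cheaper nodes.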
BIO
Sergio Ferragut is a database veteran turned Developer Advocate at Imply. His experience includes 16 years at Teradata in professional services and engineering roles.
He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces.
Sergio has worked at multiple technology start-ups including APL and Splice Machine where he helped guide product design and field messaging.
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes - DataWorks Summit
Apache Druid supports auto-scaling of Middle Manager nodes to handle changes in data ingestion load. On Kubernetes, this can be implemented using Horizontal Pod Autoscaling based on custom metrics exposed from the Druid Overlord process, such as the number of pending/running tasks and expected number of workers. The autoscaler scales the number of Middle Manager pods between minimum and maximum thresholds to maintain a target average load percentage.
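As a rough illustration of the pattern this abstract describes, the following sketch shows an autoscaling/v2 HPA driven by a custom per-pod metric. It assumes a custom-metrics adapter (e.g. prometheus-adapter) already publishes an Overlord-derived metric; the metric and workload names are hypothetical:

```yaml
# Sketch: scale MiddleManager pods on a custom load metric between min/max.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: druid-middlemanager
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: druid-middlemanager            # hypothetical workload name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: druid_task_slot_utilization   # hypothetical: (pending+running tasks) / capacity
        target:
          type: AverageValue
          averageValue: "700m"                # keep average load around 70%
```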
Container orchestration from theory to practice - Docker, Inc.
"Join Laura Frank and Stephen Day as they explain and examine technical concepts behind container orchestration systems, like distributed consensus, object models, and node topology. These concepts build the foundation of every modern orchestration system, and each technical explanation will be illustrated using SwarmKit and Kubernetes as a real-world example. Gain a deeper understanding of how orchestration systems work in practice and walk away with more insights into your production applications."
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Buesing | Current 2022 - HostedbyConfluent
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
Let’s start with how to run Apache Druid locally in your container-based development environment. While real-time events stream from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We then explore the performance implications when the real-time stream of events contains historical data, and the techniques to prevent those issues, leaving a high-performance analytic platform supporting both real-time and historical processing.
You’ll leave with the tools for doing real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 - Rajit Saha
At the VMware Corporate IT Data Solution and Delivery Team, we have built an Enterprise Advanced Data Analytics Platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2, and Alpine Data Labs.
This document discusses various technologies related to architectures, frameworks, infrastructure, services, data stores, analytics, logging and metrics. It covers Java 8 features like lambda expressions and method references. It also discusses microservices, Spring Boot basics and features, Gradle vs Maven, Swagger, AngularJS, Gulp, Jasmine, Karma, Nginx, CloudFront, Couchbase, Lambda Architecture, logging with Fluentd and Elasticsearch, metrics collection with Collectd and Statsd, and visualization with Graphite and Grafana.
DoK Talks #91 - Leveraging Druid Operator to manage Apache Druid on Kubernetes - DoKC
Apache Druid On Kubernetes discusses Apache Druid, an analytics database for large datasets. It introduces the Druid operator, which extends Kubernetes to deploy and manage Druid clusters using custom resources. The operator handles tasks like rolling upgrades, scaling nodes, cleaning up orphaned volumes, and integrating with tools like Helm and kubectl. It allows complex Druid clusters to be deployed and managed on Kubernetes.
Splunk: Druid on Kubernetes with Druid-operator - Imply
We went through the journey of deploying Apache Druid clusters on Kubernetes (K8s) and created a druid-operator (https://github.com/druid-io/druid-operator). This talk introduces the Druid Kubernetes operator, how to use it to deploy Druid clusters, and how it works under the hood. We will share how we use this operator to deploy Druid clusters at Splunk.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Druid is a complex stateful distributed system: a Druid cluster consists of multiple web services such as Broker, Historical, Coordinator, Overlord, and MiddleManager, each deployed with multiple replicas. Deploying a single web service on K8s requires creating a few K8s resources via YAML files, and this multiplies with the number of services inside a Druid cluster. Doing it for multiple Druid clusters (dev, staging, and production environments) makes it even more tedious and error prone.
K8s enables the creation of application-specific extensions (such as for Druid), called “Operators”, which combine Kubernetes and application-specific knowledge into a reusable K8s extension that makes deploying complex applications simple.
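For illustration, here is a trimmed sketch of the kind of custom resource this operator consumes, adapted from the examples in the linked repository; exact fields vary by operator version, so treat it as a sketch rather than a definitive spec:

```yaml
# Sketch: one Druid CR replaces many hand-written per-service YAML files.
apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: tiny-cluster
spec:
  image: apache/druid:25.0.0
  startScript: /druid.sh
  common.runtime.properties: |
    druid.metadata.storage.type=derby        # abridged; real clusters need more
  nodes:
    brokers:
      nodeType: broker
      druid.port: 8088
      replicas: 1
      runtime.properties: |
        druid.service=druid/broker
```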
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... - Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overheads because all data and intermediate state of compute tasks are stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Apache Eagle at Hadoop Summit 2016 San Jose - Hao Chen
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks and malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture, linear scalability, and use of technologies like Apache Storm and Kafka.
Matteo Moretti discusses scaling PHP applications. He covers scaling the web server, sessions, database, filesystem, asynchronous tasks, and logging. The key aspects are decoupling services, using caching, moving to external services like Redis, S3, and RabbitMQ, and allowing those services to scale automatically using techniques like auto-scaling. Sharding the database is difficult to implement and should only be done if really needed.
Getting started with MariaDB? Whether it is on your laptop or server, containers are great ephemeral vessels for your applications. But what about the data that drives your business? It must survive containers coming and going, maintain its availability and reliability, and grow when you need it.
Capacity planning is a difficult challenge faced by most companies. If you have too few machines, you will not have enough compute resources available to deal with heavy loads. On the other hand, if you have too many machines, you are wasting money. This is why companies have started investing in automatically scaling services and infrastructure to minimize the amount of wasted money and resources.
In this talk, Nathan will describe how Yelp is using PaaSTA, a PaaS built on top of open source tools including Docker, Mesos, Marathon, and Chronos, to automatically and gracefully scale services and the underlying cluster. He will go into detail about how this functionality was implemented and the design decisions that were made while architecting the system. He will also provide a brief comparison of how this approach differs from existing solutions.
Create a Varnish cluster in Kubernetes for Drupal caching - DrupalCon North A... - Ovadiah Myrgorod
Varnish is a caching proxy usually used for high-profile Drupal sites. However, configuring Varnish is not an easy task and requires a lot of work. It is even more difficult when it comes to creating a scalable cluster of Varnish nodes.
Fortunately, there is a solution. I’ve been working on the kube-httpcache project (https://github.com/mittwald/kube-httpcache), which takes care of many things such as routing, scaling, broadcasting, config reloading, etc.
If you need to run more than one instance of Varnish, this session is for you. You will learn how to:
* Launch a single instance of Varnish in Kubernetes.
* Configure Varnish for Drupal.
* Scale Varnish from 1 to N nodes as part of the cluster.
* Make your Varnish cluster resilient.
* Reload Varnish configs on the fly.
* Properly invalidate cache for multiple Varnish nodes.
This session requires some basic understanding of Docker and Kubernetes; however, I will provide some intro if you are new to it.
Join this session and enjoy!
20150704 benchmark and user experience in sahara weiting - Wei Ting Chen
Sahara provides a way to deploy and manage Hadoop clusters within an OpenStack cloud. It addresses common customer needs like providing an elastic environment for data processing jobs, integrating Hadoop with the existing private cloud infrastructure, and reducing costs. Key challenges include speeding up cluster provisioning times, supporting complex data workflows, optimizing storage architectures, and improving performance when using remote object storage.
Trend Micro uses Hadoop for processing large volumes of web data to quickly identify and block malicious URLs. They have expanded their Hadoop cluster significantly over time to support growing data and job volumes. They developed Hadooppet to automate deployment and management of their large, customized Hadoop distribution across hundreds of nodes. Profiling tools like Nagios, Ganglia and Splunk help monitor and troubleshoot cluster performance issues.
Hw09 Production Deep Dive With High Availability - Cloudera, Inc.
ContextWeb is an online advertisement company that processes large volumes of log data using Hadoop. They process up to 120GB of raw log files per day. Their Hadoop cluster consists of 40 nodes and processes around 2000 MapReduce jobs per day. They developed techniques for partitioning data by date/time and using file revisions to allow incremental processing while ensuring data consistency and freshness of reports.
The document provides steps to connect to a CloudFoundry environment and deploy a sample Predix application. It includes instructions on installing the CF CLI, logging in, listing services, creating a PostgreSQL service instance, pushing a sample app, and binding the app to the database. The steps cover common operations for deploying and managing apps on Pivotal CloudFoundry and interacting with services on Predix.
1. beyond mission critical virtualizing big data and hadoop - Chiou-Nan Chen
Virtualizing big data platforms like Hadoop provides organizations with agility, elasticity, and operational simplicity. It allows clusters to be quickly provisioned on demand, workloads to be independently scaled, and mixed workloads to be consolidated on shared infrastructure. This reduces costs while improving resource utilization for emerging big data use cases across many industries.
Automatically scaling your Kubernetes workloads - SVC201-S - Chicago AWS Summit - Amazon Web Services
As the need for computing resources has accelerated, the ways in which we compute have evolved as well. The advent of the cloud has allowed us to easily scale to suit our needs, but if we want to keep pace, we need an even more automated way to scale our infrastructure. In this session, we look at automatic scaling using Kubernetes, including how to set it up and, most importantly, what you should monitor in order to drive your scaling. This session is brought to you by AWS partner, Datadog.
The document provides an overview of NetApp's data deduplication technology and best practices for its implementation. It discusses how deduplication works by fingerprinting and reference pointers, its implementation using WAFL block sharing, and commands for enabling and monitoring deduplication. Best practices covered include using deduplication with SnapMirror, scheduling with backup data, and its applicability to VMware environments.
Sahara is an OpenStack project that provides an abstraction layer for provisioning and managing Apache Hadoop clusters and jobs in OpenStack clouds. It allows users to easily deploy and scale Hadoop clusters on demand without having to manage the underlying infrastructure. Sahara uses plugins to integrate various Hadoop distributions like Hortonworks Data Platform (HDP) and Cloudera Distribution including Apache Hadoop (CDH). It leverages other OpenStack services like Nova, Neutron, Swift, Cinder, Heat etc. to provision, configure and manage the Hadoop clusters and jobs.
Hadoop and OpenStack - Hadoop Summit San Jose 2014 - spinningmatt
This document discusses Hadoop and OpenStack Sahara. Sahara is an OpenStack project that allows users to provision and manage Hadoop clusters within OpenStack. It provides a plugin mechanism to support different Hadoop distributions like Hortonworks Data Platform (HDP). The HDP plugin fully integrates HDP clusters with Sahara using the Ambari API for cluster management. Sahara handles tasks like cluster scaling, integration with Swift for storage, and data locality. Its plugin architecture allows different Hadoop versions and distributions to be deployed and managed through Sahara.
(BDT208) A Technical Introduction to Amazon Elastic MapReduce - Amazon Web Services
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively."
The document discusses an agenda for a meetup about Redis Labs and Kubernetes operators. Key points:
- An introduction to Redis Enterprise architecture and Redis Labs products.
- A discussion of "double orchestration" using Kubernetes and PKS to manage Redis clusters for performance and resource management.
- An overview of Redis Labs' Kubernetes solution using StatefulSets, services, and a custom controller.
- An introduction to operators, how they provide lifecycle management and simplify deployments compared to static YAML files or Helm.
- Details on Redis Labs' operator development process and challenges in building idempotent APIs and handling state changes and validation in the reconciliation loop.
Prometheus and Docker (Docker Galway, November 2015) - Brian Brazil
Brian Brazil is an engineer passionate about reliable systems who has worked at Google SRE and Boxever. He discusses Prometheus, an open source monitoring system he helped create. Prometheus offers inclusive monitoring of services, is manageable and reliable, integrates easily with other tools, and provides powerful querying and dashboards. It is efficient, scalable, and helps provide visibility into systems through its data model and labeling.
Distributed Vector Databases - What, Why, and How - DoKC
Distributed Vector Databases - What, Why, and How - Steve Pousty, VMware
In the last two years, AI and machine learning have exploded in prominence. One of the key concepts used in the modeling and storage of AI is the vector. Feeling like you should learn more about vectors and how you would use them in your data work? Wondering how you would run this distributed on Kubernetes? Then have I got a talk for you! We will start by explaining the concept of (embedding) vectors and how they are used in the AI life cycle. From there we will go into putting them into a database. We will cover the use cases where this technology makes sense. As opposed to an RDBMS, vector databases are more tightly focused and optimized for particular use cases. To ground this discussion in something more concrete, there will be hands-on demos throughout the talk. You will see the advantages of running distributed vector databases on Kubernetes infrastructure. Bring your favorite Kube infrastructure and leave with hands-on experience running AI infrastructure on Kubernetes.
Is It Safe? Security Hardening for Databases Using Kubernetes Operators - DoKC
Is It Safe? Security Hardening for Databases Using Kubernetes Operators - Robert Hodges, Altinity
Thanks to the Operator Pattern, Kubernetes is now an outstanding platform to run databases. But to quote Marathon Man, "is it safe?" This talk is a top-level review of the database security problem in Kubernetes, standard ways that operators can mitigate threats, and a wallet-sized checklist of security features you should look for in any operator you use. Our talk is practical and focused on needs of Kubernetes developers. Join us!
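As one concrete example of the kind of hardening such a checklist covers, here is a sketch of a NetworkPolicy that only lets labeled client pods reach the database; the labels and port are illustrative, not from the talk:

```yaml
# Sketch: allowlist ingress to database pods from designated clients only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-ingress-allowlist
spec:
  podSelector:
    matchLabels:
      app: my-database          # hypothetical database pod label
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: db-client   # only pods carrying this label may connect
      ports:
        - protocol: TCP
          port: 5432            # example port (PostgreSQL-style)
```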
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery - DoKC
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery - Shivani Gupta, Elotl & Sergey Pronin, Percona
Disaster Recovery (DR) is critical for business continuity in the face of widespread outages taking down entire data centers or cloud provider regions. DR relies on deployment to multiple locations, data replication, monitoring for failure, and failover. The process is typically manual, involves several moving parts, and, even in the best case, involves some downtime for end users. A multi-cluster K8s control plane presents the opportunity to automate the DR setup as well as the failure detection and failover. Such automation can dramatically reduce RTO and improve availability for end users. This talk (and demo) describes one such setup using the open source Percona Operator for PostgreSQL and a multi-cluster K8s orchestrator. The orchestrator will use policy-driven placement to replicate the entire workload on multiple clusters (in different regions), detect failure using pluggable logic, and do failover processing by promoting the standby as well as redirecting application traffic.
Transforming Data Processing with Kubernetes: Journey Towards a Self-Serve Data Mesh - DoKC
Transforming Data Processing with Kubernetes: Journey Towards a Self-Serve Data Mesh - Rakesh Subramanian Suresh & Jainik Vora, Intuit
This presentation explores how Intuit uses Kubernetes with Domain-Driven Design and Data Mesh principles to transform its data processing landscape, crucial for its AI-driven expert platform. We will discuss the importance of clean data in developing robust generative artificial intelligence and how Intuit is addressing this through the creation of paved paths for data platforms running on Kubernetes. We'll examine the challenges and solutions in managing 100,000 data pipelines and 1000+ engineers interacting with data, highlighting the need for scalable solutions. We'll also discuss how Intuit uses Kubernetes to build its batch and stream processing platform, overcoming hurdles in data pipeline deployment, scheduling, orchestration, and dependency management. We'll conclude by emphasizing how this transformation, based on treating data as a product, has improved decision-making speed and accuracy across the organization and fostered a more efficient, collaborative data culture.
The State of Stateful on Kubernetes - Stateful Workloads in Kubernetes: A Deep Dive - Kaslin Fields & Michelle Au, Google
As a platform for distributed computing, Kubernetes enables users to run their workloads across machines. However data has gravity, and when workloads in Kubernetes have to share data with other applications, managing the application’s requirements can get more tricky. In this talk, we will explore what "Stateful" means from Kubernetes' perspective. We will discuss the different types of stateful workloads, and the challenges of deploying them on Kubernetes. We will also look at the features that exist in Kubernetes to support stateful workloads, as well as the features that are in the works. Key Takeaways: What is a stateful workload from Kubernetes’ perspective? What are the challenges of deploying stateful workloads on Kubernetes? What features exist in Kubernetes to support stateful workloads? What features are in the works to support stateful workloads better in the future?
Colocating Data Workloads and Web Services on Kubernetes to Improve Resource Utilization - DoKC
Colocating Data Workloads and Web Services on Kubernetes to Improve Resource Utilization - He Cao, ByteDance
Recently, more and more data workloads are running on top of Kubernetes, such as ETL processes, Spark and Flink jobs, and more. These workloads typically exhibit high resource utilization and remain relatively stable over time. In contrast, web services often exhibit tidal patterns, characterized by significant fluctuations in resource utilization. The resource model of vanilla Kubernetes is static, which can lead to low resource utilization accumulated over 24 hours. In this talk, He will introduce how ByteDance uses Katalyst to colocate data workloads and online services on Kubernetes to improve resource utilization. In addition, He will explain how Katalyst ensures the QoS of these workloads through QoS-aware scheduling, service profiling, multi-dimensional resource isolation, real-time container resource adjustment, and more. In ByteDance, Katalyst has been deployed on 500,000+ nodes with tens of millions of cores, and has improved daily resource utilization from 20% to 60%.
Make Your Kafka Cluster Production-Ready - Jakub Scholz, Red Hat
Kubernetes became the de-facto standard for running cloud-native applications. And more and more users turn to it also to run stateful applications such as Apache Kafka. While there are different tools such as Helm charts or operators which can get you quickly up and running, there is often still a long way to make sure the Kafka cluster is production-ready. This talk will take you through the main aspects you should consider for your Kafka cluster and will cover things such as resource management, storage, scheduling, rolling updates, or reliability. It will show you how to do it using the Strimzi operator, but the lessons learned will apply also to any other Kafka cluster. If you are interested in production-ready Apache Kafka on Kubernetes, this is a talk for you.
Dynamic Large Scale Spark on Kubernetes: Empowering the Community with Argo Workflows and Argo Events - DoKC
Dynamic Large Scale Spark on Kubernetes: Empowering the Community with Argo Workflows and Argo Events - Ovidiu Valeanu, AWS & Vara Bonthu, Amazon
Are you eager to build and manage large-scale Spark clusters on Kubernetes for powerful data processing? Whether you are starting from scratch or considering migrating Spark workloads from existing Hadoop clusters to Kubernetes, the challenges of configuring storage, compute, networking, and optimizing job scheduling can be daunting. Join us as we unveil the best practices to construct a scalable Spark clusters on Kubernetes, with a special emphasis on leveraging Argo Workflows and Argo Events. In this talk, we will guide you through the journey of building highly scalable Spark clusters on Kubernetes, using the most popular open-source tools. We will showcase how to harness the potential of Argo Workflows and Argo Events for event-driven job scheduling, enabling efficient resource utilization and seamless scalability. By integrating these powerful tools, you will gain better control and flexibility for executing Spark jobs on Kubernetes.
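To give a flavor of the building block involved, here is a minimal sketch of an Argo Workflow; in the setup described, an Argo Events sensor would submit workflows like this to launch Spark jobs. The image, command, and job path are placeholders, not from the talk:

```yaml
# Sketch: a one-step Workflow that runs spark-submit in a container.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-job-          # Argo appends a random suffix per run
spec:
  entrypoint: run-spark
  templates:
    - name: run-spark
      container:
        image: my-registry/spark-runner:latest   # hypothetical image
        command: ["/opt/spark/bin/spark-submit"]
        args:
          - "--master"
          - "k8s://https://kubernetes.default.svc"
          - "local:///app/job.py"                # hypothetical job file
```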
Run PostgreSQL in Warp Speed Using NVMe/TCP in the Cloud - DoKC
Run PostgreSQL in Warp Speed Using NVMe/TCP in the Cloud - Sagy Volkov, Lightbits
PostgreSQL as a SQL engine can accommodate a very high transaction rate, but as your data grows and the number of connections and queries increases, there is a challenge for the storage to keep up with the SQL engine.
To the rescue comes NVMe over TCP (or NVMe/TCP). Developed by Lightbits Labs in 2016 and donated to the Linux community, it is the next evolution of using NVMe-based storage over a TCP fabric. NVMe/TCP simplifies how you interact with remote NVMe devices (targets) and allows your PostgreSQL storage to consume fast storage very easily.
In this session I will explain the core concept of the NVMe/TCP protocol, current storage providers that can use it, how you can consume it in Kubernetes (super easy), and discuss the possibilities of using NVMe/TCP in the cloud.
The session will also include a performance comparison of a few storage options available in AWS, and even a live demo of how PostgreSQL can run super fast - warp speed fast - in AWS.
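To show what "consuming it in Kubernetes" typically looks like, here is a hedged sketch of a StorageClass and PVC backed by an NVMe/TCP-capable CSI driver; the provisioner name and parameter are assumptions for illustration, not taken from the talk:

```yaml
# Sketch: expose NVMe/TCP-backed volumes through a CSI StorageClass.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-tcp
provisioner: csi.lightbitslabs.com   # assumed CSI driver name
parameters:
  replica-count: "3"                 # hypothetical driver parameter
allowVolumeExpansion: true
---
# A PostgreSQL pod would mount a claim like this for its data directory.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgdata
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: nvme-tcp
  resources:
    requests:
      storage: 100Gi
```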
Link: https://www.youtube.com/watch?v=D8kJCvsHD9Q&list=PLHgdNuGxrJt04Fwaip9aDYvXrbRSmc5HZ&index=12
https://go.dok.community/slack
https://dok.community/
From DoK Day NA 2022 (https://www.youtube.com/watch?v=YWTa-DiVljY&list=PLHgdNuGxrJt04Fwaip9aDYvXrbRSmc5HZ)
In the software industry we’re fond of terms that define major trends, like “cloud native”, “Kubernetes native” and “serverless”. As more and more organizations move stateful workloads to Kubernetes, we’ve started to see these terms applied to data infrastructure, where they can get overtaken by marketing hype unless we work to define them.
In this talk, we’ll examine two different databases, TiDB and Apache Cassandra, in order to identify what it means for a database to be Kubernetes native and why it matters. We’ll look at points including:
- The differences between cloud native, Kubernetes native, and serverless
- How databases become Kubernetes native
- Benefits of Kubernetes native databases
- How Kubernetes can better support databases
-----
Jeff has worked as a software engineer and architect in multiple industries and as a developer advocate helping engineers get up to speed on Apache Cassandra. He's involved in multiple open source projects in the Cassandra and Kubernetes ecosystems including Stargate and K8ssandra. Jeff is the author of the O’Reilly books “Cassandra: The Definitive Guide" and “Managing Cloud Native Data on Kubernetes".
ING Data Services hosted on ICHP - DoK Amsterdam 2023 - DoKC
An explanation of how ING deals with local persistence at scale in a secure and compliant manner for Elastic and Prometheus workloads today, and other data services in the future.
In more detail, we will elaborate on the following topics:
- How we solve local persistence (see the sketch after this list)
- Type of workloads now and in the future
- Typical requirements for a banking environment
- Automation
- Scale
- Resilience
- Security / Compliance
- Service offering / demarcation
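As referenced in the list above, here is a minimal sketch of one common approach to local persistence on Kubernetes, using statically provisioned local volumes; it is a generic pattern with illustrative names, not ING's actual configuration:

```yaml
# Sketch: a StorageClass for local disks plus one statically provisioned PV.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner   # PVs are created by hand/automation
volumeBindingMode: WaitForFirstConsumer      # bind only once a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node1
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd0                    # disk mounted on the node
  nodeAffinity:                              # local PVs must pin to their node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node1"]
```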
About Tor and Luuk:
Tor and Luuk are experienced engineers who have worked at ING for over 10 years, the last 5 of them in the Kubernetes area. They are specialized in and responsible for the Data Services OpenShift clusters in ING and have a strong focus on resilience, automation and security.
Implementing data and databases on K8s within the Dutch government - DoKC
A small walkthrough of projects within the Dutch government running data(bases) on OpenShift. This talk shares success stories, provides a proven recipe to `get it done`, and debunks some of the FUD.
About Sebastiaan:
I have always been a weird DBA, trying to combine databases with out-of-the-box thinking and a DevOps mindset. Around 2016 I fell in love with both Postgres and Kubernetes, and I then committed my life to enabling Dutch organisations to run their database workloads cloud natively.
Over the last few years I worked as a private contractor for 2 large government agencies doing exactly that, and I want to share my and others' success stories, hoping to enable and inspire Data on Kubernetes adoption.
https://go.dok.community/slack
https://dok.community/
Link: https://youtu.be/n_thXwyJNSU
ABSTRACT OF THE TALK
Deploying stateless applications is easy, but this is not the case for stateful applications. StatefulSets are the K8s API object that helps manage stateful applications. Learn what StatefulSets are, how to create them, and how they differ from Deployments.
KEY TAKE-AWAYS FROM THE TALK
This talk focuses on the basics of StatefulSets: how a StatefulSet differs from a Deployment, and how to manage a stateful app using a StatefulSet.
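A minimal sketch of the difference in practice: a StatefulSet paired with a headless Service gives each pod a stable identity (web-0, web-1, ...) and its own PersistentVolumeClaim, which a Deployment does not provide. Names here are illustrative:

```yaml
# Headless service: stable per-pod DNS (web-0.web, web-1.web, ...).
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None
  selector:
    app: web
  ports:
    - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:        # one PVC per pod; Deployments have no equivalent
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```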
Running PostgreSQL in Kubernetes: from day 0 to day 2 with CloudNativePG - Do... - DoKC
Link: https://youtu.be/cegd3Exg05w
https://go.dok.community/slack
https://dok.community/
Gabriele Bartolini - Vice President/CTO of Cloud Native and Kubernetes, EDB
ABSTRACT OF THE TALK
Imagine this: you have a virtual infrastructure based on Kubernetes, made up of virtual data centers, possibly spread across multiple Kubernetes clusters and regions. Your infrastructure could even be hosted on premises or on different cloud service providers. Infrastructure as Code is a requirement. You’ve been tasked with running Postgres databases alongside your applications.
The good news is that you can leverage a fully open source stack with Kubernetes, PostgreSQL and the CloudNativePG operator, and deploy your Postgres database in the same way you deploy applications.
Join me in this webinar to discover the key role that you have to make this succeed, starting from day 0 through day 2 operations.
I’ll share some examples and best practices for running Postgres databases in Kubernetes, before peeking at the new features we are developing for the months to come.
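For a flavor of this deployment model, here is a trimmed CloudNativePG Cluster resource based on the operator's public examples; a real setup also needs object-store credentials and further tuning, so treat it as a sketch:

```yaml
# Sketch: a 3-instance Postgres cluster with automated failover (day 0)
# and backups to object storage (day 2), managed by CloudNativePG.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3            # one primary, two replicas
  storage:
    size: 20Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://backups/pg-main   # bucket path is illustrative
      # s3Credentials omitted for brevity
```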
Analytics with Apache Superset and ClickHouse - DoK Talks #151 - DoKC
Link: https://youtu.be/Y-1uFVKDfgY
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
This talk concerns performing analytical tasks with Apache Superset, with ClickHouse as the data backend. ClickHouse is a super-fast database for analytical tasks, and Apache Superset is an Apache Software Foundation project for data visualization and exploration. Performing analytical tasks with this combo is extremely fast, since both tools are designed to be scalable and capable of handling data at petabyte scale.
Overcoming challenges with protecting and migrating data in multi-cloud K8s e... - DoKC
Link: https://youtu.be/EFaRyl4HmmE
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
If you are running or planning a multi-cloud or even a multi-cluster environment, there are several considerations in implementing a data protection solution – especially if you plan on an organic home-grown, do-it-yourself option. This talk will highlight challenges and best practices around centralized management of configuration, credentials, and compliance across multiple accounts, regions, providers, etc. We will also highlight the deviations in CSI driver implementations of various storage vendors and cloud providers. Finally, we will cover the various recovery options available in the market today.
Kubernetes cloud services are popular since they mitigate, but do not eliminate, the difficulties of operating a Kubernetes environment. This is especially true for protecting the stateful configuration and data of your Kubernetes applications, where the inherent high availability and infrastructure as code are not a substitute for having cloud-native backup and disaster recovery capabilities. Further, many companies now have multi-cloud strategies for their cloud-native applications. These challenges can be addressed with backup applications that are both Kubernetes-managed-service and multi-cloud aware, in order to snapshot, copy, restore, and migrate Kubernetes workloads (resources and data) running on AKS, EKS, and GKE. Capturing information from cloud accounts and how the cluster and storage resources are configured allows 1) centralized visibility into all cloud accounts and the clusters and resources in the accounts, including for compliance; 2) cross-account, cross-cluster, and cross-region data restores; 3) automation of the cluster and data restores, including for Dev, Test, and Production recovery use cases.
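As a concrete anchor for the CSI discussion, this is the snapshot building block most Kubernetes backup tools drive under the hood; the class and claim names are illustrative:

```yaml
# Sketch: a point-in-time CSI snapshot of a PVC, the unit that backup tools
# copy off-cluster and later restore from.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pgdata-snap-20230101
spec:
  volumeSnapshotClassName: csi-snapclass    # provided by your CSI driver
  source:
    persistentVolumeClaimName: pgdata       # the claim being protected
```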
BIO
Sebastian Glab is a Cloud Architect for CloudCasa and he resides in Poland. He is responsible for integrating the different cloud providers with the CloudCasa service, and making sure that all clusters in the cloud service get discovered and protected. In his free time, he plays volleyball and develops his own projects.
Martin Phan is the Field CTO in North America for CloudCasa by Catalogic Software. With over 20 years of experience in the software industry, he takes pride in supporting, developing, implementing, and selling enterprise software and data protection solutions to help customers solve their backup and recovery challenges.
KEY TAKE-AWAYS FROM THE TALK
1) Challenges and best practices around centralized management of configuration, credentials, compliance across multiple accounts, regions, providers etc.
2) Advantages of cloud awareness and Kubernetes managed service awareness for application and data recovery and security
3) Examples of overcoming Container Storage Interface (CSI) deviations
4) Various recovery options available in the market today.
Evaluating Cloud Native Storage Vendors - DoK Talks #147 - DoKC
Link: https://youtu.be/YVXEpcSclwY
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
In a continuation of a talk given at DoK Day at KubeCon EU 2022, join Dinesh Majrekar, Civo's CTO, as they walk through their evaluation process of the CNCF storage market.
Civo offers managed Kubernetes clusters powered by K3s to customers around the world. We manage thousands of Virtual Machines and stateful customer data within multiple data centres across several continents.
In late 2021, Civo had the opportunity to evaluate the CNCF storage landscape to move to a new technology stack. During the migration project, Civo evaluated Mayastor, Ondat, Ceph and Longhorn against the following metrics:
- Scalability
- Performance
- Ease of Support
Attendees will see practical examples of how they could carry out their own similar evaluation, and see some of the results of the Civo research project.
BIO
Dinesh is CTO at Civo. Having worked in the hosting industry for many years, Dinesh has a passion for creating solutions that operate at scale. This applies not only to the technology stack, but also to nurturing engineers through their careers.
Kubernetes Cluster Upgrade Strategies and Data: Best Practices for your State... - DoKC
Link: https://youtu.be/qUW8LkxYayc
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
How do you make sure your Stateful Workloads remain available when your Kubernetes infrastructure updates? This talk will discuss different strategies of upgrading a Kubernetes cluster, and how you can manage risk for your workload. The talk will showcase demos of each upgrade strategy.
BIO
Peter is a Senior Software Engineer on GKE at Google. He works on improving Kubernetes for Stateful workloads. His main focus is on enhancing the Kubernetes ecosystem for high availability applications.
KEY TAKE-AWAYS FROM THE TALK
The mechanics of different upgrade strategies, when to apply a particular upgrade strategy depending on your Stateful workload and how to mitigate risk to your application’s availability.
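One standard risk-mitigation mechanism that pairs with any of these upgrade strategies is a PodDisruptionBudget, which caps voluntary evictions during node drains; the sketch below uses illustrative names:

```yaml
# Sketch: never let a drain/upgrade voluntarily evict below 2 ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-database      # hypothetical stateful workload label
```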
We will Dok You! - The journey to adopt stateful workloads on k8sDoKC
Link: https://youtu.be/AjvwG53yLMY
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
Stateful workloads are the heart of any application, yet they remain confusing and complicated even to daily K8s practitioners. That’s why many organizations shy away from migrating their data - their prized possession - to the unfamiliar stateful realm of Kubernetes.
After meeting with many organizations in the adoption phase, I discovered what works best, what to avoid, and how critical it is to gain confidence and the right knowledge in order to successfully adopt stateful workloads.
In this talk I will demonstrate how to optimally adopt Kubernetes and stateful workloads in a few steps, based on what I’ve learned from observing dozens of different adoption journeys. If you are taking your first steps in data on K8s or contemplating where to start - this talk is for you!
BIO
- A Developer turned Solution Architect.
- Working at Komodor, a startup building the first K8s-native troubleshooting platform.
- Love everything in infrastructure: storage, networks & security - from 70’s era mainframes to cloud-native.
- All about “plan well, sleep well”.
KEY TAKE-AWAYS FROM THE TALK
- Understand how critical stateful workloads are for any system, and that the key challenges to migrating them to Kubernetes are knowledge and confidence.
- How to build the foundational knowledge required to overcome adoption challenges by creating a learning path for individuals and teams.
- How to gain confidence to run stateful workloads on Kubernetes with support from the community (and yourself!)
Mastering MongoDB on Kubernetes, the power of operators DoKC
Link: https://youtu.be/Pi5ueyl_1jU
https://go.dok.community/slack
https://dok.community/
ABSTRACT OF THE TALK
During my first talk for the DoK community, I want to walk you through the world of the NoSQL database MongoDB and its Kubernetes Operators: Community Edition, Enterprise Edition (MongoDB and Ops Manager on K8s), and the Atlas operator. I will highlight the most important capabilities and talk about use cases and challenges; the theory will be mixed with live demos!
BIO
I'm an SRE / NoSQL / DevOps professional. I hold the CKA, CKAD, and CKS certifications, and I'm a MongoDB Certified DBA and MongoDB Champion. I have experience with multiple cloud providers, Kubernetes, and different kinds of K8s operators (Strimzi, RabbitMQ Cluster Operator), but especially the MongoDB K8s Operator. I also work with KEDA. Since 2017, I have been a speaker at MongoDB conferences all around the world (USA, China, Europe).
KEY TAKE-AWAYS FROM THE TALK
I would like to share best practices for running a NoSQL database, MongoDB, on Kubernetes. I also want to show how to manage Atlas (MongoDB cloud) via the K8s operator.
https://www.mongodb.com/developer/community-champions/arkadiusz-borucki/
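For a flavor of what the Community operator manages, here is a trimmed sketch of its custom resource (not from the talk; a complete spec also declares users and their credential secrets, and all values here are illustrative):

apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: example-mongodb
spec:
  members: 3               # a three-member replica set
  type: ReplicaSet
  version: "6.0.5"         # illustrative MongoDB version
  security:
    authentication:
      modes: ["SCRAM"]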
2. Apache Druid
on Kubernetes
Apache Druid Database Overview
Kubernetes & Helm Charts
Apache Druid’s Helm Chart
Overview
Scaling Up and Down
Auto-Scaling Ingestion
What it Doesn’t (yet) Do
3. What is Apache Druid?
It is a database that is:
Fully scalable
Batch and real-time data
Ad-hoc statistical queries
Low latency delivery
4.-7. What is Apache Druid? (continued)
Druid combines capabilities from three kinds of systems:
log search: real-time ingest, flexible schema, text search
timeseries: low latency ingest, time-based storage, time functions
columnar: efficient storage, fast analytic queries, data distribution
The result: High Performance Real-time Analytics
8. Apache®, Apache Druid®, Druid®, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
The Architecture
9. The Druid Architecture
Overview & High Availability
[Architecture diagram]
Query Services: router, brokers (replicated)
Data Services: middle-managers and historicals (replicated)
Master Services: overlords and coordinators (replicated)
Deep Storage:
- HDFS
- S3, GCP, Azure
- local (test only)
10.-13. The Druid Architecture
Data Ingestion Processing
[Diagram sequence, building on the layout above:]
- Ingestion requests arrive through a REST API at the router and are handled by the overlords, which assign tasks to middle-managers.
- Streaming data and batch data flow into the middle-managers.
- As streams are added, additional middle-managers are brought in to absorb the ingestion load.
14. The Druid Architecture
Data Management Processing
[Diagram: middle-managers publish completed segments to Deep Storage; coordinators assign segments to historicals.]
15. The Druid Architecture
Query Processing
[Diagram: queries enter through the REST API at the router; brokers fan out to historicals and middle-managers and merge the results.]
18. Kubernetes Cluster
High Level Functions
Kubernetes Control Plane
➔ Acquire/manage Nodes and Storage
➔ Accept new object requests
➔ Schedule and manage containers on Nodes
➔ Instantiate containers for object deployment
➔ Monitor object state
➔ Apply application policies
◆ Restart policy
◆ Upgrade
◆ Fault tolerance
[Diagram: two example deployments]
namespace my-dev: a single Dev Node whose container runtime hosts every service side by side: zookeeper 2.1.4, postgresql 8.6.4, and the druid 0.22.1 coordinator, overlord, broker, router, historical, and middle-manager containers.
namespace qa-test: dedicated Master, Query, Data, and Realtime Nodes, each with its own operating system and container runtime, running zookeeper 2.1.4, postgresql 8.6.4, and the druid 0.22.1 services (coordinator, overlord, broker, router, historical, middle-manager) spread across nodes for high availability.
19. Why Apache Druid on Kubernetes
Kubernetes provides Orchestration at Scale
● High Availability
○ Recovery - actively monitors and restarts pods when appropriate
○ AntiAffinity - ensures no single point of failure by placing replicated services on separate nodes
○ Persistent storage enables fast Historical recovery
● Scalability
○ Manage each component’s scale by changing one property
○ Autoscaling based on resource utilization
● Security
○ Encryption
○ Ingress control & network isolation
● Upgrades
○ Roll out changes automatically and with controlled disruption
20. Helm Charts
A Parameterization of Complex Deployments
In general, a Helm chart is a set of templates that describe Kubernetes objects which, in turn, provide services & applications.
Apache Druid ® helm chart @ https://github.com/apache/druid/tree/master/helm/druid
- Dependencies - zookeeper, plus postgresql or mysql
- Templates for each microservice (historical, broker, middleManager, etc.)
- Default values.yaml - the parameters for an installation
Users override values to create different deployments with their own values.yaml:
historical:
  replicaCount: 10    # scale of historical data
middleManager:
  replicaCount: 6     # scale of real-time ingestion
21. Apache Druid Helm Chart
Template Objects
[Diagram: Query Services (router, brokers) highlighted]
● Deployment - manages a set of stateless pods and keeps them running
● Ingress - outside access
● Service - logical, persistent network access over an HTTP(S) port
22. Apache Druid Helm Chart
Template Objects
[Diagram: Data Services (middle-managers) highlighted]
● StatefulSet - local files hold intermediate ingestion state, so stateful pods help jobs pick up where they left off
● PodDisruptionBudget - determines how many pods can be offline at a time -> controlled upgrades
● Service - logical, persistent access over an HTTP(S) port
● Horizontal Pod Autoscaler - controls autoscaling
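For reference, the HPA the chart manages for middle-managers looks roughly like this (a sketch, not the chart's literal rendered output; names and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: druid-middle-manager
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: druid-middle-manager   # the middle-manager StatefulSet above
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU exceeds 60%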
23. Apache Druid Helm Chart
Template Objects
[Diagram: Data Services (historicals) highlighted]
● StatefulSet - persistent storage is extremely important at recovery time => very fast recovery
● PodDisruptionBudget - determines how many pods can be offline at a time -> controlled upgrades
● Service - logical, persistent access over an HTTP(S) port
● No HPA - autoscaling is not a good idea here, since historicals hold large local segment caches that are expensive to rebuild on scale-out/in
24. How to Use - Druid Helm Chart
A very simple example
An example helps, so if we deploy vanilla:
> git clone https://github.com/apache/druid
> cd druid
> helm dependency update helm/druid
> helm install a-druid helm/druid -n a-space --create-namespace
> kubectl get pods -n a-space
NAME READY STATUS RESTARTS AGE
druid-broker-744c5f46b7-5l4r7 1/1 Running 0 8m12s
druid-coordinator-7c79f9c6c9-9wlqk 1/1 Running 0 8m12s
druid-historical-0 1/1 Running 0 8m12s
druid-middle-manager-0 1/1 Running 0 8m12s
druid-postgresql-0 1/1 Running 0 8m12s
druid-router-84d7cc6d87-w546r 1/1 Running 0 8m12s
druid-zookeeper-0 1/1 Running 0 8m12s
druid-zookeeper-1 1/1 Running 0 7m41s
druid-zookeeper-2 1/1 Running 0 7m10s
25. How to Use - Druid Helm Chart
A very simple example
Create a change file like values_2_historicals.yaml:
historical:
  replicaCount: 2    # scale of historical data
Best practice (requires the helm-diff plugin):
> helm diff upgrade -C 2 a-druid helm/druid -n a-space -f values_2_historicals.yaml
reading three way merge from env
default, druid-historical, StatefulSet (apps) has changed:
...
spec:
serviceName: druid-historical
- replicas: 1
+ replicas: 2
selector:
matchLabels:
26. How to Use - Druid Helm Chart
A very simple example
Apply the change:
> helm upgrade a-druid helm/druid -n a-space -f values_2_historicals.yaml
> kubectl get pods -n a-space
NAME READY STATUS RESTARTS AGE
druid-broker-744c5f46b7-5l4r7 1/1 Running 0 13m
druid-coordinator-7c79f9c6c9-9wlqk 1/1 Running 0 13m
druid-historical-0 1/1 Running 0 13m
druid-historical-1 0/1 Running 0 23s
druid-middle-manager-0 1/1 Running 0 13m
druid-postgresql-0 1/1 Running 0 13m
druid-router-84d7cc6d87-w546r 1/1 Running 0 13m
druid-zookeeper-0 1/1 Running 0 13m
druid-zookeeper-1 1/1 Running 0 13m
druid-zookeeper-2 1/1 Running 0 12m
27. Configuration with Helm Chart
[Architecture diagram, annotated with the pluggable dependencies]
Deep Storage: s3, local, hdfs
Metadata DB: postgresql, mysql
my_values.yaml:
configVars:
  druid_storage_type
(See also more properties @ https://druid.apache.org/docs/latest/configuration/index.html#deep-storage)
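For example, switching deep storage to S3 could look like this in an override file (a sketch: the property names follow Druid's documented deep-storage settings, with underscores standing in for dots; the bucket and credentials are placeholders):

configVars:
  druid_storage_type: s3
  druid_storage_bucket: my-druid-segments    # hypothetical bucket
  druid_storage_baseKey: druid/segments
  druid_s3_accessKey: <access-key>
  druid_s3_secretKey: <secret-key>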
28. Configuration with Helm Chart
[Same diagram, with the metadata DB settings added]
my_values.yaml:
configVars:
  druid_metadata_storage_type
  …connector_connectURI
  …connector_user
  …connector_password
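Filled in for an external PostgreSQL metadata store, that fragment might read (a sketch; the full property names follow Druid's documented metadata-storage configuration, and the URI and credentials are placeholders):

configVars:
  druid_metadata_storage_type: postgresql
  druid_metadata_storage_connector_connectURI: jdbc:postgresql://pg.example.com:5432/druid
  druid_metadata_storage_connector_user: druid
  druid_metadata_storage_connector_password: <password>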
29. Configuration with Helm Chart
[Same diagram]
my_values.yaml:
<service>:
  resources:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
(A great resource to determine good values is https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html)
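Any of the service blocks accepts this shape; for instance, a historical sized with the illustrative numbers above (see the tuning guide for real sizing):

historical:
  resources:
    requests:
      cpu: 250m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi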
30. Data Ingestion and Helm Chart
[Diagram: ingestion requests entering through the REST API at the router, handled by redundant overlords and coordinators]
my_values.yaml:
router:
  replicaCount: 2
  ingress:
    enabled: true
my_values.yaml:
overlord:
  replicaCount: 2
coordinator:
  replicaCount: 2
31. Data Ingestion and Helm Chart
[Diagram: middle-managers receiving ingestion tasks]
my_values.yaml:
middleManager:
  replicaCount: 2
  antiaffinity
  nodeSelector
  config:
    druid_indexer_runn…
    druid_indexer_fork…
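Spelled out, a middle-manager override might look like this (a sketch; these are standard Druid indexer properties written in the chart's underscore form, and the values are illustrative, not the ones elided on the slide):

middleManager:
  replicaCount: 2
  config:
    druid_indexer_runner_javaOptsArray: '["-server","-Xms1g","-Xmx1g"]'   # JVM opts for each task
    druid_indexer_fork_property_druid_processing_numThreads: "2"          # threads per task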
32. Highly Available Data Ingestion
[Diagram: two replicas of each streaming task placed on separate middle-managers]
kafka_ingestion.json:
{
  …
  "ioConfig": {
    "taskCount": 2,
    "replicas": 2,
    "taskDuration": "PT1H"
  }
}
33. Data Ingestion and Helm Chart
[Diagram: additional middle-managers absorbing new streams]
my_values.yaml (fixed scale):
middleManager:
  replicaCount: 6
my_values.yaml (autoscaled):
middleManager:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 6
    metrics:
      memory and cpu thresholds
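The "memory and cpu thresholds" are standard HPA resource metrics; a plausible fragment, assuming the chart passes the metrics list straight through to the generated HPA (thresholds illustrative):

middleManager:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 6
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60    # add pods above 60% average CPU
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 70    # add pods above 70% average memory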
34. Historicals and Helm Chart
Data Management Processing
[Diagram: middle-managers publishing segments to Deep Storage; coordinators assigning them to historicals]
my_values.yaml:
historical:
  replicaCount: 2
  antiaffinity
  nodeSelector
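For instance, nodeSelector can pin historicals onto a dedicated node group while anti-affinity keeps replicas on separate nodes (a sketch; the node label and pod labels are assumptions, and this presumes the chart exposes standard nodeSelector/affinity fields):

historical:
  replicaCount: 2
  nodeSelector:
    node-group: druid-data            # hypothetical label on the data nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: druid              # assumed chart labels
              component: historical
          topologyKey: kubernetes.io/hostname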
35. Query Processing & Helm Chart
[Diagram: queries entering through the REST API at the router; brokers fanning out to historicals and middle-managers]
my_values.yaml:
broker:
  replicaCount: 3
  antiaffinity
  nodeSelector
36. Summarizing
● Apache Druid is a real-time OLAP database
● Kubernetes makes deploying and managing the database easier
○ Increased availability (monitored, auto-recovered, persistent)
○ Better RTO and RPO
○ Autoscaled components for ingestion and real-time query
● helm install makes it easy to deploy many different configurations:
○ Create and manage a different values.yaml for each config:
■ dev-min-cluster.yaml
■ qa-ha-cluster.yaml
■ prod-ha-cluster-autoscaling.yaml
○ Changes to the configs can be applied live with helm diff and helm upgrade
■ Not just scaling - rolling upgrades too
37. What doesn’t it do? What can you do to help?
● Metrics configuration - enable metrics collection and display
○ Metrics are part of Apache Druid
○ Metric emitters have been contributed by the community
■ Influxdb-metrics-emitter, prometheus-emitter, kafka-emitter… and many more
○ The Helm chart could use a set of options to turn on metrics and enable specific emitters
● Multi-tier configurations are not yet enabled
○ Apache Druid supports multiple temperature levels, i.e.
■ high speed SSDs vs. high volume HDDs
○ The Helm chart could use a dynamic tier configuration mechanism
● The Apache Druid Community:
○ You are invited!
○ Fork the repo at https://github.com/apache/druid
○ Make your changes
○ Submit a PR!
38. ASF Slack
#druid
Google Groups
https://groups.google.com/forum/#!forum/druid-user
Druid Meetups
https://www.meetup.com/pro/apache-druid/
Druid News & Info
@druidio #apachedruid @implydata
Druid Professionals Group
https://www.linkedin.com/groups/8791983/
Druid User Forum by Imply
https://www.druidforum.org
Imply Community Team
community@imply.io
&
Imply Training Program
https://learn.imply.io
39. Apache®, Apache Druid®, Druid®, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
Thank you