This document provides an overview of how to deploy a SQL Server 2019 Big Data Cluster on Kubernetes. It discusses setting up infrastructure with Ubuntu templates, installing Kubespray to manage the Kubernetes cluster lifecycle, and using azdata to deploy the Big Data Cluster. Key steps include creating an Ansible inventory, configuring storage with labels and profiles, and deploying the cluster. The document also offers tips on sizing, upgrades, and next steps like load balancing and monitoring.
2. Who Am I ?
▪ SQL Server Solution Architect at Pure Storage
▪ SQL Server user for 20 years
▪ Was heavily involved in the SQL Server 2019 EAP
▪ Co-author of the Microsoft workshop:
Big Data Clusters: From Bare Metal to Kubernetes
3. Why This Session ?
I’d like to deploy a Big Data Cluster,
are there any gotchas
I need to be aware of ?
Most orgs are familiar with Windows
and VMware as platforms, Kubernetes
and Linux, not so much
5. What We Will End Up With
Cluster Build Host
K8s Master 1
K8s Master 2
K8s Worker 1
K8s Worker 2
K8s Worker 3
kubespray, ansible, git, kubectl and azdata
Kubernetes cluster
SQL Server 2019
Big Data Cluster running
on the three worker
nodes
3 node etcd cluster
10. Template Creation – ISO
▪ Get Ubuntu 16.04 AMD 64 Server Image
https://releases.ubuntu.com/16.04/
▪ Upload image to your VMware ISO data
store
▪ Create a virtual machine with a DVD drive
that boots from this ISO
▪ Next up creating an Ubuntu guest
12. sudo apt-get install --install-recommends linux-generic-hwe-16.04 -y
DO THIS BEFORE YOU CREATE YOUR
KUBERNETES CLUSTER ON EACH NODE HOST,
OTHERWISE YOU WILL BREAK YOUR CLUSTER
Kernel Update Gotcha
13. Post Seed VM Creation Steps
▪ sudo apt-get update
▪ sudo apt-get install yamllint
▪ sudo reboot
▪ VMware vCenter -> virtual machine -> Template -> Convert to Template
14. Ubuntu VM
Template
Cluster Build Host
K8s Master 1
K8s Master 2
K8s Worker 1
K8s Worker 2
K8s Worker 3
Infrastructure Build Out From The Template
As we create each host, we need to do two things:
▪ Give each host a unique name
▪ Give each host a unique ip address
Tip: We could do this with Terraform and the VMware provider (very popular)
16. iSCSI Gotcha
▪ If you are using an iSCSI based storage solution and cloned virtual machines . . .
▪ InitiatorName value in /etc/iscsi/initiatorname.iscsi needs to be unique for each node host
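One way to catch this after cloning is to gather the InitiatorName line from every node and look for repeats. The sketch below is an illustration only: the function name find_duplicate_initiators and the host names are hypothetical, and the commented-out collection loop assumes ssh and sudo access from the build host.

```shell
# Print any InitiatorName value that appears on more than one host.
# Input (stdin): one "<hostname> <InitiatorName=...>" pair per line.
find_duplicate_initiators() {
  awk '{ count[$2]++; hosts[$2] = hosts[$2] " " $1 }
       END { for (n in count) if (count[n] > 1) print n ":" hosts[n] }'
}

# Hypothetical collection step, run from the build host:
#   for h in k8s-worker-1 k8s-worker-2 k8s-worker-3; do
#     echo "$h $(ssh "$h" sudo grep '^InitiatorName=' /etc/iscsi/initiatorname.iscsi)"
#   done | find_duplicate_initiators
#
# To give a cloned node a fresh name (iscsi-iname ships with open-iscsi):
#   echo "InitiatorName=$(iscsi-iname)" | sudo tee /etc/iscsi/initiatorname.iscsi
```

No output means every node reported a distinct name. After changing a name, the iSCSI daemon on that node needs a restart before the new value is used.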
17. IP Address Configuration
1. Get the name of your network adapter; it should be prefixed by ens
For iSCSI storage, you will need two adapters – here we just have the one
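A minimal sketch of how the adapter name could be listed, assuming the iproute2 `ip` command is available on the node. The helper name ens_adapters is made up for illustration; it only filters the one-line-per-interface output of `ip -o link show`.

```shell
# Reads `ip -o link show` output on stdin and prints the names of
# interfaces that start with "ens".
ens_adapters() {
  awk -F': ' '$2 ~ /^ens/ { print $2 }'
}

# Usage on a node host:
#   ip -o link show | ens_adapters
```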
18. IP Address Configuration
2. Edit the network configuration file /etc/network/interfaces (Ubuntu 16.04 uses ifupdown rather than netplan)
auto <primary network interface>
iface <primary network interface> inet static
address <ip address>
netmask <netmask>
gateway <gateway ip address>
iface <secondary network interface> inet static
address <ip address>
netmask <netmask>
dns-nameservers <ip address>
Secondary NIC required,
if iSCSI storage is used
20. Kubespray – What Is It ?
▪ A tool based on Ansible playbooks and kubeadm
for managing a Kubernetes cluster’s life cycle:
▪ Cluster creation
▪ Cluster removal
▪ Upgrading a cluster
▪ Adding nodes
▪ Removing nodes
▪ Rebuilding master nodes
▪ Etc . . .
24. Kubespray – Create An Ansible Inventory
▪ cp -r kubespray/inventory/sample kubespray/inventory/<cluster name>
▪ Edit the inventory.ini file
▪ Inventory file path:
kubespray/inventory/<cluster name>/inventory.ini
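For the two master, three worker layout described earlier, an inventory might look like the sketch below. Everything specific in it is an assumption: the host names, the 192.0.2.x addresses (a documentation range), the choice of which three hosts run etcd, and the group names, which follow the sample inventory shipped with Kubespray releases of the Kubernetes 1.19 era (newer releases rename the groups, so check the sample in your checkout).

```shell
# Print a hypothetical inventory.ini for 2 masters and 3 workers.
render_inventory() {
  cat <<'EOF'
[all]
k8s-master-1 ansible_host=192.0.2.11 ip=192.0.2.11
k8s-master-2 ansible_host=192.0.2.12 ip=192.0.2.12
k8s-worker-1 ansible_host=192.0.2.21 ip=192.0.2.21
k8s-worker-2 ansible_host=192.0.2.22 ip=192.0.2.22
k8s-worker-3 ansible_host=192.0.2.23 ip=192.0.2.23

[kube-master]
k8s-master-1
k8s-master-2

[etcd]
k8s-master-1
k8s-master-2
k8s-worker-1

[kube-node]
k8s-worker-1
k8s-worker-2
k8s-worker-3

[k8s-cluster:children]
kube-master
kube-node
EOF
}

# Usage:
#   render_inventory > kubespray/inventory/<cluster name>/inventory.ini
```

An odd number of etcd members is used so that the etcd cluster can keep quorum if one member is lost.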
25. Kubespray – Configure ssh Connectivity
The following commands are all to be run on the server hosting ansible
▪ ssh-keygen
▪ Carry the following out for each node host:
ssh-copy-id <username>@<hostname>
▪ ssh-agent /bin/bash
▪ ssh-add ~/.ssh/id_rsa
▪ Test ssh connectivity from the ansible server:
ansible -i inventory/<cluster name>/inventory.ini all -m ping
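The "for each node host" step can be driven from the inventory rather than typed by hand. This is a sketch under the assumption that the inventory has an [all] section with one host per line, as in the Kubespray sample; inventory_hosts is a hypothetical helper name.

```shell
# Print the host names listed under [all] in an inventory.ini read from stdin,
# skipping blank lines and comment lines.
inventory_hosts() {
  awk '/^\[all\]/ { in_all = 1; next }
       /^\[/      { in_all = 0 }
       in_all && NF && $1 !~ /^#/ { print $1 }'
}

# Usage from the kubespray directory on the ansible server:
#   for host in $(inventory_hosts < inventory/<cluster name>/inventory.ini); do
#     ssh-copy-id "<username>@$host"
#   done
```

This relies on the host names in the inventory being resolvable from the ansible server; if they are not, use the addresses from the ansible_host values instead.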
26. Storing ssh Passphrases With keychain
On the server you intend to run Kubespray from:
▪ sudo apt install keychain
▪ Add the following two lines to your .bashrc file, ~cadkin/.bashrc in my case:
/usr/bin/keychain $HOME/.ssh/id_rsa
source $HOME/.keychain/$HOSTNAME-sh
30. Post Deployment Steps
▪ Install kubectl on Kubespray server:
sudo snap install kubectl --classic
▪ Create directory on Kubespray server to hold context:
cd ~
mkdir .kube
▪ ssh onto a master node in the cluster (admin.conf only resides on master node hosts) and then run:
sudo chmod 755 /etc/kubernetes/admin.conf
▪ On the Kubespray server:
sudo scp <username>@<hostname>:/etc/kubernetes/admin.conf ~/.kube/config
▪ ssh back onto the master node you copied the admin.conf file from and issue:
sudo chmod 620 /etc/kubernetes/admin.conf
31. Some Quick Post Cluster Creation Sanity Checks
▪ Check the health of the cluster nodes
kubectl get nodes -o wide
▪ Check the health of the system pods
kubectl get po -n kube-system
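With more than a handful of nodes it can help to print only the ones that need attention. The helper below is a sketch, not part of kubectl: not_ready_nodes is a made-up name, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME first, STATUS second).

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints the name
# of every node whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

A cordoned node shows as "Ready,SchedulingDisabled" and is therefore listed too, which is usually what you want before deploying a Big Data Cluster.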
32. A Word On Storage
▪ We need a storage plugin that supports persistent volumes
▪ Never ever use ephemeral storage in production
▪ Free options:
Portworx Essentials
VMware Cloud Native Storage
33. Check That You Have A Storage Plugin Installed
kubectl get sc
34. Perform A Simple Test
test-pvc.yaml file contents:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-pvc
spec:
  storageClassName: <storage class>
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
And then . . .
kubectl apply -f test-pvc.yaml
kubectl get pvc
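The test passes when the claim reaches the Bound state. A small sketch of that check follows; pvc_is_bound is a hypothetical helper that assumes the default column layout of `kubectl get pvc --no-headers` (NAME first, STATUS second).

```shell
# Reads `kubectl get pvc --no-headers` output on stdin.
# Exit status 0 if the claim named in $1 is Bound, 1 otherwise.
pvc_is_bound() {
  awk -v name="$1" '$1 == name && $2 == "Bound" { found = 1 }
                    END { exit !found }'
}

# Usage:
#   kubectl get pvc --no-headers | pvc_is_bound test-pvc && echo "storage plugin OK"
#   kubectl delete pvc test-pvc    # clean up the test claim afterwards
```

If the storage class uses the WaitForFirstConsumer volume binding mode, the claim stays Pending until a pod mounts it, so a Pending status on its own does not prove the plugin is broken.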
36. Sizing Your Cluster
Can you give me a reference
architecture for the infrastructure I
need for a Big Data Cluster ?
What you need really depends on
your workload, but . . .
37. Storage Gotchas
Persistent volume extension
As of CU6 persistent volumes (PVs) cannot be resized
through either azdata or Azure Data Studio
Pro tip: size PVs upfront to allow for data growth
39. Working With Configuration Profiles
▪ Create a profile
azdata bdc config init --path ca-bdc-kubeadm-dev-test --source kubeadm-dev-test
▪ Specify the storage class for data
azdata bdc config replace --path ca-bdc-kubeadm-dev-test/control.json \
  --json-values "$.spec.storage.data.className=pure-block"
▪ Specify the size for data persistent volumes
azdata bdc config replace --path ca-bdc-kubeadm-dev-test/control.json \
  --json-values "$.spec.storage.data.size=10Gi"
▪ Specify the storage class for logs
azdata bdc config replace --path ca-bdc-kubeadm-dev-test/control.json \
  --json-values "$.spec.storage.logs.className=pure-block"
▪ Specify the size for log persistent volumes
azdata bdc config replace --path ca-bdc-kubeadm-dev-test/control.json \
  --json-values "$.spec.storage.logs.size=5Gi"
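The four storage settings can also be assembled into one argument. The sketch below assumes that --json-values accepts a comma-separated list of path=value pairs, which is how the azdata reference describes the option; verify this against your azdata version before relying on it. storage_json_values is a hypothetical helper, and it uses the same storage class for data and logs, as the commands above do.

```shell
# Build one --json-values argument covering the data and log storage settings.
# $1 = storage class, $2 = data volume size, $3 = log volume size
storage_json_values() {
  printf '$.spec.storage.data.className=%s,$.spec.storage.data.size=%s,$.spec.storage.logs.className=%s,$.spec.storage.logs.size=%s\n' \
    "$1" "$2" "$1" "$3"
}

# Usage:
#   azdata bdc config replace --path ca-bdc-kubeadm-dev-test/control.json \
#     --json-values "$(storage_json_values pure-block 10Gi 5Gi)"
```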
40. Configuring The HDFS Replication Factor
azdata bdc config replace --path ca-bdc-kubeadm-dev-test/bdc.json \
  --json-values '$.spec.services.hdfs.settings={"hdfs-site.dfs.replication":"1"}'
▪ By default data is replicated three times
▪ If the storage platform has built-in resilience, e.g. erasure coding, we can . . .
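Shell quoting matters for the inline JSON in the replication factor setting. The snippet below shows, using plain shell variables only, what a POSIX shell passes on when the whole value is wrapped in double quotes versus single quotes; the variable names are illustrative.

```shell
# Double quotes outside: the inner double quotes end and restart the string,
# so they are removed and the JSON is no longer valid.
double_quoted="$.spec.services.hdfs.settings={"hdfs-site.dfs.replication":"1"}"

# Single quotes outside: the inner double quotes are kept literally.
single_quoted='$.spec.services.hdfs.settings={"hdfs-site.dfs.replication":"1"}'

printf '%s\n' "$double_quoted"
printf '%s\n' "$single_quoted"
```

The first line printed has lost its inner quotes, while the second keeps the JSON intact, so the single-quoted form is the one to pass to --json-values.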
44. We’ve Covered The Basics - Where To Next ?
▪ Load balancer installation and configuration - MetalLB is the easiest option
▪ Deploying the Kubernetes dashboard in a secure manner
▪ Backup and recovery
▪ Using production profiles which include HA and Active Directory integration
▪ Kubernetes cluster upgrades
▪ Monitoring a Kubernetes cluster via its built-in Prometheus exporter
45. Bill Of Materials
Component                          Version
VMware vSphere                     6.7
Linux distribution                 Ubuntu server edition 16.04.7 LTS
Linux kernel                       4.15.0-118-generic
Kubernetes                         1.19.1
SQL Server 2019 Big Data Cluster   CU6
Kubernetes storage plugin          Pure Service Orchestrator 6.0.2