Paul Dix, CTO and co-founder of InfluxData, discussed the future of InfluxDB and the release of InfluxDB 2.0 Open Source. He explained that InfluxDB 2.0 has been rebuilt from the ground up to address limitations of the original InfluxDB, such as its lack of distributed features and poor performance on high-cardinality analytics data. The new database, called InfluxDB IOx, uses a columnar data store with Parquet files and is designed to be distributed, federated, and able to run analytics at scale on high-cardinality data.
In Pravega's first community meeting as a CNCF project, we overviewed experimental features of Pravega:
* Schema Registry - preserving the structure of data in an unstructured storage system and controlling for safe schema evolution
* Consumption-Based Retention - stream truncation based on subscriber positions
* Simplified Long-Term Storage (SLTS) - abstracting the distributed management of segments while removing complicated problems such as fencing
* SLTS Plugin for BookKeeper - an implementation of the SLTS interfaces for BlobIt! object stores on BookKeeper: https://github.com/diegosalvi/pravega-blobit-chunkmanager
Cloudian HyperStore offers 100% S3 compatibility for low-cost, scalable smart object storage.
With HyperStore 6.0, we are focused on bringing down operational costs so that you can more effectively track, manage, and optimize your data storage as you scale.
In object storage systems, objects are tagged with unique identifiers. Any modification creates a new object, and with the common method of consistent hashing, some disks may end up being used more heavily than others.
Cloudian HyperStore utilizes dynamic object routing to even out average disk utilization and prevent a select number of disks from being overworked.
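The skew that motivates dynamic routing is easy to reproduce with a toy consistent-hashing ring. This is a minimal sketch, not Cloudian's actual routing algorithm; the disk names and object IDs are made up, and no virtual nodes are used precisely so the uneven arcs show up:

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def ring_position(key: str) -> int:
    """Map a key onto a 32-bit hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

# Place a few disks on the ring (no virtual nodes, to make the skew visible).
disks = ["disk-a", "disk-b", "disk-c", "disk-d"]
ring = sorted((ring_position(d), d) for d in disks)
positions = [p for p, _ in ring]

def owner(object_id: str) -> str:
    """An object belongs to the first disk clockwise from its hash."""
    i = bisect_right(positions, ring_position(object_id)) % len(ring)
    return ring[i][1]

# Hash 10,000 object IDs and count how many land on each disk.
load = Counter(owner(f"object-{n}") for n in range(10_000))
print(load)  # typically far from a uniform 2,500 per disk
```

Real systems mitigate this with many virtual nodes per disk, or, as described above, by dynamically rerouting objects away from hot disks.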
Introduction to Container Storage Interface (CSI) - Idan Atias
Among the cool stuff we do at Silk, my colleagues and I develop the Silk CSI Plugin for customers who use our system as the storage layer for their Kubernetes workloads.
Before deep diving into the code, and as part of my ramp-up on this subject, I prepared some slides that cover some basic and important information on this topic.
These slides start by recapping some basic storage principles in containers and Kubernetes, continue with some more advanced use cases (including an "offline demo" of persisting Redis data on EBS volumes), and end with detailed information on the CSI solution itself.
IMHO, reviewing these slides can improve your understanding of this matter and get you started implementing your own CSI plugin.
The main sources of information I used for preparing these slides are:
* Official CSI docs
* Kubernetes Storage Lingo 101 - Saad Ali, Google
* Container Storage Interface: Present and Future - Jie Yu, Mesosphere, Inc.
Learn more about Cloudian HyperStore's various features and benefits, including 100% S3 compatibility, multi-tenancy, data protection, and proactive data rebuilding.
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures - Andrey Kudryavtsev
RSC's BasIS storage orchestration platform addresses complications with deploying DAOS storage. It simplifies DAOS deployment by dynamically composing DAOS clusters from servers' NVMe and PMEM resources over a fabric. This composable disaggregated approach provides flexibility to use PMEM nodes for different roles like DAOS or databases. The orchestration significantly improves on DAOS by making it deployable on existing heterogeneous servers and suitable for cloud environments. Performance tests show NVMe-over-Fabric with the orchestrator achieves similar throughput to local NVMe drives.
Cronicle is a multi-server task scheduler that can run jobs across a fleet of machines. Storreduce is a cloud storage deduplication solution that can reduce storage usage by up to 99% when backing up data to cloud object storage like S3. The proposed backup solution uses Cronicle to schedule backups, Storreduce for data deduplication, and named pipes for high-speed data transfer between servers and to S3. Differential backups are performed to reduce backup sizes and bandwidth usage.
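The savings from deduplicating repetitive backups can be illustrated with a toy block-level store. This is a simplified sketch, not Storreduce's actual algorithm (which would use content-defined chunking and an object-store backend); the `DedupStore` class and file names are made up for illustration:

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store: identical chunks are kept once."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # sha256 digest -> chunk bytes
        self.objects = {}  # object name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only store payload for chunks we have not seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.objects[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.objects[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
backup = b"A" * 16384 + b"B" * 4096       # highly repetitive, like daily backups
store.put("monday.bak", backup)
store.put("tuesday.bak", backup)          # a second full backup adds ~nothing
print(store.stored_bytes(), "bytes stored for", 2 * len(backup), "bytes written")
```

Because consecutive backups share nearly all their chunks, only the unique chunk payloads consume storage, which is where the "up to 99%" reduction comes from.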
Keeping your application’s latency SLAs no matter what - ScyllaDB
Businesses that once measured performance in seconds now measure it down to the millisecond and even the microsecond in order to provide optimal user experience.
For a NoSQL database few things are more important than keeping latencies low and bounded. Yet some databases suffer latency spikes from such regular occurrences as Java Virtual Machine (JVM) “garbage collection,” context switches, database repair, cache flushes and so on. This makes long-tail latency very tricky to diagnose and fix, as it’s often a “whack-a-mole” exercise.
In this session, we will cover:
The systemic causes of latency spikes
How to keep latencies bounded and predictable
How to manage latency-inducing events
How Scylla helps optimize for 99% latency of <1msec
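The gap between average and tail latency that this session targets is easy to see numerically. The sketch below uses simulated latencies (the distribution is made up to mimic rare GC-like pauses) and a nearest-rank percentile, so the exact figures are illustrative only:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

random.seed(7)
# Simulated request latencies in ms: mostly fast, with rare long pauses.
latencies = [random.uniform(0.2, 0.9) for _ in range(9_900)]
latencies += [random.uniform(50, 200) for _ in range(100)]   # 1% spikes

print(f"mean  = {sum(latencies) / len(latencies):.2f} ms")
print(f"p50   = {percentile(latencies, 50):.2f} ms")
print(f"p99   = {percentile(latencies, 99):.2f} ms")
print(f"p99.9 = {percentile(latencies, 99.9):.2f} ms")
```

Note how a mere 1% of spiky requests barely moves the median yet dominates the higher percentiles, which is why SLAs are stated at p99 or beyond rather than as averages.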
DataStax recently announced the general availability of DataStax Enterprise 4.7 (DSE 4.7), the leading database platform purpose-built for the performance and availability demands of web, mobile, and IoT applications. In this product launch webinar, Robin Schumacher, VP of Products, explores the wide range of enhancements in DSE 4.7 including enterprise-class search, analytics, and in-memory.
Zabbix was experiencing performance issues due to large history tables in the database. To address this, the architecture was changed to store history data in Elasticsearch instead of database tables. This improved scalability and performance. The basic item and event data remained in the MariaDB database cluster. Zabbix proxies were also used to distribute load across multiple network segments. With this new architecture, history data is indexed in Elasticsearch without database tables, improving query speed and reducing database size.
Cisco: Cassandra adoption on Cisco UCS & OpenStack - DataStax Academy
In this talk we will address how we developed our Cassandra environments using the Cisco UCS OpenStack platform with DataStax Enterprise Edition software. In addition, we are utilizing open source Ceph storage in our infrastructure to optimize performance and reduce costs.
How to Protect Big Data in a Containerized Environment - BlueData, Inc.
Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has recently rolled out Transparent Data Encryption (TDE) support in HDFS. Transparent Data Encryption refers to the process whereby data is transparently encrypted by the big data application writing the data; it is not decrypted again until it is accessed by another application. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by a processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta... - DataStax
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE), which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform, providing a developer-friendly deployment experience. He has 15+ years of experience working in a variety of roles, such as software engineer, project manager, and product manager. Ravi received a Master's degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
Reporting from the Trenches: Intuit & Cassandra - DataStax
Rekha Joshi presents on how Intuit uses the Cassandra database to enable personalized A/B testing and improve customer experiences. Intuit handles large volumes of customer data and required a database with high security, scalability, availability and tunable performance. Cassandra met these requirements and became Intuit's standard NoSQL database. Rekha discusses how Intuit leverages Cassandra's capabilities and provides best practices for effective Cassandra usage, configuration, and performance tuning.
Webinar: How to build a highly available time series solution with KairosDB - Julia Angell
A highly available time-series solution requires an efficient tailored front-end framework and a backend database with a fast ingestion rate. In this webinar, you'll learn the steps for building an efficient TSDB solution with Scylla and KairosDB, get real-world use cases and metrics, plus considerations when choosing time series solutions.
Why you need benchmarks
Finding the right database solution for your use case can be an arduous journey. The database deployment touches aspects of throughput performance, latency control, high availability and data resilience.
You will need to decide on the infrastructure to use: Cloud, on-premise or a hybrid solution.
Data models also have an impact on finding the right fit for the use case. Once you establish a requirements set, the next step is to test your use case against the databases of choice.
In this workshop, we will discuss the different data points you need to collect in order to get the most realistic testing environment.
We will cover:
Data model impact on performance and latency
Client behavior related to database capabilities
Failover and high availability testing
Hardware selection and cluster configuration impact
We will show 2 benchmarking tools you can use to test and benchmark your clusters to identify the optimal deployment scenario for your use case.
Attend this virtual workshop if you are:
Looking to minimize the cost of your database deployment
Making a database decision based on performance and scale data
Planning to emulate your workload on a pre-production system where you can test, fail fast and learn.
Data Pipelines with Spark & DataStax EnterpriseDataStax
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
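The role Kafka plays in that streaming pipeline, decoupling producers from consumers through a shared buffer, can be sketched with a bounded in-process queue. This is an illustration of the pattern only, not Kafka or DSE code; the event values and the toy "analytics" step are made up:

```python
import queue
import threading

# A bounded queue stands in for the Kafka topic: producers and consumers
# only share this buffer and never call each other directly.
topic = queue.Queue(maxsize=100)
results = []
SENTINEL = None  # marks end-of-stream for this toy example

def producer(events):
    for event in events:
        topic.put(event)   # blocks if consumers fall behind (backpressure)
    topic.put(SENTINEL)

def consumer():
    while True:
        event = topic.get()
        if event is SENTINEL:
            break
        results.append({"reading": event, "doubled": event * 2})  # toy analytics

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer, args=(range(10),))
c.start()
p.start()
p.join()
c.join()
print(len(results), "events processed")
```

Because neither side knows about the other, producers and consumers can be scaled, restarted, or replaced independently, which is the property the abstract attributes to Kafka in the streaming case.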
This meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to shed light on the similarities and differences between the platforms. If you have questions about how the platforms compare, this is the meeting to attend!
Overcoming Barriers of Scaling Your Database - ScyllaDB
Scaling distributed databases successfully requires meeting myriad challenges from physical distribution of your data across on-premises locations, public cloud vendors, geographies and political entities to adopting technologies to overcome fundamental operational bottlenecks. Join ScyllaDB's Peter Corless, director of technical advocacy, as he interviews Moreno Garcia y Silva, head of solution architecture, about how to navigate both technical ecosystem and database architectural challenges for this next tech cycle.
Takeaways:
- Recognizing and classifying barriers to scaling
- Solutions to overcome scaling challenges
- Upfront planning and real-time response
Building a GPU-enabled OpenStack Cloud for HPC - Blair Bethwaite, Monash Univ... - OpenStack
Audience Level
Intermediate
Synopsis
M3 is the latest generation system of the MASSIVE project, an HPC facility specializing in characterization science (imaging and visualization). Using OpenStack as the compute provisioning layer, M3 is a hybrid HPC/cloud system, custom-integrated by Monash’s R@CMon Research Cloud team. Built to support Monash University’s next-gen high-throughput instrument processing requirements, M3 is split roughly half-and-half between GPU-accelerated and CPU-only nodes.
We’ll discuss the design and tech used to build this innovative platform as well as detailing approaches and challenges to building GPU-enabled and HPC clouds. We’ll also discuss some of the software and processing pipelines that this system supports and highlight the importance of tuning for these workloads.
Speaker Bio
Blair Bethwaite: Blair has worked in distributed computing at Monash University for 10 years, with OpenStack for half of that. Having served as team lead, architect, administrator, user, researcher, and occasional hacker, Blair’s unique perspective as a science power-user, developer, and system architect has helped guide the evolution of the research computing engine central to Monash’s 21st Century Microscope.
Lance Wilson: Lance is a mechanical engineer, who has been making tools to break things for the last 20 years. His career has moved through a number of engineering subdisciplines from manufacturing to bioengineering. Now he supports the national characterisation research community in Melbourne, Australia using OpenStack to create HPC systems solving problems too large for your laptop.
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
IoT Architectural Overview - 3 use case studies from InfluxData - InfluxData
This SlideShare reviews how an IoT Data platform fits in with any IoT architecture to manage the data requirements of every IoT implementation. It is based on the learnings of existing IoT practitioners that have adopted an IoT Data platform using InfluxData. These clients have a range of solutions, from home automation (thermostat monitoring & management) to infrastructure management (solar panel monitoring and control), manufacturing (equipment monitoring & control), and environmental management (green wall monitoring & control).
These learnings will help IoT adopters avoid the common pitfalls current clients faced on their journey to developing their IoT solution.
This document introduces HyperStore's "forever live" storage solution. It aims to provide a storage platform that allows for faster innovation and upgrade cycles without downtime or data migration. The FL3000 platform is a modular, high-density storage system that separates compute and storage into interchangeable modules. This extreme modularity allows components to be replaced or upgraded with zero downtime. The design goals focus on high density to reduce data center footprint, hot-swappable components for efficiency, minimizing failure domains, and reducing power and cooling costs. The "forever live" experience provides annual new features and qualified hardware upgrades without requiring downtime or data migration by deploying once and swapping modules as needed on the modular, software-defined platform.
Steering the Sea Monster - Integrating Scylla with Kubernetes - ScyllaDB
Kubernetes is a declarative system for automatically deploying, managing, and scaling server-side applications and their dependencies. In this webinar, we will introduce Kubernetes at a high level and demonstrate how to get started using Scylla with Kubernetes and Google Compute Engine.
Join us to:
Understand the principles of Kubernetes and how it solves common problems of deploying distributed applications
Explore an example configuration of Scylla with Kubernetes that can serve as a starting point for your own system.
Get insight into the performance characteristics of Scylla when it is run in a container (e.g. Docker) and deployed via Kubernetes.
The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat - OpenStack
Audience: Intermediate
About: Learn how cloud storage differs to traditional storage systems and how that delivers revolutionary benefits.
Starting with an overview of how Ceph integrates tightly into OpenStack, you’ll see why 62% of OpenStack users choose Ceph. We’ll then take a peek into the very near future to see how rapidly Ceph is advancing and how you’ll be able to achieve all your childhood hopes and dreams in ways you never thought possible.
Speaker Bio: Andrew Hatfield – Practice Lead–Cloud Storage and Big Data, Red Hat
Andrew has over 20 years experience in the IT industry across APAC, specialising in Databases, Directory Systems, Groupware, Virtualisation and Storage for Enterprise and Government organisations. When not helping customers slash costs and increase agility by moving to the software-defined storage future, he’s enjoying the subtle tones of Islay Whisky and shredding pow pow on the world’s best snowboard resorts.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
The Last Pickle: Distributed Tracing from Application to Database - DataStax Academy
Monitoring provides information on overall system performance; however, tracing is necessary to understand the performance of individual requests. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems, although knowing which queries to trace, and why the application makes them, still requires deep technical knowledge. By merging application tracing via Zipkin with Cassandra query tracing, we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
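The core mechanic of passing one trace identifier through every layer of a request can be sketched with Python's `contextvars`. This is a hypothetical illustration of the propagation pattern, not Zipkin's or Cassandra's actual instrumentation; the function names and the in-memory `captured` log are made up:

```python
import contextvars
import uuid

# One context variable carries the trace id across the layers of a request,
# mimicking how a trace id would be forwarded down to the database tier.
current_trace = contextvars.ContextVar("trace_id", default=None)
captured = []  # stands in for spans reported to a tracing backend

def handle_request(query: str):
    """Edge of the system: start a new trace, then do the work."""
    token = current_trace.set(uuid.uuid4().hex[:16])
    try:
        run_query(query)
    finally:
        current_trace.reset(token)

def run_query(query: str):
    # The "database driver" reads the trace id from context rather than
    # from an explicit argument, so intermediate layers need no changes.
    captured.append({"trace_id": current_trace.get(), "query": query})

handle_request("SELECT * FROM users")
handle_request("SELECT * FROM orders")
print(captured[0]["trace_id"] != captured[1]["trace_id"])  # separate traces
```

Joining application spans and database query traces on that shared identifier is what yields the single end-to-end tracing view the talk describes.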
OpenStack and Red Hat: How we learned to adapt with our customers in a maturi... - OpenStack
Audience Level
All levels
Synopsis
Peter has been involved in the OpenStack community since its B release, and he has been enabling and helping customers across various industries adopt OpenStack in strategic ways. In this session, you will learn from his experience what Red Hat’s perspective is on the current state of affairs in the OpenStack community and the path ahead that Red Hat is putting its efforts into. OpenStack is not a product that tries to solve any one business problem in particular, but a technology that aims to be usable by many, so what are the required steps to make sure that your organisation is ready for OpenStack-based cloudification and transformation?
Speaker Bio:
Peter Jung is a Senior Business Development Manager at Red Hat where he leads the practice in the areas of Cloud, SDN/NFV and IoT across Australia and New Zealand. He is passionate about open innovation and open source software development model as the foundation for next generation society and ICT systems. Prior to Red Hat, he had various roles at Cisco and Dell for 15 years. He holds a BSEE and an MBA.
OpenStack Australia Day Melbourne 2017
https://events.aptira.com/openstack-australia-day-melbourne-2017/
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard - InfluxData
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre... - InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
DataStax recently announced the general availability of DataStax Enterprise 4.7 (DSE 4.7), the leading database platform purpose-built for the performance and availability demands of web, mobile, and IOT applications. In this product launch webinar, Robin Schumacher, VP of Products, explores the wide range of enhancements in DSE 4.7 including enterprise class search, analytics, and in-memory.
Zabbix was experiencing performance issues due to large history tables in the database. To address this, the architecture was changed to store history data in Elasticsearch instead of database tables. This improved scalability and performance. The basic item and event data remained in the MariaDB database cluster. Zabbix proxies were also used to distribute load across multiple network segments. With this new architecture, history data is indexed in Elasticsearch without database tables, improving query speed and reducing database size.
Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy
n this talk we will address how we developed our Cassandra environments utilizing Cisco UCS Open Stack Platform with the DataStax Enterprise Edition software. In addition we are utilizing OpenSource CEPH storage in our Infrastructure to optimize the Performance and reduce the costs.
How to Protect Big Data in a Containerized EnvironmentBlueData, Inc.
Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has recently rolled out Transparent Data Encryption (TDE) support in HDFS. Transparent Data Encryption refers to the process whereby data is transparently encrypted by the big data application writing the data; it is not decrypted again until it is accessed by another application. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by a processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE) which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform providing a developer friendly deployment experience. He has 15+ years of experience, working in variety of roles such as software engineer, project manager and product manager. Ravi received a Masters degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
Reporting from the Trenches: Intuit & CassandraDataStax
Rekha Joshi presents on how Intuit uses the Cassandra database to enable personalized A/B testing and improve customer experiences. Intuit handles large volumes of customer data and required a database with high security, scalability, availability and tunable performance. Cassandra met these requirements and became Intuit's standard NoSQL database. Rekha discusses how Intuit leverages Cassandra's capabilities and provides best practices for effective Cassandra usage, configuration, and performance tuning.
Webinar: How to Build a Highly Available Time Series Solution with KairosDB – Julia Angell
A highly available time-series solution requires an efficient tailored front-end framework and a backend database with a fast ingestion rate. In this webinar, you'll learn the steps for building an efficient TSDB solution with Scylla and KairosDB, get real-world use cases and metrics, plus considerations when choosing time series solutions.
Why you need benchmarks
Finding the right database solution for your use case can be an arduous journey. The database deployment touches aspects of throughput performance, latency control, high availability and data resilience.
You will need to decide on the infrastructure to use: Cloud, on-premise or a hybrid solution.
Data models also have an impact on finding the right fit for the use case. Once you establish a requirements set, the next step is to test your use case against the databases of choice.
In this workshop, we will discuss the different data points you need to collect in order to get the most realistic testing environment.
We will cover:
Data model impact on performance and latency
Client behavior related to database capabilities
Failover and high availability testing
Hardware selection and cluster configuration impact
We will show 2 benchmarking tools you can use to test and benchmark your clusters to identify the optimal deployment scenario for your use case.
Attend this virtual workshop if you are:
Looking to minimize the cost of your database deployment
Making a database decision based on performance and scale data
Planning to emulate your workload on a pre-production system where you can test, fail fast and learn.
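The latency data points such a benchmark collects are usually compared as percentiles across candidate deployments. A minimal Python sketch of that summarization step (nearest-rank method; the helper name is ours, not a workshop tool):

```python
# Hypothetical helper: summarize raw latency samples (ms) into the
# percentiles usually compared across candidate deployments.
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    s = sorted(samples)
    rank = max(1, int(round(p / 100 * len(s))))
    return s[rank - 1]

latencies_ms = [2, 3, 3, 4, 5, 7, 9, 12, 30, 95]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {50: 5, 95: 95, 99: 95}
```

Note how a single slow outlier dominates p95/p99 while leaving the median untouched, which is why tail percentiles, not averages, drive most failover and capacity decisions.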
Data Pipelines with Spark & DataStax Enterprise – DataStax
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
This meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to bring light to similarities and differences between platforms. If you have questions about how our platforms compare this is the meeting to attend!
Overcoming Barriers of Scaling Your Database – ScyllaDB
Scaling distributed databases successfully requires meeting myriad challenges from physical distribution of your data across on-premises locations, public cloud vendors, geographies and political entities to adopting technologies to overcome fundamental operational bottlenecks. Join ScyllaDB's Peter Corless, director of technical advocacy, as he interviews Moreno Garcia y Silva, head of solution architecture, about how to navigate both technical ecosystem and database architectural challenges for this next tech cycle.
Takeaways:
- Recognizing and classifying barriers to scaling
- Solutions to overcome scaling challenges
- Upfront planning and real-time response
Building a GPU-enabled OpenStack Cloud for HPC - Blair Bethwaite, Monash University – OpenStack
Audience Level
Intermediate
Synopsis
M3 is the latest generation system of the MASSIVE project, an HPC facility specializing in characterization science (imaging and visualization). Using OpenStack as the compute provisioning layer, M3 is a hybrid HPC/cloud system, custom-integrated by Monash’s R@CMon Research Cloud team. Built to support Monash University’s next-gen high-throughput instrument processing requirements, M3 is roughly half GPU-accelerated and half CPU-only.
We’ll discuss the design and tech used to build this innovative platform as well as detailing approaches and challenges to building GPU-enabled and HPC clouds. We’ll also discuss some of the software and processing pipelines that this system supports and highlight the importance of tuning for these workloads.
Speaker Bio
Blair Bethwaite: Blair has worked in distributed computing at Monash University for 10 years, with OpenStack for half of that. Having served as team lead, architect, administrator, user, researcher, and occasional hacker, Blair’s unique perspective as a science power-user, developer, and system architect has helped guide the evolution of the research computing engine central to Monash’s 21st Century Microscope.
Lance Wilson: Lance is a mechanical engineer, who has been making tools to break things for the last 20 years. His career has moved through a number of engineering subdisciplines from manufacturing to bioengineering. Now he supports the national characterisation research community in Melbourne, Australia using OpenStack to create HPC systems solving problems too large for your laptop.
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
IoT Architectural Overview - 3 use case studies from InfluxData – InfluxData
This SlideShare reviews how an IoT Data platform fits in with any IoT Architecture to manage the data requirements of every IoT implementation. It is based on the learnings from existing IoT practitioners that have adopted an IoT Data platform using InfluxData. These clients have a range of solutions–from home automation (thermostat monitoring & management), to infrastructure management (solar panel monitoring and control) to manufacturing (equipment monitoring & control) as well as environmental management (green wall monitoring & control).
These learnings will help IoT adopters avoid the common pitfalls current clients faced on their journey to developing their IoT solution.
This document introduces HyperStore's "forever live" storage solution. It aims to provide a storage platform that allows for faster innovation and upgrade cycles without downtime or data migration. The FL3000 platform is a modular, high-density storage system that separates compute and storage into interchangeable modules. This extreme modularity allows components to be replaced or upgraded with zero downtime. The design goals focus on high density to reduce data center footprint, hot-swappable components for efficiency, minimizing failure domains, and reducing power and cooling costs. The "forever live" experience provides annual new features and qualified hardware upgrades without requiring downtime or data migration by deploying once and swapping modules as needed on the modular, software-defined platform.
Steering the Sea Monster - Integrating Scylla with Kubernetes – ScyllaDB
Kubernetes is a declarative system for automatically deploying, managing, and scaling server-side applications and their dependencies. In this webinar, we will introduce Kubernetes at a high level and demonstrate how to get started using Scylla with Kubernetes and Google Compute Engine.
Join us to:
Understand the principles of Kubernetes and how it solves common problems of deploying distributed applications
Explore an example configuration of Scylla with Kubernetes that can serve as a starting point for your own system.
Get insight into the performance characteristics of Scylla when it is run in a container (e.g. Docker) and deployed via Kubernetes.
The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat – OpenStack
Audience: Intermediate
About: Learn how cloud storage differs to traditional storage systems and how that delivers revolutionary benefits.
Starting with an overview of how Ceph integrates tightly into OpenStack, you’ll see why 62% of OpenStack users choose Ceph. We’ll then take a peek into the very near future to see how rapidly Ceph is advancing and how you’ll be able to achieve all your childhood hopes and dreams in ways you never thought possible.
Speaker Bio: Andrew Hatfield – Practice Lead–Cloud Storage and Big Data, Red Hat
Andrew has over 20 years experience in the IT industry across APAC, specialising in Databases, Directory Systems, Groupware, Virtualisation and Storage for Enterprise and Government organisations. When not helping customers slash costs and increase agility by moving to the software-defined storage future, he’s enjoying the subtle tones of Islay Whisky and shredding pow pow on the world’s best snowboard resorts.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
The Last Pickle: Distributed Tracing from Application to Database – DataStax Academy
Monitoring provides information on system performance, however tracing is necessary to understand individual request performance. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems. Although knowing what queries to trace and why the application makes them still requires deep technical knowledge. By merging Application tracing via Zipkin and Cassandra query tracing we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
OpenStack and Red Hat: How we learned to adapt with our customers in a maturi… – OpenStack
Audience Level
All levels
Synopsis
Peter has been involved in the OpenStack community since its B-release, and he has been enabling and helping customers across various industries adopt OpenStack in strategic ways. In this session, you will learn from his experience what Red Hat’s perspective is on the current state of affairs in the OpenStack community and the path ahead that Red Hat is putting its efforts into. OpenStack is not a product that tries to solve any one business problem in particular, but a technology that aims to be usable for many; this session examines the steps required to make sure your organisation is ready for OpenStack-based cloudification and transformation.
Speaker Bio:
Peter Jung is a Senior Business Development Manager at Red Hat, where he leads the practice in the areas of Cloud, SDN/NFV and IoT across Australia and New Zealand. He is passionate about open innovation and the open source software development model as the foundation for next-generation society and ICT systems. Prior to Red Hat, he held various roles at Cisco and Dell over 15 years. He holds a BSEE and an MBA.
OpenStack Australia Day Melbourne 2017
https://events.aptira.com/openstack-australia-day-melbourne-2017/
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard – InfluxData
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre… – InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingest and query data in InfluxDB
This document provides an overview of IBM's Internet of Things (IoT) architecture and capabilities. It discusses the key components of an IoT architecture including intelligent gateways, sensor analytics zones, and the deep analytics zone in the cloud. It describes how gateways can help IoT solutions by reducing cloud costs and latency through local analytics and filtering of sensor data. The document then outlines the requirements for databases in gateways, and explains how IBM's Informix database is well-suited to meet these requirements through its small footprint, low memory usage, support for time series and spatial data, and ability to ingest and analyze sensor data in real-time. Finally, it discusses how Informix can be used both in gateways and in the cloud.
Webinar: Faster Log Indexing with Fusion – Lucidworks
The document discusses Lucidworks Fusion, a log analytics platform that combines Apache Solr, Logstash, and Kibana. It describes how Fusion uses a time-based partitioning scheme to index logs into daily collections with hourly shards for query performance. It also discusses using transient collections to handle high volume indexing into multiple shards to avoid bottlenecks. The document provides details on schema design considerations, moving old data to cheaper storage, and GC tuning for Solr deployments handling large-scale log analytics.
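The time-based partitioning idea above can be sketched in a few lines of Python. The collection and shard naming is illustrative only, not Fusion's actual scheme: each log event is routed to a daily collection and an hourly shard within it, so queries over a time window touch only the relevant partitions.

```python
from datetime import datetime, timezone

# Hedged sketch of time-based log partitioning (naming is hypothetical):
# route an event timestamp to a daily collection and an hourly shard.
def route_log_event(epoch_seconds):
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    collection = f"logs_{dt:%Y_%m_%d}"   # one collection per day
    shard = f"hour_{dt.hour:02d}"        # one shard per hour within it
    return collection, shard

print(route_log_event(0))  # ('logs_1970_01_01', 'hour_00')
```

Old daily collections can then be aged out wholesale (moved to cheaper storage or dropped), which is far cheaper than deleting individual documents from one monolithic index.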
Leveraging Cassandra for real-time multi-datacenter public cloud analytics – Julien Anguenot
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
IBM IoT Architecture and Capabilities at the Edge and Cloud – Pradeep Natarajan
IBM Informix is presented as the ideal database solution for IoT architectures due to its small footprint, low memory requirements, support for time series and spatial data, and driverless operation requiring no administration. It can run on gateways to filter and analyze sensor data locally before transmitting to the cloud. In the cloud, Informix can ingest streaming data in real-time, perform operational analytics, and scale out across servers. Benchmarks show Informix outperforming SQLite for IoT workloads in areas like data loading speed, storage requirements, and analytic query speeds.
Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.
This document provides an introduction to time series data and InfluxDB. It defines time series data as measurements taken from the same source over time that can be plotted on a graph with one axis being time. Examples of time series data include weather, stock prices, and server metrics. Time series databases like InfluxDB are optimized for storing and processing huge volumes of time series data in a high performance manner. InfluxDB uses a simple data model where points consist of measurements, tags, fields, and timestamps.
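That point model maps directly onto InfluxDB's line protocol, where each point is written as measurement, tags, fields, and timestamp on a single line. A minimal serialization sketch (the helper function is ours, for illustration; real clients also escape special characters):

```python
# Sketch of InfluxDB's data model: a point is a measurement name,
# key/value tags, key/value fields, and a nanosecond timestamp,
# serialized as one line of line protocol.
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage_idle": 98.2},
    1609459200000000000,
)
print(line)
# cpu,host=server01,region=us-west usage_idle=98.2 1609459200000000000
```

Tags are indexed metadata used for filtering and grouping, while fields hold the actual measured values, which is why high-cardinality data belongs in fields rather than tags.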
This document discusses using Apache Geode and ActiveMQ Artemis to build a scalable IoT platform. It introduces IoT and the MQTT protocol. ActiveMQ Artemis is described as a high performance message broker that is embeddable and supports clustering. Geode is presented as a distributed in-memory data platform for building data-intensive applications that require high performance, scalability, and availability. Example users of Geode include large companies handling billions of records and thousands of transactions per second. Key capabilities of Geode like regions, functions, querying, and continuous queries are summarized.
ZooKeeper is a distributed coordination service that provides naming, configuration, synchronization, and group services. It allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers called znodes, organized like a file system and used to store configuration data and other application-defined metadata. ZooKeeper follows a leader-elected consensus protocol to guarantee atomic broadcast of state updates from the leader to followers, and on this foundation provides services like leader election, group membership, synchronization, and configuration management that are essential for distributed systems.
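One of those recipes, leader election, is commonly built on ephemeral sequential znodes: each candidate creates a numbered znode under an election parent, and the candidate holding the lowest sequence number is the leader. An in-memory sketch of the idea (a mock class, not a real client library such as kazoo):

```python
# Mock stand-in for a ZooKeeper client, illustrating leader election
# with sequential znodes: lowest sequence number wins. A real system
# would use ephemeral znodes so a crashed leader's node disappears.
class MockZk:
    def __init__(self):
        self.counter = 0
        self.znodes = {}

    def create_sequential(self, prefix, data):
        # ZooKeeper appends a monotonically increasing 10-digit suffix.
        path = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.znodes[path] = data
        return path

zk = MockZk()
me = zk.create_sequential("/election/node_", b"candidate-a")
zk.create_sequential("/election/node_", b"candidate-b")
leader = min(zk.znodes)  # lexicographically lowest path = lowest sequence
print(leader == me)      # True: the first registrant leads
```

In the real recipe each non-leader watches only its immediate predecessor znode, avoiding a "herd effect" when the leader fails.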
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement – VMware Tanzu
This document provides an agenda for a hands-on introduction and hackathon kickoff for Apache Geode. The agenda includes details about the hackathon, an introduction to Apache Geode including its history and key features, a hands-on lab to build, run, and use Geode, and a Q&A session. It also outlines how to contribute to the Geode project through code, documentation, issue tracking, and mailing lists.
This document provides an overview of IBM's Internet of Things architecture and capabilities. It discusses how IBM's Informix database can be used in intelligent gateways and the cloud for IoT solutions. Specifically, it outlines how Informix is well-suited for gateway and cloud environments due to its small footprint, support for time series and spatial data, and ability to handle both structured and unstructured data. The document also provides examples of how Informix can be used with Node-RED and Docker to develop IoT applications and deploy databases in the cloud.
Taking Splunk to the Next Level - Architecture Breakout Session – Splunk
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
This document outlines the agenda for a training on Oracle RDBMS 12c new features. The training will cover 6 chapters: introduction, multitenant architecture, upgrade features, Flex Cluster, Global Data Service, and an overview of RDBMS features. The agenda provides a high-level overview of topics to be discussed in each chapter, including multitenant architecture concepts, upgrade options and tools, Flex Cluster configurations, Global Data Service components, and new features such as temporary undo and multiple indexes on the same columns.
The document provides an introduction to the ELK stack for log analysis and visualization. It discusses why large data tools are needed for network traffic and log analysis. It then describes the components of the ELK stack - Elasticsearch for storage and search, Logstash for data collection and parsing, and Kibana for visualization. Several use cases are presented, including how Cisco and Yale use the ELK stack for security monitoring and analyzing biomedical research data.
Similar to Paul Dix [InfluxData] | InfluxDays Opening Keynote | InfluxDays Virtual Experience NA 2020
InfluxData is excited to announce InfluxDB Clustered, the self-managed version of InfluxDB 3.0 with unparalleled flexibility, speed, performance, and scale. The evolution of InfluxDB Enterprise, InfluxDB Clustered is delivered as a collection of Kubernetes-based containers and services, which enables you to run and operate InfluxDB 3.0 where you need it, whether that's on-premises or in a private cloud environment. With this new enterprise offering, we’re excited to provide our customers with real-time queries, low-cost object storage, unlimited cardinality, and SQL language support – all with improved data access, support, and security! The newest version of InfluxDB was built on Apache Arrow, and through the open source ecosystem and integrations, extends the value of your time-stamped data.
Join this webinar to learn more about InfluxDB Clustered, and how to manage your large mission-critical workloads in the highly available database service offering!
In this webinar, Balaji Palani and Gunnar Aasen will dive into:
Key features of the new InfluxDB Clustered solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Best Practices for Leveraging the Apache Arrow Ecosystem – InfluxData
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. It enables more efficient analytics workloads for modern CPU and GPU hardware, which makes working with large data sets easier and cheaper.
InfluxData and Dremio are both members of the Apache Software Foundation (ASF). Dremio is a data lakehouse management service known for its scalability and capacity for direct querying across diverse data sources. InfluxDB is the purpose-built time series database, and InfluxDB 3.0 has a new columnar storage engine and uses the Arrow format for representing data and moving data to and from Parquet. Discover how InfluxDB and Dremio have advanced their solutions by relying on the Apache Arrow framework.
Join this live panel as Alex Merced and Anais Dotis-Georgiou dive into:
Advantages to utilizing the Apache Arrow ecosystem
Tips and tricks for implementing the columnar data structure
How developers can best utilize the ASF to innovate and contribute to new industry standards
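The row-versus-columnar distinction at the heart of Arrow can be illustrated in plain Python. This mimics the memory layout only, not the Arrow API itself: a columnar layout stores each field's values contiguously, so an analytic scan over one column touches only that column's data.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"time": 1, "temp": 20.1},
    {"time": 2, "temp": 20.4},
    {"time": 3, "temp": 19.8},
]

# Column-oriented layout (the shape Arrow standardizes in memory):
# each field's values sit contiguously, which is friendlier to CPU
# caches and SIMD for whole-column aggregations.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A column aggregation reads one contiguous list, not every record.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

Because the format is standardized, systems like InfluxDB 3.0 and Dremio can hand such columns to each other (and to Parquet) without serialization round-trips.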
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu… – InfluxData
Bevi are the creators of smart water dispensers which empower people to choose their desired beverage — flat or sparkling, their desired flavor and temperature. Since 2014, Bevi users have saved more than 350 million bottles and cans. Their "smart" water coolers have prevented the extraction of 1.4 trillion oz of oil from Earth and have saved 21.7 billion grams of CO2 from the atmosphere.
Discover how Bevi uses a time series database to enable better predictive maintenance and alerting of their entire ecosystem — including the hardware and software. They are using InfluxDB to collect sensor data in real-time remotely from their internet-connected machines about their status and activity — i.e., flavor and CO2 levels, water temp, filter status, etc. They are using these metrics to improve their customer experience and continuously improve their sustainability practices. Gain tips and tricks on how to best utilize InfluxDB's schema-less design.
Join this webinar as Spencer Gagnon dives into:
Bevi's approach to reducing organizations' carbon footprint — they are saving 50K+ bottles and cans annually
Their entire system architecture — including InfluxDB Cloud, Grafana, Kafka, and DigitalOcean
The importance of using time-stamped data to extend the life of their machines
Power Your Predictive Analytics with InfluxDB – InfluxData
If you're using InfluxDB to store and manage your time series data, you're already off to a great start. But why stop there? In our upcoming webinar, we'll show you how to take your data analysis to the next level by building predictive analytics using a variety of tools and techniques.
We will demonstrate how to use Quix to create custom dashboards and visualizations that allow you to monitor your data in real-time. We'll also introduce you to Hugging Face, a powerful tool for building models that can predict future trends and identify anomalies. With these tools at your disposal, you'll be able to extract valuable insights from your data and make more informed decisions about the future. Don't miss out on this opportunity to improve your data analysis skills and take your business to the next level!
What you will learn:
Use InfluxDB to store and manage time series data
Utilize Quix and Hugging Face to build models, visualize trends, and identify anomalies
Extract valuable insights from your data
Improve your data analysis skills to make informed decisions
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base – InfluxData
Are you considering replacing your legacy data historian and moving your OT data to the cloud? Join this technical webinar to learn how to adopt InfluxDB and IO-Base - a digital platform used to improve operational efficiencies!
Teréga Solutions are the creators of digital solutions used to improve energy efficiencies and to address decarbonization challenges. Their network includes 5,000+ km of gas pipelines within France; they aim to help France attain carbon neutrality by 2050. With these impressive goals in mind, Teréga has created IO-Base — the digital platform to improve industrial performance, and increase profitability. Creating digital twins for their clients allows them to collect data from all production sites and view it in real time, from anywhere and at any time.
Discover how Teréga uses InfluxDB, Docker, and AWS to monitor its gas and hydrogen pipeline infrastructure. They chose to replace their legacy data historian with InfluxDB — the purpose-built time series database. They are collecting more than 100K different metrics at frequencies ranging from every 5 seconds to every 1-2 minutes. They have reduced overall IT spend by 50% and collect 2x the amount of data at 20x the frequency! By using various industrial protocols (Modbus, OPC-UA, etc.), Teréga improved output, reduced TCO, and is now able to create added-value services: forecasting, monitoring, and predictive maintenance.
Join this webinar as Thomas Delquié dives into:
Teréga's approach to modernizing fossil fuel pipelines IT systems while improving yields and safety
Their centralized methodology to collecting sensor, hardware, and network metrics
The importance of time series data and why they chose InfluxDB
Build an Edge-to-Cloud Solution with the MING Stack – InfluxData
FlowForge enables organizations to reliably deliver Node-RED applications in a continuous, collaborative, and secure manner. Node-RED is the popular, low-code programming solution that makes it easy to connect different services using a visual programming environment. InfluxData is the creator of InfluxDB, the purpose-built time series database run by developers at scale and in any environment in the cloud, on-premises, or at the edge.
Jump-start monitoring your industrial IoT devices and discover how to build an edge-to-cloud solution with the MING stack. The MING stack includes Mosquitto/MQTT, InfluxDB, Node-RED, and Grafana. This solution can be used to improve fleet management, enable predictive maintenance of industrial machines and power generation equipment (i.e. turbines and generators) and increase safety practices (i.e. buildings, construction sites). Join this webinar to learn best practices from industrial IoT SMEs.
In this webinar, Robert Marcer and Jay Clifford dive into:
Best practices for monitoring sensor data collected by everyone — from the edge to the factory
Tips and tricks for using Node-RED and InfluxDB together
Demo — see Node-RED and InfluxDB live
Meet the Founders: An Open Discussion About Rewriting Using Rust – InfluxData
The document is an agenda for a discussion between the CTO and founder of Ockam, Mrinal Wadhwa, and the CTO and founder of InfluxData, Paul Dix, about rewriting products using the Rust programming language. It includes an introduction of the founders, an overview of the discussion topics like why they decided to rewrite in Rust and the challenges they faced, how they got their engineers comfortable with Rust, tips they learned in the process, benefits gained from moving to Rust, and how their communities responded to the switch.
InfluxData is excited to announce the general availability of InfluxDB Cloud Dedicated! It is a fully managed time series database service running on cloud infrastructure resources that are dedicated to a single tenant. With this new offering, we’re excited to provide our customers with additional security options, and more custom configuration options to best suit customers’ workload requirements. Join this webinar to learn more about InfluxDB Cloud, and the new dedicated database service offering!
In this webinar, Balaji Palani and Gary Fowler will dive into:
Key features of the new InfluxDB Cloud Dedicated solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Gain Better Observability with OpenTelemetry and InfluxDB – InfluxData
Many developers and DevOps engineers have become aware of using their observability data to gain greater insights into their infrastructure systems. InfluxDB is the purpose-built time series database used to collect metrics and gain observability into apps, servers, containers, and networks. Developers use InfluxDB to improve the quality and efficiency of their CI/CD pipelines. Start using InfluxDB to aggregate infrastructure and application performance monitoring metrics to enable better anomaly detection, root-cause analysis, and alerting.
This session will demonstrate how to record metrics, logs, and traces with one library — OpenTelemetry — and store them in one open source time series database — InfluxDB. Zoe will demonstrate how easy it is to set up the OpenTelemetry Operator for Kubernetes and to store and analyze your data in InfluxDB.
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...InfluxData
American Metal Processing Company ("AMP") is the US' largest commercial rotary heat treat facility with customers in the automotive, construction, military, and agriculture industries. They use their atmosphere-protected rotary retort furnaces to provide their clients with three primary hardening services: neutral hardening (quench and temper), carburizing, and carbonitriding.
This furnace style ensures a consistent, uniform heat treatment process compared to traditional batch- or belt-style furnaces; excels at processing high volumes of smaller parts with tight tolerances; and improves the strength and toughness of plain carbon steels. Discover why AMP’s use of Telegraf, InfluxDB, Node-RED, and Grafana allows them to gain 24/7 insights into their plant operations and metallurgical results. Learn how they use time-stamped data to gain accurate metrics about their consumables usage, furnace profiles, and machine status.
Join this webinar as Grant Pinkos dives into:
American Metal Processing's approach to heat treating in a digitized environment through connected systems
Their approach to collecting and measuring sensor data to enable predictive maintenance and improve product quality
Why they need a time series database for managing and analyzing vast amounts of time-stamped data
How Delft University's Engineering Students Make Their EV Formula-Style Race ...InfluxData
Delft University is the oldest and largest technical university in the Netherlands with 25,000+ students. Since 1999, they have had a team of students (undergraduate and graduate) designing, building, and racing cars, as part of the Formula Student worldwide competition. The competition has grown to include teams from 1K+ universities in 20+ countries. Students are responsible for all aspects of car manufacturing (research, construction, testing, developing, marketing, management, and fundraising). Delft University's team includes 90 students across disciplines.
Discover how Delft University's team uses Marple and InfluxDB to collect telemetry and sensor metrics while they develop, test, and race their electric cars. They collect sensor data about their EV's control systems using a time series platform. During races, they collect IoT data about their batteries, accelerometer, gyroscope, tires, etc. The engineers are able to share important car stats during races, which helps the drivers tweak their driving decisions, all with the goal of winning. After races, the entire team is able to analyze the data in Marple to understand what to do better next time. By using Marple + InfluxDB, the team is able to collect, share, and analyze high-frequency car data used to make their car faster at competitions.
Join this webinar as Robbin Baauw and Nero Vanbiervliet dive into:
Marple's approach to empowering engineers to organize, analyze, and visualize their data
Delft University's collaborative methodology to building and racing their Formula-style race car
How InfluxDB is crucial to their collaborative engineering and racing process
Introducing InfluxDB’s New Time Series Database Storage EngineInfluxData
InfluxData is excited to announce the general availability of InfluxDB Cloud's new storage engine! It is a cloud-native, real-time, columnar database optimized for time series data. InfluxDB's rebuilt core was coded in Rust and sits on top of Apache Arrow and DataFusion. InfluxData's team picked Apache Parquet as the persistent format. In this webinar, Paul Dix and Balaji Palani will demonstrate key product features including the removal of cardinality limits!
They will dive into:
The next phase of the InfluxDB platform
How using Apache Arrow's ecosystem has improved InfluxDB's performance and scalability
Key features of InfluxDB Cloud's new core — including SQL native support
Start Automating InfluxDB Deployments at the Edge with balena InfluxData
balena.io helps companies develop, deploy, update, and manage IoT devices. By using Linux containers and other cloud technologies, balena enables teams to quickly and easily build fleets of connected devices. Developers are able to use containers with the language of choice and pull IoT sensor data from 70+ different single board computers into balenaCloud. Discover how to use balena.io to automate your InfluxDB deployments at the edge!
During this one-hour session, experts from balena and InfluxData will demonstrate how to build and deploy your own air quality IoT solution. You will learn:
The fundamentals of IoT sensor deployment and management using balena.
How to use a time series platform to collect and visualize metrics from edge devices.
Tips and tricks to using balenaCloud to automate InfluxDB deployments and Telegraf configurations.
How to use InfluxDB's Edge Data Replication feature to collect sensor data and push it to InfluxDB Cloud for analysis.
No coding experience required, just a curiosity to start your own IoT adventure.
Understanding InfluxDB’s New Storage EngineInfluxData
Learn more about InfluxDB’s new storage engine! The team developed a cloud-native, real-time, columnar database optimized for time series data. We built it all in Rust and it sits on top of Apache Arrow and DataFusion. We chose Apache Parquet as the persistent format, which is an open source columnar data file format. This new storage engine provides InfluxDB Cloud users with new functionality, including the removal of cardinality limits, so developers can bring in massive amounts of time series data at scale.
In this webinar, Anais Dotis-Georgiou will dive into:
Requirements for rebuilding InfluxDB’s core
Key product features and timeline
How Apache Arrow’s ecosystem is used to meet those requirements
Stick around for a demo and live Q&A
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBInfluxData
RudderStack — the creators of the leading open source Customer Data Platform (CDP) — needed a scalable way to collect and store metrics related to customer events and processing times (down to the nanosecond). They provide their clients with data pipelines that simplify data collection from applications, websites, and SaaS platforms. RudderStack's solution enables clients to stream customer data in real time — they quickly deploy flexible data pipelines that send the data to the customer's entire stack without engineering headaches. Customers are able to stream data from any tool using their 16+ SDKs, and they are able to transform the data in transit using JavaScript or Python. How does RudderStack use a time series platform to provide their customers with real-time analytics?
Join this webinar as Ryan McCrary dives into:
RudderStack's approach to streamlining data pipelines with their 180+ out-of-the-box integrations
Their data architecture including Kapacitor for alerting and Grafana for customized dashboards
Why InfluxDB was crucial for fast data collection and for providing a single source of truth for their customers
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...InfluxData
Customers using ThingWorx and the Manufacturing Solutions often need to store property data longer than the Solutions' defaults allow. These customers are advised to use InfluxDB, and this presentation covers the key considerations for moving to InfluxDB vs. the standard ThingWorx value streams. Join this session as Ward highlights ThingWorx’s solution and its easy implementation process.
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022InfluxData
Two new features are coming to Flux that add flexibility and functionality to your data workflow: polymorphic labels and dynamic types. This session walks through these new features and shows how they work.
This document outlines the schedule for Day 2 of InfluxDays 2022, an event hosted by InfluxData. The schedule includes sessions on building developer experience, how developers like to work, an overview of the InfluxDB developer console and API, demos of client libraries and the InfluxDB v2 API, tips for getting involved in the InfluxDB community and university, use cases for networking monitoring, crypto/fintech, monitoring/observability, and IIoT, and closing thoughts. Recordings of all sessions will be made available to registered attendees by November 7th. Upcoming events include advanced Flux training in London and resources through the community forums, Slack channel, and online university.
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...InfluxData
This document contains the agenda for Day 2 of InfluxDays 2022, which includes:
- Welcome and introductory remarks from Zoe Steinkamp and Jay Clifford of InfluxData.
- Fireside chats and presentations on building great developer experiences, how developers like to work, and use cases for InfluxDB from companies like Tesla, InfluxData, and others.
- Sessions on the InfluxDB developer console, APIs, client libraries, getting involved in the community, accelerating time to awesome with InfluxDB University, and tips for analyzing IoT data with InfluxDB.
- Closing thoughts from Zoe Steinkamp and Jay Clifford.
The document summarizes the agenda and sessions for Day 1 of InfluxDays 2022. It includes sessions on InfluxDB data collection, scripting languages like Flux, the InfluxDB time series engine, tasks, storage, and a closing discussion. The agenda involves talks from InfluxData employees on building applications with real-time data, navigating the developer experience, solving problems, the InfluxDB platform, community, education, use cases in crypto/fintech and IIoT, and tips/tricks for analysis.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
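As background for readers unfamiliar with the terminology, here is a minimal single-threaded closed-addressing (chained) table in Rust. It illustrates the property the abstract highlights, that chaining can free a slot instantly on delete, but it has none of DLHT's lock-free, cache-line-bounded, or prefetching machinery; the `ChainedMap` type and its methods are illustrative names, not DLHT's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal closed-addressing (chained) hash table: each index slot
/// holds a small chain of entries, and a delete frees its slot
/// immediately, unlike tombstone-based open addressing.
struct ChainedMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
}

impl<K: Hash + Eq, V> ChainedMap<K, V> {
    fn new(n_buckets: usize) -> Self {
        Self { buckets: (0..n_buckets).map(|_| Vec::new()).collect() }
    }

    fn bucket(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: K, value: V) {
        let b = self.bucket(&key);
        if let Some(slot) = self.buckets[b].iter_mut().find(|(k, _)| *k == key) {
            slot.1 = value; // update in place
        } else {
            self.buckets[b].push((key, value));
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.buckets[self.bucket(key)]
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v)
    }

    fn remove(&mut self, key: &K) -> Option<V> {
        let b = self.bucket(key);
        let pos = self.buckets[b].iter().position(|(k, _)| k == key)?;
        Some(self.buckets[b].swap_remove(pos).1) // slot freed instantly
    }
}
```

DLHT additionally bounds each chain to a cache line and makes these operations lock-free, which is where the single-memory-access and non-blocking claims come from.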
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
23. Requirements
• What cardinality?
• Analytics performance
• Separate compute from storage and tiered storage
• Operator defined Replication & Partitioning
• Able to run without locally attached storage
• Bulk data import and export
• Subscriptions
• Federated by design
• Embeddable scripting
• Greater compatibility
50. In-memory Perf Preview (tracing example)
• env - production or staging environment
• data_centre - the region within a cloud vendor
• cluster - a specific cluster, e.g., a k8s cluster
• user_id - an id associated with the user that issued a request that was traced
• request_id - an id associated with a single request that started a trace
• trace_id - a single id associated with all spans in the trace
• node_id - the id of the compute node that the trace execution ran across
• pod_id - the id of containers that the trace execution ran across
• span_id - a random id for every sample generated in the trace
51. Test data cardinalities
104,998,932 rows
• env - 2
• data_centre - 20
• cluster - 200
• user_id - 200,000
• request_id - 2,000,000
• trace_id - 10,000,000
• node_id - 2,000
• pod_id - 20,000
• span_id - ∞ (a new one for each sample row)
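The pressure these numbers put on a per-series index can be seen with a little arithmetic: the worst-case series count is the product of the tags' distinct-value counts, and an unbounded tag like span_id pushes it past the row count itself. A small sketch (the helper name is mine, not anything from IOx):

```rust
/// Worst-case series count is the product of each tag's distinct-value
/// count. Observed cardinality is bounded by the actual row count
/// (~105M here), but a per-series index must be prepared for the product.
fn worst_case_series(tag_cardinalities: &[u128]) -> u128 {
    tag_cardinalities
        .iter()
        .fold(1u128, |acc, &c| acc.saturating_mul(c))
}
```

Even the three lowest-cardinality tags alone (2 x 20 x 200) give 8,000 potential series; adding user_id pushes the worst case to 1.6 billion before the truly unbounded tags are even considered.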
53. Find spans for a trace
SELECT * FROM "traces"
WHERE "trace_id" = '0000MjNg' AND
"time" >= '2020-10-30 15:12' AND
"time" < '2020-10-30 16:12';
54. Find spans for a trace
SELECT * FROM "traces"
WHERE "trace_id" = '0000MjNg' AND
"time" >= '2020-10-30 15:12' AND
"time" < '2020-10-30 16:12';
Returned in: 84.666665ms ~ 1.1B rows/sec
56. Flexible Replication Rules
• Synchronous & Asynchronous
• Push & Pull
• Request by request, batch, or bulk
• Partition to servers, groups of servers
• Total operator control via RESTful API
76. Get Involved
• Star & watch the repo at github.com/influxdata/influxdb_iox
• Find the InfluxDB IOx topic on community.influxdata.com
• Join the #influxdb_iox channel in our community Slack
• Join us on the 2nd Wednesday of every month at 8:30 AM Pacific Time for a tech talk on InfluxDB IOx - influxdata.com/community-showcase/influxdb-tech-talks/
• We’re hiring for Rust, distributed systems, and columnar databases expertise. Email recruiting@influxdata.com and CC me, paul@influxdata.com.
Editor's Notes
Today I want to talk to you about the future of InfluxDB. But before that, let’s talk about some of the big news!
InfluxDB 2.0 open source is now released! This represents a multi-year effort. With our cloud offering, our goal was to switch to a continuous, services-based, cloud-first delivery model that could be billed by usage, not by servers. This means that we ship production code every business day and make continuous incremental improvements, and our Cloud 2 customers only pay for what they use.
For our open source, we wanted to ship an all-in-one database, monitoring system, visualization engine, and scripting scheduler. Flux, our new scripting and query language was the center of this effort. With it, users can now do more than ever before within the database. They can even call out to third-party APIs to bring in more data, send data out, trigger action, or send alerts. This can happen at query time in ad-hoc queries, or scheduled through the Task scheduling system.
Our goal was to ship the same API in our cloud offering and in open source. We think you’ll love the open source InfluxDB 2.0 for local development and deployment at the edge or on single servers within your cloud or data center environment. Ryan Betts, our VP of Engineering will be covering more of the details in the talk right after mine.
For my talk, I wanted to tell you about what we’re thinking for The Future. I realize it may be early to start thinking about this with 2.0 open source just being released today, but I’m very excited about the work some of us have been doing and I want to share it publicly.
But before we get into the future, I need to talk about the past. Specifically, November 12th, 2013, which is the day I gave the first talk about InfluxDB and introduced it to the world. While the next 60 seconds will likely be review for all of you, I hope you’ll bear with me as I set the stage.
The talk was titled: Introducing InfluxDB, an open source distributed time series database.
In that talk I sought to define what I meant by time series data. I pointed to some specific examples.
Metrics were the first and most obvious example of time series, as that was what most people thought of when I talked about it.
I went further to give more examples of events. All things I thought could be analyzed, inspected, visualized and summarized as a time series.
I would later add sensor data to this list of time series examples.
And I talked about two different kinds of time series
More broadly, I claimed that all data you perform analytics on is time series data. Meaning, anytime you’re doing data analysis, you’re doing it either over time or as a snapshot in time.
I saw time series as a useful abstraction for solving problems and building applications in a number of different use cases.
The vision I laid out then is still one I have today, which is that InfluxDB should be useful for all kinds of time series data. It should also be the building block upon which future monitoring, analytics, sensor data, and time series applications can be built.
So where are we today? Some of what I’ll say is generally about the platform and some of it will be specific to open source.
Easy to write data in with libraries in many languages. Easy to query using either InfluxQL or Flux.
With the addition of Flux, there are so many more things that InfluxDB can do beyond what a normal declarative query language can provide. It’s great for analytics. However, the caveat exists that this is only true for lower-cardinality data. That is, you don’t have too many unique time series and your tag values don’t have too many unique values.
InfluxDB lacking distributed features in open source means that it is frequently not chosen as a building block for time series applications. This limitation is unfortunate, but at the time it was a necessary choice that enabled us to build a business to support our open source efforts. However, it definitely gets in the way of our broader platform vision. InfluxDB should be a platform that is adopted by a very wide audience, well beyond the audience of our paying customer base.
We want to push what’s possible with InfluxDB forward. Ideally for both our open source users and our paying customers.
No limits on cardinality. Write any kind of event data and don’t worry about what a tag or field is.
Best-in-class performance on analytics queries in addition to our already well-served metrics queries.
Tiered data storage. The DB should use cheaper object storage as its long-term durable store.
Operator control over memory usage. The operator should be able to define how much memory is used for each of buffering, caching, and query processing.
Operator-controlled replication. The operator should be able to set fine-grained replication rules on each server.
Operator-controlled partitioning. The operator should be able to define how data is split up amongst many servers and on a per-server basis.
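As a toy illustration of a partitioning rule (not IOx's actual scheme, whose whole point is that the operator defines the rule), a series key can be routed to a server by hashing it; the function name and signature here are hypothetical:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a series key to one of `n_servers` by hash. This is the
/// simplest possible policy; an operator-defined rule could instead
/// partition by tag value, time range, or any custom criterion.
fn assign_partition(series_key: &str, n_servers: u64) -> u64 {
    let mut h = DefaultHasher::new();
    series_key.hash(&mut h);
    h.finish() % n_servers
}
```

The fixed modulus makes this policy rigid; the design goal stated above is precisely to let the operator swap in per-server rules rather than bake one in.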
Operator control over topology including the ability to break up and decouple server tasks for write buffering and subscriptions, query processing, and sorting and indexing for long term storage.
Designed to run in an ephemeral containerized environment. That is, it should be able to run with no locally attached storage.
Bulk data export and import.
Fine-grained subscriptions for some or all of the data.
Broader ecosystem compatibility. Where possible, we should aim to use and embrace emerging standards in the data and analytics ecosystem.
Run at the edge and in the datacenter. Federated by design.
Embeddable scripting for in-process computation.
Not only does it expand the index; for cases like tracing, where you have new values all the time, the index becomes larger than the time series data itself.
One way around this is to use fields rather than tags, but that is a limiting choice since you don’t have control over how data is organized in the DB, and thus how you might want to organize it outside of the tag system.
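To make the tag-versus-field trade-off concrete, here is a small sketch that builds InfluxDB line protocol points. The `line` helper is a hypothetical name, and field values are passed pre-formatted (strings quoted, integers suffixed with `i`) to keep the sketch short:

```rust
/// Build an InfluxDB line protocol point. Tags are indexed (and inflate
/// the inverted index per distinct value); fields are not. Putting a
/// high-cardinality id such as span_id in a field sidesteps index
/// growth, at the cost of losing tag-based organization, which is the
/// limiting choice described above.
fn line(
    measurement: &str,
    tags: &[(&str, &str)],
    fields: &[(&str, String)],
    ts_ns: i64,
) -> String {
    let tag_part: String = tags
        .iter()
        .map(|(k, v)| format!(",{}={}", k, v))
        .collect();
    let field_part: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect();
    format!("{}{} {} {}", measurement, tag_part, field_part.join(","), ts_ns)
}
```

For example, a span with env and cluster as tags but span_id as a string field produces a point like `traces,env=production,cluster=c42 span_id="0000MjNg",duration_ns=123i 1604070720000000000`.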
In order to support high cardinality use cases, we’d need to ditch the inverted index and also our indexing by individual time series. As our VP of Engineering, Ryan Betts, says: InfluxDB over-indexes for these use cases.
InfluxDB uses memory-mapped files for the inverted index and for the time series data storage. Many modern databases have been built using mmap because it gives you speed of development and offloads memory management to the OS.
The downside is that you lose fine-grained control over how memory is used and allocated. Mmap has also proven tricky in containerized environments.
Finally, we want to be able to run with or without locally attached storage. The way that TSM and TSI organize data doesn’t lend itself well to having some data in object storage, some in memory, and some cached on local SSD.
Once I realized that a gradual refactor wasn’t possible, I started thinking about what it would look like to start new in 2020 rather than 2013. What tools exist today that weren’t at my disposal seven years ago? What other open source could I bring to bear that would speed this effort up?
So we’re building a new core for InfluxDB. And here’s the first thing to know about it.
This project is written in Rust. I’ve written about my excitement for the language before. I think Rust is the future of systems software. It gives us the fine grained control over memory that we’re looking for, but with the safety of a higher level language.
Even better, its model for programming concurrent applications (which most server software, including this project, is) eliminates data races. Within our Go codebase, this has been the source of a number of very hard-to-track-down bugs over the years. Its error handling also helps developers write correct software and reduces the number of runtime bugs you might otherwise create.
Also, it’s embeddable into other languages and systems. This means we can embed it into InfluxDB or other parts of our stack or other analytics systems. We could even compile it down to web assembly and run it in the browser.
There’s so much to love about Rust, but this talk isn’t about that. Ultimately, I want this project to form the basis of future analytics systems for the next few decades and beyond. I remember a blog post Bryan Cantrill wrote about Rust in which he talked about software with longevity; he felt that Rust was a language that would ultimately help you build that kind of software. That’s the bet we’re making here.
The project is InfluxDB IOx, which is short for iron oxide so it’s pronounced InfluxDB eye-ox.
We’ll take a look at the high-level architecture of it, but I just want to caveat this. This project is very early stage. We’ve largely been in research mode validating our assumptions on performance, compression, and functionality. We’re not producing builds yet and we don’t have documentation up yet. But there’s a project README and you can build from source.
We wanted to open this up early so that our community of users could see what we’re doing.
The second thing to know is that this project is built around Apache Arrow. Arrow is an in-memory columnar data specification. But it’s also a persistence format via Apache Parquet, which is widely used both inside and outside the Arrow ecosystem. Most data warehouses and big data processing systems can read and write Parquet data.
Arrow is also Arrow Flight, an RPC specification and high performance client/server framework for transferring large datasets over the network.
Within the Rust part of Arrow is another project called DataFusion, which is a columnar SQL execution engine. We’re building on top of that and contributing to it.
We’re using all of these tools. That makes the big headline with Arrow the fact that we’re no longer creating this database by ourselves. With Arrow as the core, we’re working with contributors around the world that are using these libraries in their own data systems.
This is the big architectural change. InfluxDB IOx is an in-memory columnar database that uses object storage for persistence with data stored in Parquet files.
We looked at the existing open source columnar databases when we were starting out. We wondered if they could form the basis of a future InfluxDB backend. What we found was that they weren’t optimized for time series. Specifically, they have varying degrees of dictionary support, which is critical for our use case, little support for querying directly on compressed in-memory data with late materialization, and they weren’t optimized for windowed aggregates and computation on time. They seemed to be built around a pure analytics use case that asks questions about aggregations up to a single point in time.
Further, they weren’t built with our core need of being able to run in an ephemeral environment with no locally attached storage, using object store for all persistence. Our evaluation pointed to a missing solution in the open source market.
It’s not a storage engine. We’re not building our own storage engine short of buffering data in memory and writing it out to Parquet files. The persistence formats we’re using under the hood are Flatbuffers for the write ahead log and Parquet files for immutable blocks of data.
With Parquet and object storage for persistence, this opens up how you can interact with your data. Backup and restore is outside the concerns of InfluxDB IOx. You can create any kind of backup & restore system you’d like. An IOx server can read some or all of its data from object storage on startup.
Bulk data transfers become trivial. Clients can get Parquet files directly from object storage and they can send Parquet files to InfluxDB IOx to organize in object storage for later query workloads. Thanks to Apache Arrow, there are libraries in many languages to work with Parquet and the support is getting better month over month. Notably, Python, C++ and Java are first class citizens in the Arrow ecosystem. They represent the gold standard of functionality. We’ll help bring Rust up to the same level of compatibility.
Training a machine learning model? Ask IOx where the Parquet files are that have the data you’re looking for, get them directly from object storage, and have it in your Python library of choice, all with a few lines of code.
I should mention that I keep referring to object store, but there are other persistence abstractions as well.
I want to talk quickly about how data is organized in InfluxDB IOx. I think this is important because it shows the flexibility you have as an operator and a user, and it lets you optimize for having large blocks of immutable, unchanging data, which is really what time series is all about. If you’re updating your data, that means you’re literally rewriting history. Sometimes you might do this, but that’s not what we’re optimizing for. We’re optimizing for history being a fixed thing that you can work with easily and transform on the fly at query time.
That means that you have blocks of data that you can move around to other servers, send out to clients, and represent compactly in object storage.
First you have the partition key, which is generated for each line that comes in. It can use any of the metadata or actual data to generate a string that represents the partition key. You could have the measurement name, tag key information or field information or time/date formatting.
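To make this concrete, here is a minimal sketch of partition key generation. The template syntax (`"measurement"`, `"tag:<name>"`, `"time:2h"`) and the 2-hour bucketing are my own assumptions for illustration; IOx lets the operator define the actual scheme per database.

```rust
use std::collections::BTreeMap;

// Build a partition key for an incoming line from its metadata.
// The template pieces here are hypothetical, not IOx's real syntax.
fn partition_key(
    measurement: &str,
    tags: &BTreeMap<&str, &str>,
    epoch_secs: i64,
    template: &[&str],
) -> String {
    template
        .iter()
        .map(|part| match *part {
            "measurement" => measurement.to_string(),
            t if t.starts_with("tag:") => tags.get(&t[4..]).unwrap_or(&"null").to_string(),
            // Index of the 2h window this timestamp falls into.
            "time:2h" => (epoch_secs / 7200).to_string(),
            other => other.to_string(),
        })
        .collect::<Vec<_>>()
        .join("-")
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("region", "us-west");
    // All lines in the same region and 2h window land in the same partition.
    let key = partition_key(
        "cpu",
        &tags,
        1_600_000_000,
        &["measurement", "tag:region", "time:2h"],
    );
    println!("{}", key); // cpu-us-west-222222
}
```

Because the key is just a string computed per line, any mix of measurement name, tag values, and time formatting can drive how data is grouped.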
Partitions are logical groupings of data based on the same partition key. When a partition is snapshotted, you create an immutable block of data. A partition can have multiple blocks, but ideally you’re buffering up everything to snapshot once into a single block. You can always compact blocks later, but this can be a separate process completely outside of the DB.
Blocks have tables of data where a table is once again a logical concept. At the physical level, you have individual Parquet files, which have one table in each and you have in-memory compressed segments that are optimized for query speed with some compression via encoding schemes.
One table per measurement. Tags and fields become columns. One table per Parquet file.
This means that tag and field names must be unique within a measurement.
Schema gets defined and created on the fly as you write data in.
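The points above (one table per measurement, tags and fields as columns, schema created on write) can be sketched as a toy catalog. The type names and `write_line` shape are assumptions for illustration, not IOx’s real API.

```rust
use std::collections::BTreeMap;

// Column kinds a line can contribute. Tags are always strings; fields are typed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnType { Tag, FieldF64, FieldI64, Time }

// A toy catalog: one table per measurement, columns created on first write.
#[derive(Default)]
struct Catalog {
    tables: BTreeMap<String, BTreeMap<String, ColumnType>>,
}

impl Catalog {
    fn write_line(&mut self, measurement: &str, tags: &[&str], fields: &[(&str, ColumnType)]) {
        let table = self.tables.entry(measurement.to_string()).or_default();
        table.entry("time".to_string()).or_insert(ColumnType::Time);
        for t in tags {
            table.entry((*t).to_string()).or_insert(ColumnType::Tag);
        }
        for (name, ty) in fields {
            table.entry((*name).to_string()).or_insert_with(|| ty.clone());
        }
    }
}

fn main() {
    let mut catalog = Catalog::default();
    // Two writes to the same measurement with different tag/field sets:
    catalog.write_line("cpu", &["host"], &[("usage", ColumnType::FieldF64)]);
    catalog.write_line("cpu", &["host", "region"], &[("idle", ColumnType::FieldF64)]);
    // The "cpu" table's schema is the union: time, host, region, usage, idle.
    let cpu = &catalog.tables["cpu"];
    println!("{} columns: {:?}", cpu.len(), cpu.keys().collect::<Vec<_>>());
}
```

Since tag and field names share a single column namespace per table, this is also why they must be unique within a measurement.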
But it’s a start. And we know that we can switch to Parquet as our persistence format without any fear of some sort of data explosion.
We break data up into partitions. How data is partitioned can change over time, because each partition is self-describing in terms of the summary metadata that specifies what tables it has, what columns each of those tables has, and what the summary information is for each of those columns, like min, max, count, sum, and potentially even bloom filters for identifiers.
This summary data is used by the planner at query time. Partition summaries are kept in memory and the query is analyzed to determine which partitions need to be queried to produce a result. Once in a partition, we brute force query against it, and if we have it in our segment store, that happens against compressed data without decompressing it. That is, we perform late materialization and only decompress the values we use.
This means that the partitioning scheme you choose has great impact on what your queries look like. This is why we let the users define it when they create a database/bucket. It can change on a per-database basis.
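The planner’s pruning pass described above can be sketched in a few lines. I’m assuming only a min/max summary on the time column here; the real summaries cover other columns and statistics too.

```rust
// Minimal per-partition summary: just min/max of the time column (assumed shape).
struct TimeSummary { min: i64, max: i64 }

struct Partition { key: String, time: TimeSummary }

// Keep only partitions whose [min, max] time range overlaps the query window.
// Everything else is skipped without touching any column data.
fn prune<'a>(parts: &'a [Partition], lo: i64, hi: i64) -> Vec<&'a str> {
    parts
        .iter()
        .filter(|p| p.time.max >= lo && p.time.min <= hi)
        .map(|p| p.key.as_str())
        .collect()
}

fn main() {
    // Three 2h partitions covering consecutive windows (times in epoch seconds):
    let parts = vec![
        Partition { key: "window-0".into(), time: TimeSummary { min: 0, max: 7_199 } },
        Partition { key: "window-1".into(), time: TimeSummary { min: 7_200, max: 14_399 } },
        Partition { key: "window-2".into(), time: TimeSummary { min: 14_400, max: 21_599 } },
    ];
    // A query window inside the second partition touches only that partition:
    println!("{:?}", prune(&parts, 8_000, 9_000)); // ["window-1"]
}
```

Only the partitions that survive this check are brute-force queried, which is why the partitioning scheme matters so much for query shape.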
We can likely do better. We’re using RLE for the span IDs and trace IDs, and we’d be better off just going with dictionary encoding without the RLE.
Notice that we have time in this example. If you’re looking up by some trace ID, where’d you get it? From a log line? You’ll have a timestamp associated with it. Use it.
If you’re partitioning your data by time, and in most cases this will be at least one of the criteria by which you partition your data, you can quickly narrow down the blocks of data to query against. If you have 2h partitions, then you’ll be able to find the spans you’re looking for by querying at most 2 partitions.
This returns the 10 rows in about 85 milliseconds. If you do the rough math on this it means it was able to brute force on about 1.1B rows/sec. Note that we didn’t actually process all those rows. It was operating on compressed data.
We can likely get this down by a bit more by removing the RLE compression for trace ID and span. Maybe another 2x improvement.
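The RLE point above is easy to see with a minimal sketch of run-length encoding. The function below is a generic illustration, not IOx’s actual encoder.

```rust
// Run-length encode a column as (value, run length) pairs. RLE wins on
// low-cardinality (especially sorted) columns, but on unique values like
// trace/span IDs every run has length 1, so it stores *more*, not less.
fn rle<T: PartialEq + Clone>(vals: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for v in vals {
        match runs.last_mut() {
            Some((last, n)) if *last == *v => *n += 1,
            _ => runs.push((v.clone(), 1)),
        }
    }
    runs
}

fn main() {
    // A sorted, low-cardinality tag column collapses into a few runs:
    let region = ["east", "east", "east", "west", "west"];
    println!("{} runs for {} values", rle(&region).len(), region.len()); // 2 for 5

    // Unique IDs produce one run per row, paying a run length per value;
    // a plain dictionary encoding would serve them better.
    let trace_ids = ["a1", "b2", "c3", "d4"];
    println!("{} runs for {} values", rle(&trace_ids).len(), trace_ids.len()); // 4 for 4
}
```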
The specifics of the compressed in memory columnar store will definitely be the subject of some future tech talks.
Here’s what I think the real future is. The example I just showed takes a data-center-centric view. It assumes that all your data is getting pushed up to some central cluster. I think the future is federated. It operates at the edges as single nodes, it operates in factories in small clusters, and it operates in many data centers worldwide.
You’ll likely have high-precision data that doesn’t make sense to replicate up to a central place. Or at least you’ll only replicate it in highly compressed form. The future distributed time series system isn’t a cluster that runs in a data center, even if it has rack-aware capabilities and multi-region routing.
There’s no limit to the scale of time series data that we’ll be collecting over the coming decades. We need flexibility in how it’s replicated, queried, and stored.
* Created InfluxDB because we saw so many people re-inventing the wheel and we wanted Influx to be the basis of it
* However, the lack of distributed features left a gap in the market
* Infrastructure projects that fall under source available or community licenses severely limit the audience and what you can build
* InfluxDB IOx is dual-licensed under MIT and Apache 2 as is common in the Rust community. No community license, no source available license, no restrictions. You can build new projects using this code, you can build new businesses using this code, you can do whatever you want with it.
Conway’s law says that you ship your org chart. That is, if you create two teams to build a system, you’ll get a system composed of two parts.
I propose Dix’s maxim as it relates to open source and licensing generally, which is that your licensing strategy is your commercialization strategy, whether by accident or design.
The architecture approaches for IOx are deliberate choices because of not only the functionality and operational properties we wanted in the system, but also in how we plan to commercialize it.
InfluxDB IOx is designed to be a shared-nothing server with an API that gives the operator total control over how it behaves. However, the operator must make those changes as they are needed. Who does this operational coordination?
In the most simple setups of a single server, you don’t worry about it. In two server setups you can likely get by with shell scripts and a cron job.
But the more complex your environment becomes, the more complicated this coordination becomes. It was a design goal for us to separate out the core database work from the operational work across a fleet of servers. We will create this software for our own needs to operate our cloud environment. However, our cloud may be different than yours. Your environment may be different. This is why the operational coordination is kept separate. So there is maximum flexibility in topology and configuration.
We plan to run the InfluxDB IOx open source bits as is in our own cloud. We won’t be running a fork, we’ll be running right off the main branch.
At the beginning of this talk I mentioned my introduction of InfluxDB to the world. And I titled it this.
I’ll be giving more talks about InfluxDB IOx over the coming months. But here’s how I’m thinking about it. Yes, it’s a distributed time series database. But it’s a lot more than just that.
It’s federated and this is a core part of its design. With time series and analytics data, the future is federated. The scale is larger than you’ll want to manage and push up to a single cluster. You’ll have edge, multiple data centers, and many thousands of potential nodes all communicating with each other.