Paul Dix, CTO and co-founder of InfluxData, discussed the future of InfluxDB and the release of InfluxDB 2.0 Open Source. He explained that InfluxDB 2.0 has been rebuilt from the ground up to address limitations of the original InfluxDB, such as its lack of distributed features and poor performance on high-cardinality analytics data. The new database, called InfluxDB IOx, uses a columnar data store with Parquet files and is designed to be distributed, federated, and able to run analytics at scale on high-cardinality data.
In Pravega's first community meeting as a CNCF project, we overviewed experimental features of Pravega:
* Schema Registry - preserving the structure of data in an unstructured storage system and controlling for safe schema evolution
* Consumption-Based Retention - stream truncation based on subscriber positions
* Simplified Long-Term Storage (SLTS) - abstracting the distributed management of segments while removing complicated problems such as fencing
* SLTS Plugin for BookKeeper - an implementation of the SLTS interfaces for BlobIt! object stores on BookKeeper: https://github.com/diegosalvi/pravega-blobit-chunkmanager
Cloudian HyperStore offers 100% S3 compatibility for low-cost, scalable smart object storage.
With HyperStore 6.0, we are focused on bringing down operational costs so that you can more effectively track, manage, and optimize your data storage as you scale.
In object storage systems, objects are tagged with unique identifiers. Any modification creates a new object, and with the common method of consistent hashing, some disks may end up being used more heavily than others.
Cloudian HyperStore utilizes dynamic object routing to even out average disk utilization and prevent a select number of disks from being overworked.
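The skew that motivates dynamic routing is easy to reproduce with a toy consistent-hashing ring. This is a minimal sketch, not Cloudian's actual routing algorithm; the disk names and object IDs are made up, and no virtual nodes are used precisely so the uneven arcs show up:

```python
import hashlib
from bisect import bisect_right
from collections import Counter

def ring_position(key: str) -> int:
    """Map a key onto a 32-bit hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

# Place a few disks on the ring (no virtual nodes, to make the skew visible).
disks = ["disk-a", "disk-b", "disk-c", "disk-d"]
ring = sorted((ring_position(d), d) for d in disks)
positions = [p for p, _ in ring]

def owner(object_id: str) -> str:
    """An object belongs to the first disk clockwise from its hash."""
    i = bisect_right(positions, ring_position(object_id)) % len(ring)
    return ring[i][1]

# Hash 10,000 object IDs and count how many land on each disk.
load = Counter(owner(f"object-{n}") for n in range(10_000))
print(load)  # typically far from a uniform 2,500 per disk
```

Real systems mitigate this with many virtual nodes per disk, or, as described above, by dynamically rerouting objects away from hot disks.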
Introduction to Container Storage Interface (CSI) - Idan Atias
Among the cool stuff we do at Silk, my colleagues and I develop the Silk CSI Plugin for customers who use our system as the storage layer for their Kubernetes workloads.
Before deep diving into the code, and as part of my ramp-up on this subject, I prepared some slides that cover some basic and important information on this topic.
These slides start by recapping some basic storage principles in containers and Kubernetes, continue with some more advanced use cases (including an "offline demo" of persisting Redis data on EBS volumes), and end with detailed information on the CSI solution itself.
IMHO, reviewing these slides can improve your understanding of this matter and get you started implementing your own CSI plugin.
The main sources of information I used for preparing these slides are:
* Official CSI docs
* Kubernetes Storage Lingo 101 - Saad Ali, Google
* Container Storage Interface: Present and Future - Jie Yu, Mesosphere, Inc.
Learn more about Cloudian HyperStore's various features and benefits, including 100% S3 compatibility, multi-tenancy, data protection, and proactive data rebuilding.
DUG'20: 10 - Storage Orchestration for Composable Storage Architectures - Andrey Kudryavtsev
RSC's BasIS storage orchestration platform addresses complications with deploying DAOS storage. It simplifies DAOS deployment by dynamically composing DAOS clusters from servers' NVMe and PMEM resources over a fabric. This composable disaggregated approach provides flexibility to use PMEM nodes for different roles like DAOS or databases. The orchestration significantly improves on DAOS by making it deployable on existing heterogeneous servers and suitable for cloud environments. Performance tests show NVMe-over-Fabric with the orchestrator achieves similar throughput to local NVMe drives.
Cronicle is a multi-server task scheduler that can run jobs across a fleet of machines. Storreduce is a cloud storage deduplication solution that can reduce storage usage by up to 99% when backing up data to cloud object storage like S3. The proposed backup solution uses Cronicle to schedule backups, Storreduce for data deduplication, and named pipes for high-speed data transfer between servers and to S3. Differential backups are performed to reduce backup sizes and bandwidth usage.
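The savings from deduplicating repetitive backups can be illustrated with a toy block-level store. This is a simplified sketch, not Storreduce's actual algorithm (which would use content-defined chunking and an object-store backend); the `DedupStore` class and file names are made up for illustration:

```python
import hashlib

class DedupStore:
    """Toy block-level deduplicating store: identical chunks are kept once."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # sha256 digest -> chunk bytes
        self.objects = {}  # object name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only store payload for chunks we have not seen before.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.objects[name] = digests

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.objects[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

store = DedupStore()
backup = b"A" * 16384 + b"B" * 4096       # highly repetitive, like daily backups
store.put("monday.bak", backup)
store.put("tuesday.bak", backup)          # a second full backup adds ~nothing
print(store.stored_bytes(), "bytes stored for", 2 * len(backup), "bytes written")
```

Because consecutive backups share nearly all their chunks, only the unique chunk payloads consume storage, which is where the "up to 99%" reduction comes from.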
Keeping your application’s latency SLAs no matter what - ScyllaDB
Businesses that once measured performance in seconds now measure it down to the millisecond and even the microsecond in order to provide optimal user experience.
For a NoSQL database few things are more important than keeping latencies low and bounded. Yet some databases suffer latency spikes from such regular occurrences as Java Virtual Machine (JVM) “garbage collection,” context switches, database repair, cache flushes and so on. This makes long-tail latency very tricky to diagnose and fix, as it’s often a “whack-a-mole” exercise.
In this session, we will cover:
The systemic causes of latency spikes
How to keep latencies bounded and predictable
How to manage latency-inducing events
How Scylla helps optimize for 99% latency of <1msec
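The gap between average and tail latency that this session targets is easy to see numerically. The sketch below uses simulated latencies (the distribution is made up to mimic rare GC-like pauses) and a nearest-rank percentile, so the exact figures are illustrative only:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

random.seed(7)
# Simulated request latencies in ms: mostly fast, with rare long pauses.
latencies = [random.uniform(0.2, 0.9) for _ in range(9_900)]
latencies += [random.uniform(50, 200) for _ in range(100)]   # 1% spikes

print(f"mean  = {sum(latencies) / len(latencies):.2f} ms")
print(f"p50   = {percentile(latencies, 50):.2f} ms")
print(f"p99   = {percentile(latencies, 99):.2f} ms")
print(f"p99.9 = {percentile(latencies, 99.9):.2f} ms")
```

Note how a mere 1% of spiky requests barely moves the median yet dominates the higher percentiles, which is why SLAs are stated at p99 or beyond rather than as averages.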
DataStax recently announced the general availability of DataStax Enterprise 4.7 (DSE 4.7), the leading database platform purpose-built for the performance and availability demands of web, mobile, and IoT applications. In this product launch webinar, Robin Schumacher, VP of Products, explores the wide range of enhancements in DSE 4.7 including enterprise-class search, analytics, and in-memory.
Zabbix was experiencing performance issues due to large history tables in the database. To address this, the architecture was changed to store history data in Elasticsearch instead of database tables. This improved scalability and performance. The basic item and event data remained in the MariaDB database cluster. Zabbix proxies were also used to distribute load across multiple network segments. With this new architecture, history data is indexed in Elasticsearch without database tables, improving query speed and reducing database size.
Cisco: Cassandra adoption on Cisco UCS & OpenStack - DataStax Academy
In this talk we will address how we developed our Cassandra environments using the Cisco UCS OpenStack platform with DataStax Enterprise Edition software. In addition, we are utilizing open source Ceph storage in our infrastructure to optimize performance and reduce costs.
How to Protect Big Data in a Containerized Environment - BlueData, Inc.
Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has recently rolled out Transparent Data Encryption (TDE) support in HDFS. Transparent Data Encryption refers to the process whereby data is transparently encrypted by the big data application writing the data; it is not decrypted again until it is accessed by another application. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by a processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta... - DataStax
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE), which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform, providing a developer-friendly deployment experience. He has 15+ years of experience working in a variety of roles, such as software engineer, project manager, and product manager. Ravi received a Master's degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
Reporting from the Trenches: Intuit & Cassandra - DataStax
Rekha Joshi presents on how Intuit uses the Cassandra database to enable personalized A/B testing and improve customer experiences. Intuit handles large volumes of customer data and required a database with high security, scalability, availability and tunable performance. Cassandra met these requirements and became Intuit's standard NoSQL database. Rekha discusses how Intuit leverages Cassandra's capabilities and provides best practices for effective Cassandra usage, configuration, and performance tuning.
Webinar: How to build a highly available time series solution with KairosDB - Julia Angell
A highly available time-series solution requires an efficient tailored front-end framework and a backend database with a fast ingestion rate. In this webinar, you'll learn the steps for building an efficient TSDB solution with Scylla and KairosDB, get real-world use cases and metrics, plus considerations when choosing time series solutions.
Why you need benchmarks
Finding the right database solution for your use case can be an arduous journey. The database deployment touches aspects of throughput performance, latency control, high availability and data resilience.
You will need to decide on the infrastructure to use: Cloud, on-premise or a hybrid solution.
Data models also have an impact on finding the right fit for the use case. Once you establish a requirements set, the next step is to test your use case against the databases of choice.
In this workshop, we will discuss the different data points you need to collect in order to get the most realistic testing environment.
We will cover:
Data model impact on performance and latency
Client behavior related to database capabilities
Failover and high availability testing
Hardware selection and cluster configuration impact
We will show 2 benchmarking tools you can use to test and benchmark your clusters to identify the optimal deployment scenario for your use case.
Attend this virtual workshop if you are:
Looking to minimize the cost of your database deployment
Making a database decision based on performance and scale data
Planning to emulate your workload on a pre-production system where you can test, fail fast and learn.
Data Pipelines with Spark & DataStax EnterpriseDataStax
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
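The role Kafka plays in that streaming pipeline, decoupling producers from consumers through a shared buffer, can be sketched with a bounded in-process queue. This is an illustration of the pattern only, not Kafka or DSE code; the event values and the toy "analytics" step are made up:

```python
import queue
import threading

# A bounded queue stands in for the Kafka topic: producers and consumers
# only share this buffer and never call each other directly.
topic = queue.Queue(maxsize=100)
results = []
SENTINEL = None  # marks end-of-stream for this toy example

def producer(events):
    for event in events:
        topic.put(event)   # blocks if consumers fall behind (backpressure)
    topic.put(SENTINEL)

def consumer():
    while True:
        event = topic.get()
        if event is SENTINEL:
            break
        results.append({"reading": event, "doubled": event * 2})  # toy analytics

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer, args=(range(10),))
c.start()
p.start()
p.join()
c.join()
print(len(results), "events processed")
```

Because neither side knows about the other, producers and consumers can be scaled, restarted, or replaced independently, which is the property the abstract attributes to Kafka in the streaming case.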
This meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to shed light on the similarities and differences between the platforms. If you have questions about how the platforms compare, this is the meeting to attend!
Overcoming Barriers of Scaling Your Database - ScyllaDB
Scaling distributed databases successfully requires meeting myriad challenges from physical distribution of your data across on-premises locations, public cloud vendors, geographies and political entities to adopting technologies to overcome fundamental operational bottlenecks. Join ScyllaDB's Peter Corless, director of technical advocacy, as he interviews Moreno Garcia y Silva, head of solution architecture, about how to navigate both technical ecosystem and database architectural challenges for this next tech cycle.
Takeaways:
- Recognizing and classifying barriers to scaling
- Solutions to overcome scaling challenges
- Upfront planning and real-time response
Building a GPU-enabled OpenStack Cloud for HPC - Blair Bethwaite, Monash Univ... - OpenStack
Audience Level
Intermediate
Synopsis
M3 is the latest generation system of the MASSIVE project, an HPC facility specializing in characterization science (imaging and visualization). Using OpenStack as the compute provisioning layer, M3 is a hybrid HPC/cloud system, custom-integrated by Monash’s R@CMon Research Cloud team. Built to support Monash University’s next-gen high-throughput instrument processing requirements, M3 is split roughly half-and-half between GPU-accelerated and CPU-only nodes.
We’ll discuss the design and tech used to build this innovative platform as well as detailing approaches and challenges to building GPU-enabled and HPC clouds. We’ll also discuss some of the software and processing pipelines that this system supports and highlight the importance of tuning for these workloads.
Speaker Bio
Blair Bethwaite: Blair has worked in distributed computing at Monash University for 10 years, with OpenStack for half of that. Having served as team lead, architect, administrator, user, researcher, and occasional hacker, Blair’s unique perspective as a science power-user, developer, and system architect has helped guide the evolution of the research computing engine central to Monash’s 21st Century Microscope.
Lance Wilson: Lance is a mechanical engineer, who has been making tools to break things for the last 20 years. His career has moved through a number of engineering subdisciplines from manufacturing to bioengineering. Now he supports the national characterisation research community in Melbourne, Australia using OpenStack to create HPC systems solving problems too large for your laptop.
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
IoT Architectural Overview - 3 use case studies from InfluxData - InfluxData
This SlideShare reviews how an IoT Data platform fits in with any IoT architecture to manage the data requirements of every IoT implementation. It is based on the learnings of existing IoT practitioners that have adopted an IoT Data platform using InfluxData. These clients have a range of solutions, from home automation (thermostat monitoring & management) to infrastructure management (solar panel monitoring and control), manufacturing (equipment monitoring & control), and environmental management (green wall monitoring & control).
These learnings will help IoT adopters avoid the common pitfalls current clients faced on their journey to developing their IoT solution.
This document introduces HyperStore's "forever live" storage solution. It aims to provide a storage platform that allows for faster innovation and upgrade cycles without downtime or data migration. The FL3000 platform is a modular, high-density storage system that separates compute and storage into interchangeable modules. This extreme modularity allows components to be replaced or upgraded with zero downtime. The design goals focus on high density to reduce data center footprint, hot-swappable components for efficiency, minimizing failure domains, and reducing power and cooling costs. The "forever live" experience provides annual new features and qualified hardware upgrades without requiring downtime or data migration by deploying once and swapping modules as needed on the modular, software-defined platform.
Steering the Sea Monster - Integrating Scylla with Kubernetes - ScyllaDB
Kubernetes is a declarative system for automatically deploying, managing, and scaling server-side applications and their dependencies. In this webinar, we will introduce Kubernetes at a high level and demonstrate how to get started using Scylla with Kubernetes and Google Compute Engine.
Join us to:
Understand the principles of Kubernetes and how it solves common problems of deploying distributed applications
Explore an example configuration of Scylla with Kubernetes that can serve as a starting point for your own system.
Get insight into the performance characteristics of Scylla when it is run in a container (e.g. Docker) and deployed via Kubernetes.
The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat - OpenStack
Audience: Intermediate
About: Learn how cloud storage differs to traditional storage systems and how that delivers revolutionary benefits.
Starting with an overview of how Ceph integrates tightly into OpenStack, you’ll see why 62% of OpenStack users choose Ceph. We’ll then take a peek into the very near future to see how rapidly Ceph is advancing and how you’ll be able to achieve all your childhood hopes and dreams in ways you never thought possible.
Speaker Bio: Andrew Hatfield – Practice Lead–Cloud Storage and Big Data, Red Hat
Andrew has over 20 years experience in the IT industry across APAC, specialising in Databases, Directory Systems, Groupware, Virtualisation and Storage for Enterprise and Government organisations. When not helping customers slash costs and increase agility by moving to the software-defined storage future, he’s enjoying the subtle tones of Islay Whisky and shredding pow pow on the world’s best snowboard resorts.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
The Last Pickle: Distributed Tracing from Application to Database - DataStax Academy
Monitoring provides information on overall system performance; however, tracing is necessary to understand the performance of individual requests. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems, although knowing which queries to trace, and why the application makes them, still requires deep technical knowledge. By merging application tracing via Zipkin with Cassandra query tracing, we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
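The core mechanic of passing one trace identifier through every layer of a request can be sketched with Python's `contextvars`. This is a hypothetical illustration of the propagation pattern, not Zipkin's or Cassandra's actual instrumentation; the function names and the in-memory `captured` log are made up:

```python
import contextvars
import uuid

# One context variable carries the trace id across the layers of a request,
# mimicking how a trace id would be forwarded down to the database tier.
current_trace = contextvars.ContextVar("trace_id", default=None)
captured = []  # stands in for spans reported to a tracing backend

def handle_request(query: str):
    """Edge of the system: start a new trace, then do the work."""
    token = current_trace.set(uuid.uuid4().hex[:16])
    try:
        run_query(query)
    finally:
        current_trace.reset(token)

def run_query(query: str):
    # The "database driver" reads the trace id from context rather than
    # from an explicit argument, so intermediate layers need no changes.
    captured.append({"trace_id": current_trace.get(), "query": query})

handle_request("SELECT * FROM users")
handle_request("SELECT * FROM orders")
print(captured[0]["trace_id"] != captured[1]["trace_id"])  # separate traces
```

Joining application spans and database query traces on that shared identifier is what yields the single end-to-end tracing view the talk describes.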
OpenStack and Red Hat: How we learned to adapt with our customers in a maturi... - OpenStack
Audience Level
All levels
Synopsis
Peter has been involved in the OpenStack community since its B release, and he has been enabling and helping customers across various industries adopt OpenStack in strategic ways. In this session, you will learn from his experience what Red Hat’s perspective is on the current state of affairs in the OpenStack community and the path ahead that Red Hat is putting its efforts into. OpenStack is not a product that tries to solve any one business problem in particular, but a technology that aims to be usable by many, so what are the required steps to make sure that your organisation is ready for OpenStack-based cloudification and transformation?
Speaker Bio:
Peter Jung is a Senior Business Development Manager at Red Hat where he leads the practice in the areas of Cloud, SDN/NFV and IoT across Australia and New Zealand. He is passionate about open innovation and open source software development model as the foundation for next generation society and ICT systems. Prior to Red Hat, he had various roles at Cisco and Dell for 15 years. He holds a BSEE and an MBA.
OpenStack Australia Day Melbourne 2017
https://events.aptira.com/openstack-australia-day-melbourne-2017/
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard - InfluxData
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre... - InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
DataStax recently announced the general availability of DataStax Enterprise 4.7 (DSE 4.7), the leading database platform purpose-built for the performance and availability demands of web, mobile, and IOT applications. In this product launch webinar, Robin Schumacher, VP of Products, explores the wide range of enhancements in DSE 4.7 including enterprise class search, analytics, and in-memory.
Zabbix was experiencing performance issues due to large history tables in the database. To address this, the architecture was changed to store history data in Elasticsearch instead of database tables. This improved scalability and performance. The basic item and event data remained in the MariaDB database cluster. Zabbix proxies were also used to distribute load across multiple network segments. With this new architecture, history data is indexed in Elasticsearch without database tables, improving query speed and reducing database size.
Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy
n this talk we will address how we developed our Cassandra environments utilizing Cisco UCS Open Stack Platform with the DataStax Enterprise Edition software. In addition we are utilizing OpenSource CEPH storage in our Infrastructure to optimize the Performance and reduce the costs.
How to Protect Big Data in a Containerized EnvironmentBlueData, Inc.
Every enterprise spends significant resources to protect its data. This is especially true in the case of big data, since some of this data may include sensitive or confidential customer and financial information. Common methods for protecting data include permissions and access controls as well as the encryption of data at rest and in flight.
The Hadoop community has recently rolled out Transparent Data Encryption (TDE) support in HDFS. Transparent Data Encryption refers to the process whereby data is transparently encrypted by the big data application writing the data; it is not decrypted again until it is accessed by another application. The data is encrypted during its entire lifespan—in transit and at rest—except when it is being specifically accessed by a processing application.
TDE is an excellent approach for protecting data stored in data lakes built on the latest versions of HDFS. However, it does have its challenges and limitations. Systems that want to use TDE require tight integration with enterprise-wide Kerberos Key Distribution Center (KDC) services and Key Management Systems (KMS). This integration isn’t easy to set up or maintain. These issues can be even more challenging in a virtualized or containerized environment where one Kerberos realm may be used to secure the big data compute cluster and a different Kerberos realm may be used to secure the HDFS filesystem accessed by this cluster.
BlueData has developed significant expertise in configuring, managing, and optimizing access to TDE-protected HDFS. This session at the Strata Data Conference in March 2018 (by Thomas Phelan, co-founder and chief architect at BlueData) offers a detailed overview of how transparent data encryption works with HDFS, with a particular focus on containerized environments.
You’ll learn how HDFS TDE is configured and maintained in an environment where many big data frameworks run simultaneously (e.g., in a hybrid cloud architecture using Docker containers). Moreover, you’ll learn how KDC credentials can be managed in a Kerberos cross-realm environment to provide data scientists and analysts with the greatest flexibility in accessing data while maintaining complete enterprise-grade data security.
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63763
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE) which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform providing a developer friendly deployment experience. He has 15+ years of experience, working in variety of roles such as software engineer, project manager and product manager. Ravi received a Masters degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
Reporting from the Trenches: Intuit & CassandraDataStax
Rekha Joshi presents on how Intuit uses the Cassandra database to enable personalized A/B testing and improve customer experiences. Intuit handles large volumes of customer data and required a database with high security, scalability, availability and tunable performance. Cassandra met these requirements and became Intuit's standard NoSQL database. Rekha discusses how Intuit leverages Cassandra's capabilities and provides best practices for effective Cassandra usage, configuration, and performance tuning.
Webinar: How to Build a Highly Available Time Series Solution with KairosDB – Julia Angell
A highly available time-series solution requires an efficient tailored front-end framework and a backend database with a fast ingestion rate. In this webinar, you'll learn the steps for building an efficient TSDB solution with Scylla and KairosDB, get real-world use cases and metrics, plus considerations when choosing time series solutions.
Why you need benchmarks
Finding the right database solution for your use case can be an arduous journey. The database deployment touches aspects of throughput performance, latency control, high availability and data resilience.
You will need to decide on the infrastructure to use: Cloud, on-premise or a hybrid solution.
Data models also have an impact on finding the right fit for the use case. Once you establish a requirements set, the next step is to test your use case against the databases of choice.
In this workshop, we will discuss the different data points you need to collect in order to get the most realistic testing environment.
We will cover:
Data model impact on performance and latency
Client behavior related to database capabilities
Failover and high availability testing
Hardware selection and cluster configuration impact
We will show 2 benchmarking tools you can use to test and benchmark your clusters to identify the optimal deployment scenario for your use case.
Attend this virtual workshop if you are:
Looking to minimize the cost of your database deployment
Making a database decision based on performance and scale data
Planning to emulate your workload on a pre-production system where you can test, fail fast and learn.
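The latency data points such a benchmark collects are usually compared as percentiles across candidate deployments. A minimal Python sketch of that summarization step (nearest-rank method; the helper name is ours, not a workshop tool):

```python
# Hypothetical helper: summarize raw latency samples (ms) into the
# percentiles usually compared across candidate deployments.
def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    s = sorted(samples)
    rank = max(1, int(round(p / 100 * len(s))))
    return s[rank - 1]

latencies_ms = [2, 3, 3, 4, 5, 7, 9, 12, 30, 95]
summary = {p: percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {50: 5, 95: 95, 99: 95}
```

Note how a single slow outlier dominates p95/p99 while leaving the median untouched, which is why tail percentiles, not averages, drive most failover and capacity decisions.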
Data Pipelines with Spark & DataStax Enterprise – DataStax
This document discusses building data pipelines for both static and streaming data using Apache Spark and DataStax Enterprise (DSE). For static data, it recommends using optimized data storage formats, distributed and scalable technologies like Spark, interactive analysis tools like notebooks, and DSE for persistent storage. For streaming data, it recommends using scalable distributed technologies, Kafka to decouple producers and consumers, and DSE for real-time analytics and persistent storage across datacenters.
This meeting we'll host a discussion on Google Cloud Platform and Amazon Web Services to bring light to similarities and differences between platforms. If you have questions about how our platforms compare this is the meeting to attend!
Overcoming Barriers of Scaling Your Database – ScyllaDB
Scaling distributed databases successfully requires meeting myriad challenges from physical distribution of your data across on-premises locations, public cloud vendors, geographies and political entities to adopting technologies to overcome fundamental operational bottlenecks. Join ScyllaDB's Peter Corless, director of technical advocacy, as he interviews Moreno Garcia y Silva, head of solution architecture, about how to navigate both technical ecosystem and database architectural challenges for this next tech cycle.
Takeaways:
- Recognizing and classifying barriers to scaling
- Solutions to overcome scaling challenges
- Upfront planning and real-time response
Building a GPU-enabled OpenStack Cloud for HPC - Blair Bethwaite, Monash University – OpenStack
Audience Level
Intermediate
Synopsis
M3 is the latest generation system of the MASSIVE project, an HPC facility specializing in characterization science (imaging and visualization). Using OpenStack as the compute provisioning layer, M3 is a hybrid HPC/cloud system, custom-integrated by Monash’s R@CMon Research Cloud team. Built to support Monash University’s next-gen high-throughput instrument processing requirements, M3 is roughly half GPU-accelerated and half CPU-only.
We’ll discuss the design and tech used to build this innovative platform as well as detailing approaches and challenges to building GPU-enabled and HPC clouds. We’ll also discuss some of the software and processing pipelines that this system supports and highlight the importance of tuning for these workloads.
Speaker Bio
Blair Bethwaite: Blair has worked in distributed computing at Monash University for 10 years, with OpenStack for half of that. Having served as team lead, architect, administrator, user, researcher, and occasional hacker, Blair’s unique perspective as a science power-user, developer, and system architect has helped guide the evolution of the research computing engine central to Monash’s 21st Century Microscope.
Lance Wilson: Lance is a mechanical engineer, who has been making tools to break things for the last 20 years. His career has moved through a number of engineering subdisciplines from manufacturing to bioengineering. Now he supports the national characterisation research community in Melbourne, Australia using OpenStack to create HPC systems solving problems too large for your laptop.
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
IoT Architectural Overview - 3 use case studies from InfluxData – InfluxData
This SlideShare reviews how an IoT Data platform fits in with any IoT Architecture to manage the data requirements of every IoT implementation. It is based on the learnings from existing IoT practitioners that have adopted an IoT Data platform using InfluxData. These clients have a range of solutions–from home automation (thermostat monitoring & management), to infrastructure management (solar panel monitoring and control) to manufacturing (equipment monitoring & control) as well as environmental management (green wall monitoring & control).
These learnings will help IoT adopters avoid the common pitfalls current clients faced on their journey to developing their IoT solution.
This document introduces HyperStore's "forever live" storage solution. It aims to provide a storage platform that allows for faster innovation and upgrade cycles without downtime or data migration. The FL3000 platform is a modular, high-density storage system that separates compute and storage into interchangeable modules. This extreme modularity allows components to be replaced or upgraded with zero downtime. The design goals focus on high density to reduce data center footprint, hot-swappable components for efficiency, minimizing failure domains, and reducing power and cooling costs. The "forever live" experience provides annual new features and qualified hardware upgrades without requiring downtime or data migration by deploying once and swapping modules as needed on the modular, software-defined platform.
Steering the Sea Monster - Integrating Scylla with Kubernetes – ScyllaDB
Kubernetes is a declarative system for automatically deploying, managing, and scaling server-side applications and their dependencies. In this webinar, we will introduce Kubernetes at a high level and demonstrate how to get started using Scylla with Kubernetes and Google Compute Engine.
Join us to:
Understand the principles of Kubernetes and how it solves common problems of deploying distributed applications
Explore an example configuration of Scylla with Kubernetes that can serve as a starting point for your own system.
Get insight into the performance characteristics of Scylla when it is run in a container (e.g. Docker) and deployed via Kubernetes.
The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat – OpenStack
Audience: Intermediate
About: Learn how cloud storage differs to traditional storage systems and how that delivers revolutionary benefits.
Starting with an overview of how Ceph integrates tightly into OpenStack, you’ll see why 62% of OpenStack users choose Ceph. We’ll then take a peek into the very near future to see how rapidly Ceph is advancing and how you’ll be able to achieve all your childhood hopes and dreams in ways you never thought possible.
Speaker Bio: Andrew Hatfield – Practice Lead–Cloud Storage and Big Data, Red Hat
Andrew has over 20 years experience in the IT industry across APAC, specialising in Databases, Directory Systems, Groupware, Virtualisation and Storage for Enterprise and Government organisations. When not helping customers slash costs and increase agility by moving to the software-defined storage future, he’s enjoying the subtle tones of Islay Whisky and shredding pow pow on the world’s best snowboard resorts.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
The Last Pickle: Distributed Tracing from Application to Database – DataStax Academy
Monitoring provides information on system performance, however tracing is necessary to understand individual request performance. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems. Although knowing what queries to trace and why the application makes them still requires deep technical knowledge. By merging Application tracing via Zipkin and Cassandra query tracing we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
OpenStack and Red Hat: How we learned to adapt with our customers in a maturi… – OpenStack
Audience Level
All levels
Synopsis
Peter has been involved in the OpenStack community since its B-release, and he has been enabling and helping customers across various industries adopt OpenStack in strategic ways. In this session, you will learn from his experience what Red Hat’s perspective is on the current state of affairs in the OpenStack community and the path ahead that Red Hat is putting its efforts into. OpenStack is not a product that tries to solve any one business problem in particular, but a technology that aims to be usable for many; this session examines the steps required to make sure your organisation is ready for OpenStack-based cloudification and transformation.
Speaker Bio:
Peter Jung is a Senior Business Development Manager at Red Hat, where he leads the practice in the areas of Cloud, SDN/NFV and IoT across Australia and New Zealand. He is passionate about open innovation and the open source software development model as the foundation for next-generation society and ICT systems. Prior to Red Hat, he held various roles at Cisco and Dell over 15 years. He holds a BSEE and an MBA.
OpenStack Australia Day Melbourne 2017
https://events.aptira.com/openstack-australia-day-melbourne-2017/
InfluxEnterprise Architecture Patterns by Tim Hall & Sam Dillard – InfluxData
1. The document provides an overview of InfluxEnterprise, including its core open source functionality, high availability features, scalability, fine-grained authorization, support options, and on-premise or cloud deployment options.
2. It discusses signs that an organization may be ready for InfluxEnterprise, such as high CPU usage, issues with single node deployments, and needing improved data durability or throughput.
3. The document covers InfluxEnterprise cluster architecture including meta nodes, data nodes, replication patterns, ingestion and query rates for different replication configurations, and examples for mothership, durable data ingest, and integrating with ElasticSearch deployments.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre… – InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingest and query data in InfluxDB
This document provides an overview of IBM's Internet of Things (IoT) architecture and capabilities. It discusses the key components of an IoT architecture including intelligent gateways, sensor analytics zones, and the deep analytics zone in the cloud. It describes how gateways can help IoT solutions by reducing cloud costs and latency through local analytics and filtering of sensor data. The document then outlines the requirements for databases in gateways, and explains how IBM's Informix database is well-suited to meet these requirements through its small footprint, low memory usage, support for time series and spatial data, and ability to ingest and analyze sensor data in real-time. Finally, it discusses how Informix can be used both in gateways and in the cloud.
Webinar: Faster Log Indexing with Fusion – Lucidworks
The document discusses Lucidworks Fusion, a log analytics platform that combines Apache Solr, Logstash, and Kibana. It describes how Fusion uses a time-based partitioning scheme to index logs into daily collections with hourly shards for query performance. It also discusses using transient collections to handle high volume indexing into multiple shards to avoid bottlenecks. The document provides details on schema design considerations, moving old data to cheaper storage, and GC tuning for Solr deployments handling large-scale log analytics.
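The time-based partitioning idea above can be sketched in a few lines of Python. The collection and shard naming is illustrative only, not Fusion's actual scheme: each log event is routed to a daily collection and an hourly shard within it, so queries over a time window touch only the relevant partitions.

```python
from datetime import datetime, timezone

# Hedged sketch of time-based log partitioning (naming is hypothetical):
# route an event timestamp to a daily collection and an hourly shard.
def route_log_event(epoch_seconds):
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    collection = f"logs_{dt:%Y_%m_%d}"   # one collection per day
    shard = f"hour_{dt.hour:02d}"        # one shard per hour within it
    return collection, shard

print(route_log_event(0))  # ('logs_1970_01_01', 'hour_00')
```

Old daily collections can then be aged out wholesale (moved to cheaper storage or dropped), which is far cheaper than deleting individual documents from one monolithic index.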
Leveraging Cassandra for real-time multi-datacenter public cloud analytics – Julien Anguenot
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
IBM IoT Architecture and Capabilities at the Edge and Cloud – Pradeep Natarajan
IBM Informix is presented as the ideal database solution for IoT architectures due to its small footprint, low memory requirements, support for time series and spatial data, and driverless operation requiring no administration. It can run on gateways to filter and analyze sensor data locally before transmitting to the cloud. In the cloud, Informix can ingest streaming data in real-time, perform operational analytics, and scale out across servers. Benchmarks show Informix outperforming SQLite for IoT workloads in areas like data loading speed, storage requirements, and analytic query speeds.
Apache Geode is an open source in-memory data grid that provides data distribution, replication and high availability. It can be used for caching, messaging and interactive queries. The presentation discusses Geode concepts like cache, region and member. It provides examples of how large companies use Geode for applications requiring real-time response, high concurrency and global data visibility. Geode's performance comes from minimizing data copying and contention through flexible consistency and partitioning. The project is now hosted by Apache and the community is encouraged to get involved through mailing lists, code contributions and example applications.
This document provides an introduction to time series data and InfluxDB. It defines time series data as measurements taken from the same source over time that can be plotted on a graph with one axis being time. Examples of time series data include weather, stock prices, and server metrics. Time series databases like InfluxDB are optimized for storing and processing huge volumes of time series data in a high performance manner. InfluxDB uses a simple data model where points consist of measurements, tags, fields, and timestamps.
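That point model maps directly onto InfluxDB's line protocol, where each point is written as measurement, tags, fields, and timestamp on a single line. A minimal serialization sketch (the helper function is ours, for illustration; real clients also escape special characters):

```python
# Sketch of InfluxDB's data model: a point is a measurement name,
# key/value tags, key/value fields, and a nanosecond timestamp,
# serialized as one line of line protocol.
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage_idle": 98.2},
    1609459200000000000,
)
print(line)
# cpu,host=server01,region=us-west usage_idle=98.2 1609459200000000000
```

Tags are indexed metadata used for filtering and grouping, while fields hold the actual measured values, which is why high-cardinality data belongs in fields rather than tags.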
This document discusses using Apache Geode and ActiveMQ Artemis to build a scalable IoT platform. It introduces IoT and the MQTT protocol. ActiveMQ Artemis is described as a high performance message broker that is embeddable and supports clustering. Geode is presented as a distributed in-memory data platform for building data-intensive applications that require high performance, scalability, and availability. Example users of Geode include large companies handling billions of records and thousands of transactions per second. Key capabilities of Geode like regions, functions, querying, and continuous queries are summarized.
ZooKeeper is a distributed coordination service that provides naming, configuration, synchronization, and group services. It allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers called znodes, organized like a file system and used to store configuration data and other application-defined metadata. ZooKeeper follows a leader-elected consensus protocol to guarantee atomic broadcast of state updates from the leader to followers, and on this foundation provides services like leader election, group membership, synchronization, and configuration management that are essential for distributed systems.
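One of those recipes, leader election, is commonly built on ephemeral sequential znodes: each candidate creates a numbered znode under an election parent, and the candidate holding the lowest sequence number is the leader. An in-memory sketch of the idea (a mock class, not a real client library such as kazoo):

```python
# Mock stand-in for a ZooKeeper client, illustrating leader election
# with sequential znodes: lowest sequence number wins. A real system
# would use ephemeral znodes so a crashed leader's node disappears.
class MockZk:
    def __init__(self):
        self.counter = 0
        self.znodes = {}

    def create_sequential(self, prefix, data):
        # ZooKeeper appends a monotonically increasing 10-digit suffix.
        path = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.znodes[path] = data
        return path

zk = MockZk()
me = zk.create_sequential("/election/node_", b"candidate-a")
zk.create_sequential("/election/node_", b"candidate-b")
leader = min(zk.znodes)  # lexicographically lowest path = lowest sequence
print(leader == me)      # True: the first registrant leads
```

In the real recipe each non-leader watches only its immediate predecessor znode, avoiding a "herd effect" when the leader fails.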
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement – VMware Tanzu
This document provides an agenda for a hands-on introduction and hackathon kickoff for Apache Geode. The agenda includes details about the hackathon, an introduction to Apache Geode including its history and key features, a hands-on lab to build, run, and use Geode, and a Q&A session. It also outlines how to contribute to the Geode project through code, documentation, issue tracking, and mailing lists.
This document provides an overview of IBM's Internet of Things architecture and capabilities. It discusses how IBM's Informix database can be used in intelligent gateways and the cloud for IoT solutions. Specifically, it outlines how Informix is well-suited for gateway and cloud environments due to its small footprint, support for time series and spatial data, and ability to handle both structured and unstructured data. The document also provides examples of how Informix can be used with Node-RED and Docker to develop IoT applications and deploy databases in the cloud.
Taking Splunk to the Next Level - Architecture Breakout Session – Splunk
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
This document outlines the agenda for a training on Oracle RDBMS 12c new features. The training will cover 6 chapters: introduction, multitenant architecture, upgrade features, Flex Cluster, Global Data Service, and an overview of RDBMS features. The agenda provides a high-level overview of topics to be discussed in each chapter, including multitenant architecture concepts, upgrade options and tools, Flex Cluster configurations, Global Data Service components, and new features such as temporary undo and multiple indexes on the same columns.
The document provides an introduction to the ELK stack for log analysis and visualization. It discusses why large data tools are needed for network traffic and log analysis. It then describes the components of the ELK stack - Elasticsearch for storage and search, Logstash for data collection and parsing, and Kibana for visualization. Several use cases are presented, including how Cisco and Yale use the ELK stack for security monitoring and analyzing biomedical research data.
Similar to Paul Dix [InfluxData] | InfluxDays Opening Keynote | InfluxDays Virtual Experience NA 2020
InfluxData is excited to announce InfluxDB Clustered, the self-managed version of InfluxDB 3.0 with unparalleled flexibility, speed, performance, and scale. The evolution of InfluxDB Enterprise, InfluxDB Clustered is delivered as a collection of Kubernetes-based containers and services, which enables you to run and operate InfluxDB 3.0 where you need it, whether that's on-premises or in a private cloud environment. With this new enterprise offering, we’re excited to provide our customers with real-time queries, low-cost object storage, unlimited cardinality, and SQL language support – all with improved data access, support, and security! The newest version of InfluxDB was built on Apache Arrow, and through the open source ecosystem and integrations, extends the value of your time-stamped data.
Join this webinar to learn more about InfluxDB Clustered, and how to manage your large mission-critical workloads in the highly available database service offering!
In this webinar, Balaji Palani and Gunnar Aasen will dive into:
Key features of the new InfluxDB Clustered solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Best Practices for Leveraging the Apache Arrow Ecosystem – InfluxData
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. It enables more efficient analytics workloads for modern CPU and GPU hardware, which makes working with large data sets easier and cheaper.
InfluxData and Dremio are both members of the Apache Software Foundation (ASF). Dremio is a data lakehouse management service known for its scalability and capacity for direct querying across diverse data sources. InfluxDB is the purpose-built time series database, and InfluxDB 3.0 has a new columnar storage engine and uses the Arrow format for representing data and moving data to and from Parquet. Discover how InfluxDB and Dremio have advanced their solutions by relying on the Apache Arrow framework.
Join this live panel as Alex Merced and Anais Dotis-Georgiou dive into:
Advantages to utilizing the Apache Arrow ecosystem
Tips and tricks for implementing the columnar data structure
How developers can best utilize the ASF to innovate and contribute to new industry standards
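The row-versus-columnar distinction at the heart of Arrow can be illustrated in plain Python. This mimics the memory layout only, not the Arrow API itself: a columnar layout stores each field's values contiguously, so an analytic scan over one column touches only that column's data.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"time": 1, "temp": 20.1},
    {"time": 2, "temp": 20.4},
    {"time": 3, "temp": 19.8},
]

# Column-oriented layout (the shape Arrow standardizes in memory):
# each field's values sit contiguously, which is friendlier to CPU
# caches and SIMD for whole-column aggregations.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A column aggregation reads one contiguous list, not every record.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

Because the format is standardized, systems like InfluxDB 3.0 and Dremio can hand such columns to each other (and to Parquet) without serialization round-trips.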
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu… – InfluxData
Bevi are the creators of smart water dispensers which empower people to choose their desired beverage — flat or sparkling, their desired flavor and temperature. Since 2014, Bevi users have saved more than 350 million bottles and cans. Their "smart" water coolers have prevented the extraction of 1.4 trillion oz of oil from Earth and have saved 21.7 billion grams of CO2 from the atmosphere.
Discover how Bevi uses a time series database to enable better predictive maintenance and alerting of their entire ecosystem — including the hardware and software. They are using InfluxDB to collect sensor data in real-time remotely from their internet-connected machines about their status and activity — i.e., flavor and CO2 levels, water temp, filter status, etc. They are using these metrics to improve their customer experience and continuously improve their sustainability practices. Gain tips and tricks on how to best utilize InfluxDB's schema-less design.
Join this webinar as Spencer Gagnon dives into:
Bevi's approach to reducing organizations' carbon footprint — they are saving 50K+ bottles and cans annually
Their entire system architecture — including InfluxDB Cloud, Grafana, Kafka, and DigitalOcean
The importance of using time-stamped data to extend the life of their machines
Power Your Predictive Analytics with InfluxDB – InfluxData
If you're using InfluxDB to store and manage your time series data, you're already off to a great start. But why stop there? In our upcoming webinar, we'll show you how to take your data analysis to the next level by building predictive analytics using a variety of tools and techniques.
We will demonstrate how to use Quix to create custom dashboards and visualizations that allow you to monitor your data in real-time. We'll also introduce you to Hugging Face, a powerful tool for building models that can predict future trends and identify anomalies. With these tools at your disposal, you'll be able to extract valuable insights from your data and make more informed decisions about the future. Don't miss out on this opportunity to improve your data analysis skills and take your business to the next level!
What you will learn:
Use InfluxDB to store and manage time series data
Utilize Quix and Hugging Face to build models, visualize trends, and identify anomalies
Extract valuable insights from your data
Improve your data analysis skills to make informed decisions
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base – InfluxData
Are you considering replacing your legacy data historian and moving your OT data to the cloud? Join this technical webinar to learn how to adopt InfluxDB and IO-Base - a digital platform used to improve operational efficiencies!
Teréga Solutions are the creators of digital solutions used to improve energy efficiencies and to address decarbonization challenges. Their network includes 5,000+ km of gas pipelines within France; they aim to help France attain carbon neutrality by 2050. With these impressive goals in mind, Teréga has created IO-Base — the digital platform to improve industrial performance, and increase profitability. Creating digital twins for their clients allows them to collect data from all production sites and view it in real time, from anywhere and at any time.
Discover how Teréga uses InfluxDB, Docker, and AWS to monitor its gas and hydrogen pipeline infrastructure. They chose to replace their legacy data historian with InfluxDB — the purpose-built time series database. They are collecting more than 100K different metrics at frequencies ranging from every 5 seconds to every 1-2 minutes. They have reduced overall IT spend by 50% and collect 2x the amount of data at 20x the frequency! By using various industrial protocols (Modbus, OPC-UA, etc.), Teréga improved output, reduced TCO, and is now able to create added-value services: forecasting, monitoring, and predictive maintenance.
Join this webinar as Thomas Delquié dives into:
Teréga's approach to modernizing fossil fuel pipelines IT systems while improving yields and safety
Their centralized methodology to collecting sensor, hardware, and network metrics
The importance of time series data and why they chose InfluxDB
Build an Edge-to-Cloud Solution with the MING Stack – InfluxData
FlowForge enables organizations to reliably deliver Node-RED applications in a continuous, collaborative, and secure manner. Node-RED is the popular, low-code programming solution that makes it easy to connect different services using a visual programming environment. InfluxData is the creator of InfluxDB, the purpose-built time series database run by developers at scale and in any environment in the cloud, on-premises, or at the edge.
Jump-start monitoring your industrial IoT devices and discover how to build an edge-to-cloud solution with the MING stack. The MING stack includes Mosquitto/MQTT, InfluxDB, Node-RED, and Grafana. This solution can be used to improve fleet management, enable predictive maintenance of industrial machines and power generation equipment (i.e. turbines and generators) and increase safety practices (i.e. buildings, construction sites). Join this webinar to learn best practices from industrial IoT SMEs.
In this webinar, Robert Marcer and Jay Clifford dive into:
Best practices for monitoring sensor data collected by everyone — from the edge to the factory
Tips and tricks for using Node-RED and InfluxDB together
Demo — see Node-RED and InfluxDB live
Meet the Founders: An Open Discussion About Rewriting Using Rust – InfluxData
The document is an agenda for a discussion between the CTO and founder of Ockam, Mrinal Wadhwa, and the CTO and founder of InfluxData, Paul Dix, about rewriting products using the Rust programming language. It includes an introduction of the founders, an overview of the discussion topics like why they decided to rewrite in Rust and the challenges they faced, how they got their engineers comfortable with Rust, tips they learned in the process, benefits gained from moving to Rust, and how their communities responded to the switch.
InfluxData is excited to announce the general availability of InfluxDB Cloud Dedicated! It is a fully managed time series database service running on cloud infrastructure resources that are dedicated to a single tenant. With this new offering, we’re excited to provide our customers with additional security options, and more custom configuration options to best suit customers’ workload requirements. Join this webinar to learn more about InfluxDB Cloud, and the new dedicated database service offering!
In this webinar, Balaji Palani and Gary Fowler will dive into:
Key features of the new InfluxDB Cloud Dedicated solution
Use cases for using the newest version of the purpose-built time series database
Live demo
During this 1-hour technical webinar, you’ll also get a chance to ask your questions live.
Gain Better Observability with OpenTelemetry and InfluxDB – InfluxData
Many developers and DevOps engineers have become aware of using their observability data to gain greater insights into their infrastructure systems. InfluxDB is the purpose-built time series database used to collect metrics and gain observability into apps, servers, containers, and networks. Developers use InfluxDB to improve the quality and efficiency of their CI/CD pipelines. Start using InfluxDB to aggregate infrastructure and application performance monitoring metrics to enable better anomaly detection, root-cause analysis, and alerting.
This session will demonstrate how to record metrics, logs, and traces with one library — OpenTelemetry — and store them in one open source time series database — InfluxDB. Zoe will demonstrate how easy it is to set up the OpenTelemetry Operator for Kubernetes and to store and analyze your data in InfluxDB.
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...InfluxData
American Metal Processing Company ("AMP") is the US' largest commercial rotary heat treat facility with customers in the automotive, construction, military, and agriculture industries. They use their atmosphere-protected rotary retort furnaces to provide their clients with three primary hardening services: neutral hardening (quench and temper), carburizing, and carbonitriding.
This furnace style ensures a consistent, uniform heat treatment process compared to traditional batch- or belt-style furnaces; excels at processing high volumes of smaller parts with tight tolerances; and improves the strength and toughness of plain carbon steels. Discover why AMP’s use of Telegraf, InfluxDB, Node-RED, and Grafana allows them to gain 24/7 insights into their plant operations and metallurgical results. Learn how they use time-stamped data to gain accurate metrics about their consumables usage, furnace profiles, and machine status.
Join this webinar as Grant Pinkos dives into:
American Metal Processing's approach to heat treating in a digitized environment through connected systems
Their approach to collecting and measuring sensor data to enable predictive maintenance and improve product quality
Why they need a time series database for managing and analyzing vast amounts of time-stamped data
How Delft University's Engineering Students Make Their EV Formula-Style Race ...InfluxData
Delft University is the oldest and largest technical university in the Netherlands with 25,000+ students. Since 1999, they have had a team of students (undergraduate and graduate) designing, building, and racing cars, as part of the Formula Student worldwide competition. The competition has grown to include teams from 1K+ universities in 20+ countries. Students are responsible for all aspects of car manufacturing (research, construction, testing, developing, marketing, management, and fundraising). Delft University's team includes 90 students across disciplines.
Discover how Delft University's team uses Marple and InfluxDB to collect telemetry and sensor metrics while they develop, test, and race their electric cars. They collect sensor data about their EV's control systems using a time series platform. During races, they collect IoT data about their batteries, accelerometer, gyroscope, tires, etc. The engineers are able to share important car stats during races, which helps the drivers tweak their driving decisions, all with the goal of winning. After races, the entire team is able to analyze the data in Marple to understand what to do better next time. By using Marple + InfluxDB, the team is able to collect, share, and analyze high-frequency car data used to make their car faster at competitions.
Join this webinar as Robbin Baauw and Nero Vanbiervliet dive into:
Marple's approach to empowering engineers to organize, analyze, and visualize their data
Delft University's collaborative methodology to building and racing their Formula-style race car
How InfluxDB is crucial to their collaborative engineering and racing process
Introducing InfluxDB’s New Time Series Database Storage EngineInfluxData
InfluxData is excited to announce the general availability of InfluxDB Cloud's new storage engine! It is a cloud-native, real-time, columnar database optimized for time series data. InfluxDB's rebuilt core was coded in Rust and sits on top of Apache Arrow and DataFusion. InfluxData's team picked Apache Parquet as the persistent format. In this webinar, Paul Dix and Balaji Palani will demonstrate key product features including the removal of cardinality limits!
They will dive into:
The next phase of the InfluxDB platform
How using Apache Arrow's ecosystem has improved InfluxDB's performance and scalability
Key features of InfluxDB Cloud's new core — including SQL native support
Start Automating InfluxDB Deployments at the Edge with balena InfluxData
balena.io helps companies develop, deploy, update, and manage IoT devices. By using Linux containers and other cloud technologies, balena enables teams to quickly and easily build fleets of connected devices. Developers are able to use containers with the language of choice and pull IoT sensor data from 70+ different single board computers into balenaCloud. Discover how to use balena.io to automate your InfluxDB deployments at the edge!
During this one-hour session, experts from balena and InfluxData will demonstrate how to build and deploy your own air quality IoT solution. You will learn:
The fundamentals of IoT sensor deployment and management using balena.
How to use a time series platform to collect and visualize metrics from edge devices.
Tips and tricks to using balenaCloud to automate InfluxDB deployments and Telegraf configurations.
How to use InfluxDB's Edge Data Replication feature to collect sensor data and push it to InfluxDB Cloud for analysis.
No coding experience required, just a curiosity to start your own IoT adventure.
Understanding InfluxDB’s New Storage EngineInfluxData
Learn more about InfluxDB’s new storage engine! The team developed a cloud-native, real-time, columnar database optimized for time series data. We built it all in Rust and it sits on top of Apache Arrow and DataFusion. We chose Apache Parquet as the persistent format, which is an open source columnar data file format. This new storage engine provides InfluxDB Cloud users with new functionality, including the removal of cardinality limits, so developers can bring in massive amounts of time series data at scale.
In this webinar, Anais Dotis-Georgiou will dive into:
Requirements for rebuilding InfluxDB’s core
Key product features and timeline
How Apache Arrow’s ecosystem is used to meet those requirements
Stick around for a demo and live Q&A
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBInfluxData
RudderStack — the creators of the leading open source Customer Data Platform (CDP) — needed a scalable way to collect and store metrics related to customer events and processing times (down to the nanosecond). They provide their clients with data pipelines that simplify data collection from applications, websites, and SaaS platforms. RudderStack's solution enables clients to stream customer data in real time — they quickly deploy flexible data pipelines that send the data to the customer's entire stack without engineering headaches. Customers are able to stream data from any tool using their 16+ SDKs, and they are able to transform the data in transit using JavaScript or Python. How does RudderStack use a time series platform to provide their customers with real-time analytics?
Join this webinar as Ryan McCrary dives into:
RudderStack's approach to streamlining data pipelines with their 180+ out-of-the-box integrations
Their data architecture including Kapacitor for alerting and Grafana for customized dashboards
Why InfluxDB was crucial for fast data collection and for providing a single source of truth for their customers
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...InfluxData
Customers using ThingWorx and the Manufacturing Solutions often need to store property data longer than the Solutions' defaults allow. These customers are advised to use InfluxDB, and this presentation covers the key considerations for moving to InfluxDB vs. the standard ThingWorx value streams. Join this session as Ward highlights ThingWorx’s solution and its easy implementation process.
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022InfluxData
Two new features are coming to Flux that add flexibility and functionality to your data workflow: polymorphic labels and dynamic types. This session walks through these new features and shows how they work.
This document outlines the schedule for Day 2 of InfluxDays 2022, an event hosted by InfluxData. The schedule includes sessions on building developer experience, how developers like to work, an overview of the InfluxDB developer console and API, demos of client libraries and the InfluxDB v2 API, tips for getting involved in the InfluxDB community and university, use cases for networking monitoring, crypto/fintech, monitoring/observability, and IIoT, and closing thoughts. Recordings of all sessions will be made available to registered attendees by November 7th. Upcoming events include advanced Flux training in London and resources through the community forums, Slack channel, and online university.
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...InfluxData
This document contains the agenda for Day 2 of InfluxDays 2022, which includes:
- Welcome and introductory remarks from Zoe Steinkamp and Jay Clifford of InfluxData.
- Fireside chats and presentations on building great developer experiences, how developers like to work, and use cases for InfluxDB from companies like Tesla, InfluxData, and others.
- Sessions on the InfluxDB developer console, APIs, client libraries, getting involved in the community, accelerating time to awesome with InfluxDB University, and tips for analyzing IoT data with InfluxDB.
- Closing thoughts from Zoe Steinkamp and Jay Clifford.
The document summarizes the agenda and sessions for Day 1 of InfluxDays 2022. It includes sessions on InfluxDB data collection, scripting languages like Flux, the InfluxDB time series engine, tasks, storage, and a closing discussion. The agenda involves talks from InfluxData employees on building applications with real-time data, navigating the developer experience, solving problems, the InfluxDB platform, community, education, use cases in crypto/fintech and IIoT, and tips/tricks for analysis.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
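As background for readers unfamiliar with the terminology, here is a minimal single-threaded closed-addressing (chained) table in Rust. It illustrates the property the abstract highlights, that chaining can free a slot instantly on delete, but it has none of DLHT's lock-free, cache-line-bounded, or prefetching machinery; the `ChainedMap` type and its methods are illustrative names, not DLHT's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A minimal closed-addressing (chained) hash table: each index slot
/// holds a small chain of entries, and a delete frees its slot
/// immediately, unlike tombstone-based open addressing.
struct ChainedMap<K, V> {
    buckets: Vec<Vec<(K, V)>>,
}

impl<K: Hash + Eq, V> ChainedMap<K, V> {
    fn new(n_buckets: usize) -> Self {
        Self { buckets: (0..n_buckets).map(|_| Vec::new()).collect() }
    }

    fn bucket(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&mut self, key: K, value: V) {
        let b = self.bucket(&key);
        if let Some(slot) = self.buckets[b].iter_mut().find(|(k, _)| *k == key) {
            slot.1 = value; // update in place
        } else {
            self.buckets[b].push((key, value));
        }
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.buckets[self.bucket(key)]
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v)
    }

    fn remove(&mut self, key: &K) -> Option<V> {
        let b = self.bucket(key);
        let pos = self.buckets[b].iter().position(|(k, _)| k == key)?;
        Some(self.buckets[b].swap_remove(pos).1) // slot freed instantly
    }
}
```

DLHT additionally bounds each chain to a cache line and makes these operations lock-free, which is where the single-memory-access and non-blocking claims come from.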
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
In our second session, we shall learn all about the main features and fundamentals of UiPath Studio that enable us to use the building blocks for any automation project.
📕 Detailed agenda:
Variables and Datatypes
Workflow Layouts
Arguments
Control Flows and Loops
Conditional Statements
💻 Extra training through UiPath Academy:
Variables, Constants, and Arguments in Studio
Control Flow in Studio
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
23. Requirements
• What cardinality?
• Analytics performance
• Separate compute from storage and tiered storage
• Operator defined Replication & Partitioning
• Able to run without locally attached storage
• Bulk data import and export
• Subscriptions
• Federated by design
• Embeddable scripting
• Greater compatibility
50. In-memory Perf Preview (tracing example)
• env - production or staging environment
• data_centre - the region within a cloud vendor
• cluster - a specific cluster, e.g., a k8s cluster
• user_id - an id associated with the user that issued a request that was traced
• request_id - an id associated with a single request that started a trace
• trace_id - a single id associated with all spans in the trace
• node_id - the id of the compute node that the trace execution ran across
• pod_id - the id of containers that the trace execution ran across
• span_id - a random id for every sample generated in the trace
51. Test data cardinalities
104,998,932 rows
• env - 2
• data_centre - 20
• cluster - 200
• user_id - 200,000
• request_id - 2,000,000
• trace_id - 10,000,000
• node_id - 2,000
• pod_id - 20,000
• span_id - ∞ (a new one for each sample row)
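The pressure these numbers put on a per-series index can be seen with a little arithmetic: the worst-case series count is the product of the tags' distinct-value counts, and an unbounded tag like span_id pushes it past the row count itself. A small sketch (the helper name is mine, not anything from IOx):

```rust
/// Worst-case series count is the product of each tag's distinct-value
/// count. Observed cardinality is bounded by the actual row count
/// (~105M here), but a per-series index must be prepared for the product.
fn worst_case_series(tag_cardinalities: &[u128]) -> u128 {
    tag_cardinalities
        .iter()
        .fold(1u128, |acc, &c| acc.saturating_mul(c))
}
```

Even the three lowest-cardinality tags alone (2 x 20 x 200) give 8,000 potential series; adding user_id pushes the worst case to 1.6 billion before the truly unbounded tags are even considered.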
53. Find spans for a trace
SELECT * FROM "traces"
WHERE "trace_id" = '0000MjNg' AND
"time" >= '2020-10-30 15:12' AND
"time" < '2020-10-30 16:12';
54. Find spans for a trace
SELECT * FROM "traces"
WHERE "trace_id" = '0000MjNg' AND
"time" >= '2020-10-30 15:12' AND
"time" < '2020-10-30 16:12';
Returned in: 84.666665ms ~ 1.1B rows/sec
56. Flexible Replication Rules
• Synchronous & Asynchronous
• Push & Pull
• Request by request, batch, or bulk
• Partition to servers, groups of servers
• Total operator control via RESTful API
76. Get Involved
• Star & watch the repo at github.com/influxdata/influxdb_iox
• Find the InfluxDB IOx topic on community.influxdata.com
• Join the #influxdb_iox channel in our community Slack
• Join us on the 2nd Wednesday of every month at 8:30 AM Pacific Time for a tech talk on InfluxDB IOx - influxdata.com/community-showcase/influxdb-tech-talks/
• We’re hiring for Rust, distributed systems, and columnar databases expertise. Email recruiting@influxdata.com and CC me, paul@influxdata.com.
Editor's Notes
Today I want to talk to you about the future of InfluxDB. But before that, let’s talk about some of the big news!
InfluxDB 2.0 open source is now released! This represents a multi-year effort. With our cloud offering, our goal was to switch to a continuous, services-based, cloud-first delivery model that could be billed by usage, not by servers. This means that we ship production code every business day and make continuous incremental improvements, and our Cloud 2 customers only pay for what they use.
For our open source, we wanted to ship an all-in-one database, monitoring system, visualization engine, and scripting scheduler. Flux, our new scripting and query language was the center of this effort. With it, users can now do more than ever before within the database. They can even call out to third-party APIs to bring in more data, send data out, trigger action, or send alerts. This can happen at query time in ad-hoc queries, or scheduled through the Task scheduling system.
Our goal was to ship the same API in our cloud offering and in open source. We think you’ll love the open source InfluxDB 2.0 for local development and deployment at the edge or on single servers within your cloud or data center environment. Ryan Betts, our VP of Engineering will be covering more of the details in the talk right after mine.
For my talk, I wanted to tell you about what we’re thinking for The Future. I realize it may be early to start thinking about this with 2.0 open source just being released today, but I’m very excited about the work some of us have been doing and I want to share it publicly.
But before we get into the future, I need to talk about the past. Specifically, November 12th, 2013, which is the day I gave the first talk about InfluxDB and introduced it to the world. While the next 60 seconds will likely be review for all of you, I hope you’ll bear with me as I set the stage.
The talk was titled: Introducing InfluxDB, an open source distributed time series database.
In that talk I sought to define what I meant by time series data. I pointed to some specific examples.
Metrics were the first and most obvious example of time series, as that was what most people thought of when I talked about it.
I went further to give more examples of events. All things I thought could be analyzed, inspected, visualized and summarized as a time series.
I would later add sensor data to this list of time series examples.
And I talked about two different kinds of time series
More broadly, I claimed that all data you perform analytics on is time series data. Meaning, anytime you’re doing data analysis, you’re doing it either over time or as a snapshot in time.
I saw time series as a useful abstraction for solving problems and building applications in a number of different use cases.
The vision I laid out then is still one I have today, which is that InfluxDB should be useful for all kinds of time series data. It should also be the building block upon which future monitoring, analytics, sensor data, and time series applications can be built.
So where are we today? Some of what I’ll say is generally about the platform and some of it will be specific to open source.
Easy to write data in with libraries in many languages. Easy to query using either InfluxQL or Flux.
With the addition of Flux, there are so many more things that InfluxDB can do beyond what a normal declarative query language can provide. It’s great for analytics. However, the caveat exists that this is only true for lower-cardinality data. That is, you don’t have too many unique time series and your tag values don’t have too many unique values.
InfluxDB lacking distributed features in open source means that it is frequently not chosen as a building block for time series applications. This limitation is unfortunate, but at the time it was a necessary choice that enabled us to build a business to support our open source efforts. However, it definitely gets in the way of our broader platform vision. InfluxDB should be a platform that is adopted by a very wide audience, well beyond the audience of our paying customer base.
We want to push what’s possible with InfluxDB forward. Ideally for both our open source users and our paying customers.
No limits on cardinality. Write any kind of event data and don’t worry about what a tag or field is.
Best-in-class performance on analytics queries in addition to our already well-served metrics queries.
Tiered data storage. The DB should use cheaper object storage as its long-term durable store.
Operator control over memory usage. The operator should be able to define how much memory is used for each of buffering, caching, and query processing.
Operator-controlled replication. The operator should be able to set fine-grained replication rules on each server.
Operator-controlled partitioning. The operator should be able to define how data is split up amongst many servers and on a per-server basis.
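As a toy illustration of a partitioning rule (not IOx's actual scheme, whose whole point is that the operator defines the rule), a series key can be routed to a server by hashing it; the function name and signature here are hypothetical:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Route a series key to one of `n_servers` by hash. This is the
/// simplest possible policy; an operator-defined rule could instead
/// partition by tag value, time range, or any custom criterion.
fn assign_partition(series_key: &str, n_servers: u64) -> u64 {
    let mut h = DefaultHasher::new();
    series_key.hash(&mut h);
    h.finish() % n_servers
}
```

The fixed modulus makes this policy rigid; the design goal stated above is precisely to let the operator swap in per-server rules rather than bake one in.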
Operator control over topology including the ability to break up and decouple server tasks for write buffering and subscriptions, query processing, and sorting and indexing for long term storage.
Designed to run in an ephemeral containerized environment. That is, it should be able to run with no locally attached storage.
Bulk data export and import.
Fine-grained subscriptions for some or all of the data.
Broader ecosystem compatibility. Where possible, we should aim to use and embrace emerging standards in the data and analytics ecosystem.
Run at the edge and in the datacenter. Federated by design.
Embeddable scripting for in-process computation.
Not only does it expand the index; for cases like tracing, where you have new values all the time, the index becomes larger than the time series data itself.
One way around this is to use fields rather than tags, but that is a limiting choice since you don’t have control over how data is organized in the DB, and thus how you might want to organize it outside of the tag system.
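To make the tag-versus-field trade-off concrete, here is a small sketch that builds InfluxDB line protocol points. The `line` helper is a hypothetical name, and field values are passed pre-formatted (strings quoted, integers suffixed with `i`) to keep the sketch short:

```rust
/// Build an InfluxDB line protocol point. Tags are indexed (and inflate
/// the inverted index per distinct value); fields are not. Putting a
/// high-cardinality id such as span_id in a field sidesteps index
/// growth, at the cost of losing tag-based organization, which is the
/// limiting choice described above.
fn line(
    measurement: &str,
    tags: &[(&str, &str)],
    fields: &[(&str, String)],
    ts_ns: i64,
) -> String {
    let tag_part: String = tags
        .iter()
        .map(|(k, v)| format!(",{}={}", k, v))
        .collect();
    let field_part: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect();
    format!("{}{} {} {}", measurement, tag_part, field_part.join(","), ts_ns)
}
```

For example, a span with env and cluster as tags but span_id as a string field produces a point like `traces,env=production,cluster=c42 span_id="0000MjNg",duration_ns=123i 1604070720000000000`.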
In order to support high cardinality use cases, we’d need to ditch the inverted index and also our indexing by individual time series. As our VP of Engineering, Ryan Betts, says: InfluxDB over-indexes for these use cases.
InfluxDB uses memory-mapped files for the inverted index and for the time series data storage. Many modern databases have been built using mmap because it gives you speed of development and offloads memory management to the OS.
The downside is that you lose fine-grained control over how memory is used and allocated. Mmap has also proven tricky in containerized environments.
Finally, we want to be able to run with or without locally attached storage. The way that TSM and TSI organize data doesn’t lend itself well to having some data in object storage, some in memory, and some cached on local SSD.
Once I realized that a gradual refactor wasn’t possible, I started thinking about what it would look like to start new in 2020 rather than 2013. What tools exist today that weren’t at my disposal seven years ago? What other open source could I bring to bear that would speed this effort up?
So we’re building a new core for InfluxDB. And here’s the first thing to know about it.
This project is written in Rust. I’ve written about my excitement for the language before. I think Rust is the future of systems software. It gives us the fine grained control over memory that we’re looking for, but with the safety of a higher level language.
Even better, its model for programming concurrent applications (which most server software, including this project, is) eliminates data races. Within our Go codebase, this has been the source of a number of very hard-to-track-down bugs over the years. Its error handling also helps developers write correct software and reduces the number of runtime bugs you might otherwise create.
Also, it’s embeddable into other languages and systems. This means we can embed it into InfluxDB or other parts of our stack or other analytics systems. We could even compile it down to web assembly and run it in the browser.
There’s so much to love about Rust, but this talk isn’t about that. Ultimately, I want this project to form the basis of future analytics systems for the next few decades and beyond. I remember a blog post Bryan Cantrill wrote about Rust in which he talked about software with longevity; he felt that Rust was a language that would ultimately help you build that kind of software. That’s the bet we’re making here.
The project is InfluxDB IOx, which is short for iron oxide so it’s pronounced InfluxDB eye-ox.
We’ll take a look at the high-level architecture of it, but I just want to caveat this. This project is very early stage. We’ve largely been in research mode validating our assumptions on performance, compression, and functionality. We’re not producing builds yet and we don’t have documentation up yet. But there’s a project README and you can build from source.
We wanted to open this up early so that our community of users could see what we’re doing.
The second thing to know is that this project is built around Apache Arrow. Arrow is an in-memory columnar data specification. But it’s also a persistence format via Apache Parquet, which is widely used both inside and outside the Arrow ecosystem. Most data warehouses and big data processing systems can read and write Parquet data.
Arrow is also Arrow Flight, an RPC specification and high performance client/server framework for transferring large datasets over the network.
Within the Rust part of Arrow is another project called DataFusion, which is a columnar SQL execution engine. We’re building on top of that and contributing to it.
We’re using all of these tools. That makes the big headline with Arrow the fact that we’re no longer creating this database by ourselves. With Arrow as the core, we’re working with contributors around the world that are using these libraries in their own data systems.
This is the big architectural change. InfluxDB IOx is an in-memory columnar database that uses object storage for persistence with data stored in Parquet files.
We looked at the existing open source columnar databases when we were starting out. We wondered if they could form the basis of a future InfluxDB backend. What we found was that they weren’t optimized for time series. Specifically, they have varying degrees of dictionary support, which is critical for our use case, little support for querying directly on compressed in-memory data with late materialization, and they weren’t optimized for windowed aggregates and computation on time. They seemed to be built around a pure analytics use case that asks questions about aggregations up to a single point in time.
Further, they weren’t built with our core need of being able to run in an ephemeral environment with no locally attached storage, using object store for all persistence. Our evaluation pointed to a missing solution in the open source market.
It’s not a storage engine. We’re not building our own storage engine short of buffering data in memory and writing it out to Parquet files. The persistence formats we’re using under the hood are Flatbuffers for the write ahead log and Parquet files for immutable blocks of data.
With Parquet and object storage for persistence, this opens up how you can interact with your data. Backup and restore is outside the concerns of InfluxDB IOx. You can create any kind of backup & restore system you’d like. An IOx server can read some or all of its data from object storage on startup.
Bulk data transfers become trivial. Clients can get Parquet files directly from object storage and they can send Parquet files to InfluxDB IOx to organize in object storage for later query workloads. Thanks to Apache Arrow, there are libraries in many languages to work with Parquet and the support is getting better month over month. Notably, Python, C++ and Java are first class citizens in the Arrow ecosystem. They represent the gold standard of functionality. We’ll help bring Rust up to the same level of compatibility.
Training a machine learning model? Ask IOx where the Parquet files are that have the data you’re looking for, get them directly from object storage, and have it in your Python library of choice, all with a few lines of code.
I should mention that I keep referring to object store, but there are other persistence abstractions as well.
I want to talk quickly about how data is organized in InfluxDB IOx. I think this is important because it shows the flexibility you have as an operator and a user, and it lets you optimize for having large blocks of immutable, unchanging data, which is really what time series is all about. If you’re updating your data, that means you’re literally rewriting history. Sometimes you might do this, but that’s not what we’re optimizing for. We’re optimizing for history being a fixed thing that you can work with easily and transform on the fly at query time.
That means that you have blocks of data that you can move around to other servers, send out to clients, and represent compactly in object storage.
First you have the partition key, which is generated for each line that comes in. It can use any of the metadata or actual data to generate a string that represents the partition key. You could have the measurement name, tag key information or field information or time/date formatting.
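To make this concrete, here is a minimal sketch of partition key generation. The template syntax (`"measurement"`, `"tag:<name>"`, `"time:2h"`) and the 2-hour bucketing are my own assumptions for illustration; IOx lets the operator define the actual scheme per database.

```rust
use std::collections::BTreeMap;

// Build a partition key for an incoming line from its metadata.
// The template pieces here are hypothetical, not IOx's real syntax.
fn partition_key(
    measurement: &str,
    tags: &BTreeMap<&str, &str>,
    epoch_secs: i64,
    template: &[&str],
) -> String {
    template
        .iter()
        .map(|part| match *part {
            "measurement" => measurement.to_string(),
            t if t.starts_with("tag:") => tags.get(&t[4..]).unwrap_or(&"null").to_string(),
            // Index of the 2h window this timestamp falls into.
            "time:2h" => (epoch_secs / 7200).to_string(),
            other => other.to_string(),
        })
        .collect::<Vec<_>>()
        .join("-")
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("region", "us-west");
    // All lines in the same region and 2h window land in the same partition.
    let key = partition_key(
        "cpu",
        &tags,
        1_600_000_000,
        &["measurement", "tag:region", "time:2h"],
    );
    println!("{}", key); // cpu-us-west-222222
}
```

Because the key is just a string computed per line, any mix of measurement name, tag values, and time formatting can drive how data is grouped.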
Partitions are logical groupings of data based on the same partition key. When a partition is snapshotted, you create an immutable block of data. A partition can have multiple blocks, but ideally you’re buffering up everything to snapshot once into a single block. You can always compact blocks later, but this can be a separate process completely outside of the DB.
Blocks have tables of data where a table is once again a logical concept. At the physical level, you have individual Parquet files, which have one table in each and you have in-memory compressed segments that are optimized for query speed with some compression via encoding schemes.
One table per measurement. Tags and fields become columns. One table per Parquet file.
This means that tag and field names must be unique within a measurement.
Schema gets defined and created on the fly as you write data in.
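The points above (one table per measurement, tags and fields as columns, schema created on write) can be sketched as a toy catalog. The type names and `write_line` shape are assumptions for illustration, not IOx’s real API.

```rust
use std::collections::BTreeMap;

// Column kinds a line can contribute. Tags are always strings; fields are typed.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ColumnType { Tag, FieldF64, FieldI64, Time }

// A toy catalog: one table per measurement, columns created on first write.
#[derive(Default)]
struct Catalog {
    tables: BTreeMap<String, BTreeMap<String, ColumnType>>,
}

impl Catalog {
    fn write_line(&mut self, measurement: &str, tags: &[&str], fields: &[(&str, ColumnType)]) {
        let table = self.tables.entry(measurement.to_string()).or_default();
        table.entry("time".to_string()).or_insert(ColumnType::Time);
        for t in tags {
            table.entry((*t).to_string()).or_insert(ColumnType::Tag);
        }
        for (name, ty) in fields {
            table.entry((*name).to_string()).or_insert_with(|| ty.clone());
        }
    }
}

fn main() {
    let mut catalog = Catalog::default();
    // Two writes to the same measurement with different tag/field sets:
    catalog.write_line("cpu", &["host"], &[("usage", ColumnType::FieldF64)]);
    catalog.write_line("cpu", &["host", "region"], &[("idle", ColumnType::FieldF64)]);
    // The "cpu" table's schema is the union: time, host, region, usage, idle.
    let cpu = &catalog.tables["cpu"];
    println!("{} columns: {:?}", cpu.len(), cpu.keys().collect::<Vec<_>>());
}
```

Since tag and field names share a single column namespace per table, this is also why they must be unique within a measurement.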
But it’s a start. And we know that we can switch to Parquet as our persistence format without any fear of some sort of data explosion.
We break data up into partitions. How data is partitioned can change over time, because each partition is self-describing in terms of the summary metadata that specifies what tables it has, what columns each of those tables has, and what the summary information is for each of those columns, like min, max, count, sum, and potentially even bloom filters for identifiers.
This summary data is used by the planner at query time. Partition summaries are kept in memory and the query is analyzed to determine which partitions need to be queried to produce a result. Once in a partition, we brute force query against it, and if we have it in our segment store, that happens against compressed data without decompressing it. That is, we perform late materialization and only decompress the values we use.
This means that the partitioning scheme you choose has great impact on what your queries look like. This is why we let the users define it when they create a database/bucket. It can change on a per-database basis.
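The planner’s pruning pass described above can be sketched in a few lines. I’m assuming only a min/max summary on the time column here; the real summaries cover other columns and statistics too.

```rust
// Minimal per-partition summary: just min/max of the time column (assumed shape).
struct TimeSummary { min: i64, max: i64 }

struct Partition { key: String, time: TimeSummary }

// Keep only partitions whose [min, max] time range overlaps the query window.
// Everything else is skipped without touching any column data.
fn prune<'a>(parts: &'a [Partition], lo: i64, hi: i64) -> Vec<&'a str> {
    parts
        .iter()
        .filter(|p| p.time.max >= lo && p.time.min <= hi)
        .map(|p| p.key.as_str())
        .collect()
}

fn main() {
    // Three 2h partitions covering consecutive windows (times in epoch seconds):
    let parts = vec![
        Partition { key: "window-0".into(), time: TimeSummary { min: 0, max: 7_199 } },
        Partition { key: "window-1".into(), time: TimeSummary { min: 7_200, max: 14_399 } },
        Partition { key: "window-2".into(), time: TimeSummary { min: 14_400, max: 21_599 } },
    ];
    // A query window inside the second partition touches only that partition:
    println!("{:?}", prune(&parts, 8_000, 9_000)); // ["window-1"]
}
```

Only the partitions that survive this check are brute-force queried, which is why the partitioning scheme matters so much for query shape.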
We can likely do better. We’re using RLE for the span IDs and trace IDs, and we’d be better off just going with dictionary encoding without the RLE.
Notice that we have time in this example. If you’re looking up by some trace ID, where’d you get it? From a log line? You’ll have a timestamp associated with it. Use it.
If you’re partitioning your data by time, and in most cases this will be at least one of the criteria by which you partition your data, you can quickly narrow down the blocks of data to query against. If you have 2h partitions, then you’ll be able to find the spans you’re looking for by querying at most 2 partitions.
This returns the 10 rows in about 85 milliseconds. If you do the rough math on this it means it was able to brute force on about 1.1B rows/sec. Note that we didn’t actually process all those rows. It was operating on compressed data.
We can likely get this down by a bit more by removing the RLE compression for trace ID and span. Maybe another 2x improvement.
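The RLE point above is easy to see with a minimal sketch of run-length encoding. The function below is a generic illustration, not IOx’s actual encoder.

```rust
// Run-length encode a column as (value, run length) pairs. RLE wins on
// low-cardinality (especially sorted) columns, but on unique values like
// trace/span IDs every run has length 1, so it stores *more*, not less.
fn rle<T: PartialEq + Clone>(vals: &[T]) -> Vec<(T, usize)> {
    let mut runs: Vec<(T, usize)> = Vec::new();
    for v in vals {
        match runs.last_mut() {
            Some((last, n)) if *last == *v => *n += 1,
            _ => runs.push((v.clone(), 1)),
        }
    }
    runs
}

fn main() {
    // A sorted, low-cardinality tag column collapses into a few runs:
    let region = ["east", "east", "east", "west", "west"];
    println!("{} runs for {} values", rle(&region).len(), region.len()); // 2 for 5

    // Unique IDs produce one run per row, paying a run length per value;
    // a plain dictionary encoding would serve them better.
    let trace_ids = ["a1", "b2", "c3", "d4"];
    println!("{} runs for {} values", rle(&trace_ids).len(), trace_ids.len()); // 4 for 4
}
```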
The specifics of the compressed in memory columnar store will definitely be the subject of some future tech talks.
Here’s what I think the real future is. The example I just showed takes a data-center-centric view. It assumes that all your data is getting pushed up to some central cluster. I think the future is federated. It operates at the edges as single nodes, it operates in factories in small clusters, and it operates in many data centers worldwide.
You’ll likely have high-precision data that doesn’t make sense to replicate up to a central place. Or at least you’ll only replicate it in highly compressed form. The future distributed time series system isn’t a cluster that runs in a data center, even if it has rack-aware capabilities and multi-region routing.
There’s no limit to the scale of time series data that we’ll be collecting over the coming decades. We need flexibility in how it’s replicated, queried, and stored.
* Created InfluxDB because we saw so many people re-inventing the wheel and we wanted Influx to be the basis of it
* However, the lack of distributed features left a gap in the market
* Infrastructure projects that fall under source available or community licenses severely limit the audience and what you can build
* InfluxDB IOx is dual-licensed under MIT and Apache 2 as is common in the Rust community. No community license, no source available license, no restrictions. You can build new projects using this code, you can build new businesses using this code, you can do whatever you want with it.
Conway’s law says that you ship your org chart. That is, if you create two teams to build a system, you’ll get a system composed of two parts.
I propose Dix’s maxim as it relates to open source and licensing generally, which is that your licensing strategy is your commercialization strategy, whether by accident or design.
The architecture approaches for IOx are deliberate choices because of not only the functionality and operational properties we wanted in the system, but also in how we plan to commercialize it.
InfluxDB IOx is designed to be a shared-nothing server with an API that gives the operator total control over how it behaves. However, the operator must make those changes as they are needed. Who does this operational coordination?
In the most simple setups of a single server, you don’t worry about it. In two server setups you can likely get by with shell scripts and a cron job.
But the more complex your environment becomes, the more complicated this coordination becomes. It was a design goal for us to separate out the core database work from the operational work across a fleet of servers. We will create this software for our own needs to operate our cloud environment. However, our cloud may be different than yours. Your environment may be different. This is why the operational coordination is kept separate. So there is maximum flexibility in topology and configuration.
We plan to run the InfluxDB IOx open source bits as is in our own cloud. We won’t be running a fork, we’ll be running right off the main branch.
At the beginning of this talk I mentioned my introduction of InfluxDB to the world. And I titled it this.
I’ll be giving more talks about InfluxDB IOx over the coming months. But here’s how I’m thinking about it. Yes, it’s a distributed time series database. But it’s a lot more than just that.
It’s federated and this is a core part of its design. With time series and analytics data, the future is federated. The scale is larger than you’ll want to manage and push up to a single cluster. You’ll have edge, multiple data centers, and many thousands of potential nodes all communicating with each other.