Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra... | DataStax
Cassandra is getting more and more buzz, and that means two things: more development and more issues. Some issues are unavoidable, but others can be avoided just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they impose the way we should work with it using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Master's degree in CS at Salamanca University, Spain. He worked there for a few years in a digital agency, gaining expertise in a very wide range of technologies, before moving to London, where he narrowed his focus to the backend and data engineering disciplines. The latest step in his professional career was to move back to Madrid to work for Job and Talent, where he currently helps build the best technology for matching candidates to job openings. Aside from work, he likes sharing as much as he can through public speaking, mentoring, or getting involved in OSS or OpenData initiatives.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
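As a rough illustration of the bitmap index idea mentioned in this abstract, the sketch below builds one bitmap per distinct column value, answers a two-column filter with a bitwise AND, and run-length encodes a bitmap. It is a minimal sketch under stated assumptions: the column data and function names are invented for illustration, Python integers stand in for bit arrays, and plain run-length encoding is used as a stand-in for the far more elaborate compressed formats used by production systems.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


def run_length_encode(bitmap, length):
    """Encode the first `length` bits as (bit, run_length) pairs, lowest row first."""
    runs = []
    for row in range(length):
        bit = (bitmap >> row) & 1
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


# Hypothetical columns of a six-row table.
country = ["ES", "ES", "UK", "ES", "UK", "UK"]
status = ["ok", "err", "ok", "ok", "ok", "err"]

country_index = build_bitmap_index(country)
status_index = build_bitmap_index(status)

# Rows where country == "ES" AND status == "ok": a single bitwise AND.
matches = rows_of(country_index["ES"] & status_index["ok"])
encoded = run_length_encode(country_index["ES"], len(country))
```

The filter touches only two bitmaps rather than every row, which is the property that makes this kind of index attractive for analytical queries.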
This presentation was inspired by reading "TimeSeries Databases" by Ted Dunning & Ellen Friedman.
I have tried to summarize many of the earlier benchmarks and hope others find it useful. The slides were compiled in early 2015, so some of the results might have changed, but the core literature should still hold.
A Day in the Life of a Druid Implementor and Druid's Roadmap | Itai Yaffe
Benjamin Hopp (Solutions Architect) @ Imply:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets.
This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit.
Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics.
Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.
The most important contributor to a fast analytical setup is getting the data model right.
The talk will center around various choices you can make to prepare your data to get best possible query performance.
We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes.
We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.
We’ll also look at Druid-specific optimizations that take advantage of approximations, where you can trade accuracy for performance and reduced storage.
You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more.
And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.
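The roll-up idea mentioned in this abstract can be sketched in a few lines: raw events are truncated to a coarser time granularity and pre-aggregated per combination of dimension values, so queries scan fewer rows. This is only an illustration in plain Python, not Druid's ingestion code; the event fields (`timestamp`, `country`, `bytes`) and the hourly granularity are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions):
    """Aggregate events per (hour, dimension values), keeping a count and a byte sum."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        # Truncate the timestamp to the hour: the assumed query granularity.
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[name] for name in dimensions)
        totals[key]["count"] += 1
        totals[key]["bytes"] += event["bytes"]
    return dict(totals)


# Hypothetical raw events.
events = [
    {"timestamp": datetime(2019, 1, 1, 10, 5), "country": "ES", "bytes": 100},
    {"timestamp": datetime(2019, 1, 1, 10, 40), "country": "ES", "bytes": 250},
    {"timestamp": datetime(2019, 1, 1, 10, 59), "country": "UK", "bytes": 70},
    {"timestamp": datetime(2019, 1, 1, 11, 1), "country": "ES", "bytes": 30},
]

rolled = roll_up(events, ["country"])
```

Four raw events collapse into three stored rows here. The saving grows with the number of events that share an hour and a dimension combination, at the cost of losing the individual events.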
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... | DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTLs.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect, and day one employee of ProtectWise, Inc., specializing in massive real time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
Chronix Time Series Database - The New Time Series Kid on the Block | QAware GmbH
Apache Big Data Conference 2016, Vancouver BC: Talk by Florian Lautenschlager (@flolaut, Senior Software Engineer).
Abstract: There is a new open source time series database on the block that allows one to store billions of time series points and access them within a few milliseconds.
Chronix is a young but mature open source time series database that achieves a compression rate of 98% compared to data in CSV files, with an average query time of 21 milliseconds. Chronix is built on top of Apache Solr, a bulletproof NoSQL database with impressive search capabilities. Chronix relies on Solr plugins, and anyone running Solr can create a new Chronix core within a few minutes.
In this presentation Florian shows how Chronix achieves its efficiency in both respects by means of ideal chunking, by selecting the best compression technique, by enhancing the stored data with pre-computed attributes, and by specialized time series query functions.
The Internet of Things is currently a burgeoning market and is often associated with specialized data stores. However, PostgreSQL is just as capable for this use case and can offer some compelling advantages. We’ll explore various ways to store and structure IoT data in PostgreSQL, and how range types and different types of indexes can be of use. We’ll also take a quick look at some extensions designed for this use case. Finally, we’ll look at powerful SQL features that can really help when analyzing IoT data streams, and at how the power of a real SQL database can be a key advantage.
Myths of Big Partitions (Robert Stupp, DataStax) | Cassandra Summit 2016 | DataStax
Large partitions shall no longer be a nightmare. That is the goal of CASSANDRA-11206.
100MB and 100,000 cells are the recommended limits for a single partition in Cassandra up to 3.5. Exceeding these limits can cause a lot of trouble: repairs and compactions can fail, and reads can cause out-of-memory failures.
This talk provides a deep dive into the reasons for the previous limitations, why exceeding them caused trouble, how the improvements in Cassandra 3.6 help with big partitions, and why you should not blindly let your partitions get huge.
About the Speaker
Robert Stupp Solution Architect, DataStax
Robert works as a Solutions Architect at DataStax and is also a Committer to Apache Cassandra. Before joining DataStax he worked with his customers to architect and build distributed systems using Cassandra, and he has long experience in building distributed backend systems, mostly using Java as his language of choice.
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot | Citus Data
Citus is a sharding extension for postgres that can efficiently distribute a wide range of SQL queries. It uses postgres' planner hook to transparently intercept and plan queries on "distributed" tables. Citus then executes the queries in parallel across many servers, in a way that delegates most of the heavy lifting back to postgres.
Within Citus, we distinguish between several types of SQL queries, which each have their own planning logic:
Local-only queries
Single-node “router” queries
Multi-node “real-time” queries
Multi-stage queries
Each type of query corresponds to a different use case, and Citus implements several planners and executors using different techniques to accommodate the performance requirements and trade-offs for each use case.
This talk will discuss the internals of the different types of planners and executors for distributing SQL on top of postgres, and how they can be applied to different use cases.
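To make the difference between single-node "router" queries and multi-node queries more concrete, here is a toy sketch of hash-based shard routing. It is not Citus code and does not reflect its actual hashing or planner logic: the shard count, the `tenant_id` distribution column, and the `plan` function are all hypothetical, chosen only to show why a filter on the distribution column lets a query go to a single shard.

```python
import zlib

SHARD_COUNT = 4  # hypothetical number of shards


def shard_for(value):
    """Pick a shard from a stable hash of the distribution column value."""
    return zlib.crc32(str(value).encode("utf-8")) % SHARD_COUNT


def plan(filters, distribution_column="tenant_id"):
    """Return the shards a query must visit, given its equality filters."""
    if distribution_column in filters:
        # "Router"-style case: the filter pins the query to one shard.
        return [shard_for(filters[distribution_column])]
    # Otherwise every shard may hold matching rows, so fan out to all of them.
    return list(range(SHARD_COUNT))


router_plan = plan({"tenant_id": 42, "status": "open"})
fan_out_plan = plan({"status": "open"})
```

In this sketch `router_plan` contains exactly one shard, while `fan_out_plan` contains all four, whose partial results would then have to be merged.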
DataStax and Esri: Geotemporal IoT Search and Analytics | DataStax Academy
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
In this presentation, we discuss how Elasticsearch handles operations such as insert, update, and delete. We also cover what an inverted index is and how segment merging works.
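For readers unfamiliar with the term, an inverted index maps each term to the set of documents that contain it, so a search looks up terms instead of scanning documents. The sketch below is a minimal, assumption-laden illustration: it tokenizes by lowercasing and splitting on whitespace, the sample documents are invented, and it ignores everything a real search engine adds on top (analysis, scoring, segments, merging).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical log-like documents.
docs = {
    1: "disk full on node one",
    2: "node two restarted",
    3: "disk replaced on node two",
}
index = build_inverted_index(docs)
disk_node = search(index, "disk node")
```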
When we talk about bucketing, we essentially mean ways to split Cassandra partitions into several smaller parts rather than having only one large partition.
Bucketing of Cassandra partitions can be crucial for optimizing queries, preventing large partitions, or fighting the TombstoneOverwhelmingException, which can occur when too many tombstones are created.
In this talk I want to show how to recognize large partitions during data modeling. I will also show different strategies we used in our projects to create, use and maintain buckets for our partitions.
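One common way to bucket, sketched here under stated assumptions, is to add a time bucket to the partition key so that each sensor gets one partition per day instead of a single ever-growing one. The `sensor_id` key, the daily bucket size, and the helper names are hypothetical and are not taken from the talk. The code only computes keys and does not talk to Cassandra.

```python
from datetime import datetime, timedelta


def partition_key(sensor_id, timestamp):
    """Composite partition key: the sensor plus the day the reading falls in."""
    return (sensor_id, timestamp.date().isoformat())


def buckets_between(start, end):
    """List every daily bucket a time-range query has to read, in order."""
    days = (end.date() - start.date()).days
    return [
        (start.date() + timedelta(days=offset)).isoformat()
        for offset in range(days + 1)
    ]


key = partition_key("sensor-1", datetime(2016, 9, 7, 23, 59))
to_read = buckets_between(datetime(2016, 9, 7, 22, 0), datetime(2016, 9, 9, 1, 0))
```

The trade-off is visible in `buckets_between`: writes are spread over bounded partitions, but a range query now has to read one partition per bucket it spans, so the bucket size should match the typical query window.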
About the Speaker
Markus Hofer IT Consultant, codecentric AG
Markus Hofer works as an IT Consultant for codecentric AG in Minster, Germany. He works on microservice architectures backed by DSE and/or Apache Cassandra. Markus supports and trains customers building Cassandra-based applications.
HBaseCon 2015: HBase as an IoT Stream Analytics Platform for Parkinson's Dise... | HBaseCon
In this session, you will learn about a solution developed in partnership between Intel and the Michael J. Fox Foundation to enable breakthroughs in Parkinson's disease (PD) research by leveraging wearable sensors and smartphones to monitor PD patients' motor movements 24/7. We'll elaborate on how we're using HBase for time-series data storage and integrating it with various stream, batch, and interactive technologies. We'll also review our efforts to create an interactive querying solution over HBase.
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec... | DataStax
Running a Cassandra cluster in AWS that can store petabytes of data can be costly. This talk will detail the novel approach of using approximate data structures to keep costs low while retaining insightful, up-to-date query results. The talk will explore a number of real-world examples from our environment to demonstrate the power of approximate data. It will cover determining how many IP addresses are on a network, ranking IPs by traffic, and finally determining approximate min, max, and average values. The talk will also cover how this data is laid out in Cassandra so that a query always returns up-to-date data without burdening the compactor.
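To give a feel for the kind of approximate data structure the abstract refers to, here is a sketch of a k-minimum-values (KMV) estimator for counting distinct IP addresses. The talk does not say which structures were used, so KMV is an assumption chosen for its simplicity, and the function names and sample addresses are invented. The estimator keeps only the k smallest hash values and infers the distinct count from how tightly they are packed.

```python
import hashlib
import heapq


def hash_to_unit_interval(item):
    """Hash a string to a float in [0, 1) using the first 8 bytes of SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(items, k=1024):
    """Estimate the number of distinct items while remembering at most k hashes."""
    heap = []     # negated hashes: heap[0] is minus the largest kept hash
    kept = set()  # the hashes currently in the heap, for duplicate checks
    for item in items:
        h = hash_to_unit_interval(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with this smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    return int((k - 1) / -heap[0])


# 10,000 distinct hypothetical addresses, each seen three times.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10000)] * 3
approx = estimate_distinct(ips)
```

The memory used is bounded by k regardless of traffic volume, and the relative error shrinks roughly with the square root of k, which is the accuracy-for-cost trade the abstract describes.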
About the Speaker
Ben Kornmeier Engineer, ProtectWise
Ben is a Staff Engineer at ProtectWise. When he is not building realtime processing pipelines, he enjoys hiking, biking, and keeping his dog out of trouble.
I have a good shard key now what - Advanced Sharding | David Murphy
Are you currently sharding your deployment or thinking about doing so? This talk will cover what to expect and what you should consider during the process, with references to basic sharding resources. It will also cover what to look for when running a sharded cluster. Finally, it will provide an overview of a new tool that makes understanding chunk sizes easy.
Video in French at https://www.youtube.com/watch?v=9LNnNh63rBI
Sizing an Elasticsearch cluster has to consider many dimensions. In this presentation we go through the different elements and features you should consider to handle big and varying loads of log data.
Managing your black friday logs Voxxed Luxembourg | David Pilato
Monitoring a complex application is not an easy task, but with the right tools it is not rocket science either. However, peak periods such as "Black Friday" sales or the Christmas season can push your application to the limits of what it can handle or, worse, crash it. Because the system is under heavy load, it generates even more logs, which can also strain your monitoring system.
In this session, I will cover best practices for using the Elastic Stack to centralize and monitor your logs. I will also share some tips and tricks to help you get through your Black Fridays without trouble!
We will cover:
* Monitoring architectures
* Finding the optimal size for the _bulk API
* Distributing the load
* Index and shard sizes
* Optimizing disk I/O
You will leave the session with: best practices for building a monitoring system with the Elastic Stack, and advanced tuning to optimize ingestion and search performance.
Managing your black friday logs - Code Europe | David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
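The "optimal bulk size" topic can be made concrete with a small sketch that groups log documents into bulk request bodies of a configurable size. It builds the newline-delimited JSON body (an action line followed by the document source, each terminated by a newline) but deliberately does not send anything, so no client library is assumed. The index name, the documents, and the `bulk_size` value are hypothetical; the right size is something to measure on your own cluster, not a value this example can supply.

```python
import json


def bulk_bodies(documents, index_name, bulk_size):
    """Yield one newline-delimited bulk body per group of `bulk_size` documents."""
    for start in range(0, len(documents), bulk_size):
        lines = []
        for doc in documents[start:start + bulk_size]:
            lines.append(json.dumps({"index": {"_index": index_name}}))
            lines.append(json.dumps(doc))
        # The body must end with a newline after the last line.
        yield "\n".join(lines) + "\n"


# Hypothetical log events.
logs = [{"message": f"request {i}"} for i in range(5)]
bodies = list(bulk_bodies(logs, "logs-black-friday", bulk_size=2))
```

Five documents with a bulk size of two produce three requests instead of five. Tuning then means increasing `bulk_size` step by step while watching indexing throughput until it stops improving.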
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Managing your Black Friday Logs NDC Oslo | David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* cluster sizing
* optimizing disk IO
* monitoring queries
* monitoring your monitoring system :P
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day | C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2mAKgJi.
Ian Nowland and Joel Barciauskas talk about the challenges Datadog faces as the company has grown its real-time metrics systems that collect, process, and visualize data to the point they now handle trillions of points per day. They also talk about how the architecture has evolved, and what they are looking to in the future as they architect for a quadrillion points per day. Filmed at qconnewyork.com.
Ian Nowland is the VP Engineering Metrics and Alerting at Datadog. Joel Barciauskas currently leads Datadog's distribution metrics team, providing accurate, low latency percentile measures for customers across their infrastructure.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre... | InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
Data engineering Stl Big Data IDEA user groupAdam Doyle
Modern day Data Engineering requires creating reliable data pipelines, architecting distributed systems, designing data stores, and preparing data for other teams.
We’ll describe a year in the life of a Data Engineer who is tasked with creating a streaming data pipeline and touch on the skills necessary to set one up using Apache Spark.
Slides from the April 2019 meeting of the St. Louis Big Data IDEA meetup.
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
In this InfluxDays NYC 2019 presentation, InfluxData VP of Products Tim Hall and Sales Engineer Sam Dillard discuss architecture patterns with InfluxEnterprise time series platform. They cover an overview of InfluxEnterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice. Presentation highlights include InfluxEnterprise cluster architecture and how to determine if you're ready for adopting InfluxEnterprise.
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Realtime Indexing for Fast Queries on Massive Semi-Structured DataScyllaDB
Rockset is a realtime indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset are automatically indexed and a fully featured SQL engine powers fast queries over semi-structured data without requiring any database tuning. Rockset exploits the hardware fluidity available in the cloud and automatically grows and shrinks the cluster footprint based on demand. Available as a serverless cloud service, Rockset is used by developers to build data-driven applications and microservices.
In this talk, we discuss some of the key design aspects of Rockset, such as Smart Schema and Converged Index. We describe Rockset's Aggregator Leaf Tailer (ALT) architecture that provides low latency queries on large datasets. Then we describe how you can combine lightweight transactions in ScyllaDB with realtime analytics on Rockset to power a user-facing application.
Similar to Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 (20)
Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Codemotion
Increased complexity makes it very hard and time-consuming to keep your software bug-free and secure. We introduce fuzz-testing as a method for automatically and continuously discovering vulnerabilities hidden in your code. The talk will explain how fuzzing works and how to integrate fuzz-testing into your Software Development Life Cycle to increase your code’s security.
Pompili - From hero to_zero: The FatalNoise neverending storyCodemotion
It was 1993 when we decided to venture in a beat'em up game for Amiga. The Catalypse's success story pushed me and my comrade to create something astonishing for this incredible game machine... but things went harder, assumptions were slightly different, and italian competitors appeared out of nowhere... the project died in 1996. Story ended? Probably not...
The Commodore 65 is a prototype personal computer that Commodore was supposed to bring to market as the successor to the Commodore 64. Unfortunately, its development stopped at the prototype stage. I will tell the fascinating story of its development and why the project was cancelled just one step away from its commercial release.
Reliving the thrill of designing an old computer or an arcade machine is possible today thanks to FPGAs, programmable logic devices that allow anyone to design their own hardware or recreate hardware from the past. This session tells how reverse engineering the hardware of old glories such as the Commodore 64 and the ZX Spectrum made it possible to bring them back to life with technologies that are now within everyone's reach.
Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Codemotion
There's a lot of talk about blockchain, but how does the technology behind it actually work? For developers, getting some hands-on experience is the fastest way to get familiar with new technologies. So let's build a blockchain, then! In this session, we're going to build one in plain old Java, and have it working in 40 minutes. We'll cover key concepts of a blockchain: transactions, blocks, mining, proof-of-work, and reaching consensus in the blockchain network. After this session, you'll have a better understanding of core aspects of blockchain technology.
Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Codemotion
When was the last time you were truly lost? Thanks to the maps and location technology in our phones, a whole generation has now grown up in a world where getting lost is truly a thing of the past. Location technology goes far beyond maps in the palm of our hand, however. In this talk, we will explore how a ridesharing app works. How do we discover our destination? How do we find the closest driver? How do we display this information on a map? How do we find the best route? To answer these questions, we will be learning about a variety of location APIs, including Maps, Positioning, Geocoding etc.
Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Codemotion
Eward Driehuis, SecureLink's research chief, will guide you through the bumpy ride we call the cyber threat landscape. As the industry has over a decade of experience of dealing with increasingly sophisticated attacks, you might be surprised to hear more attacks slip through the cracks than ever. From analyzing 20.000 of them in 2018, backed by a quarter of a million security events and over ten trillion data points, Eward will outline why this happens, how attacks are changing, and why it doesn't matter how neatly or securely you code.
Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 - Codemotion
The IoT revolution is over. Thanks to hardware improvements, building an intelligent ecosystem is easier than ever before for both startups and large-scale enterprises. The real challenge now is to connect, process, store and analyze data: in the cloud, but also at the edge. We’ll take a quick look at frameworks that aggregate data from dispersed devices into a single, globally optimized system, making it possible to improve operational efficiency, predict maintenance, track assets in real time, secure cloud-connected devices and much more.
Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Codemotion
What if Virtual Reality glasses could transform your environment into a three-dimensional work of art in realtime in the style of a painting from Van Gogh? One of the many interesting developments in the field of Deep Learning is the so-called "Style Transfer". It describes a possibility to create a patchwork (or pastiche) from two images. While one of these images defines the artistic style of the result picture, the other one is used for extracting the image content. A team from TNG Technology Consulting managed to build an AI showcase using OpenCV and Tensorflow to realize such goggles.
Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Codemotion
Blockchain (and Cryptocurrency) is an evolution of 20-year old research from scientists like Chaum, Lamport, and Castro & Liskov. Due to the current hype, it's hard to distinguish beneficial aspects of the technology from a desire for a "silver bullet" for device security, verifiable logistics, or "saving democracy". The problem: blockchain introduces new security challenges - and blind adoption without understanding reduces overall security. In this talk, Melanie Rieback and Klaus Kursawe explain the pitfalls and limits of blockchain, so you can avoid making your applications LESS secure.
Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Codemotion
Networking is a core part of computing in the digital world we inhabit. But, how well do you know how it works? Do you understand all the moving parts of the OSI stack inside your computer, and how the network is actually put together? How can this ever work? This guided safari of layers, standards, protocols, and happenstance will bring us close to the copper wire, and up through the layers of CSMA/CD, ARP, routing and HTTP. We will make a few excursions through patchworks that still work forty years later, and cleverly designed mechanisms that show that simplicity is the only way to last.
Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Codemotion
Performance tests are not only an important instrument for understanding a system and its runtime environment; they are also essential for checking stability and scalability, non-functional requirements that might be decisive for success. But won't my cloud hosting service scale for me as long as I can afford it? Yes, but… It only operates and scales resources. It won't automatically make your system fast, stable and scalable. This talk shows how these and similar questions can be clarified with performance tests and how DevOps teams benefit from regular test practice.
Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Codemotion
Sascha will demonstrate the opportunities and challenges of Conversational AI learned from the practice. Both Technology and User Experience will be covered introducing a process finding micro-moments, writing happy paths, gathering intents, designing the conversational flow, and finally publishing on almost all channels including Voice Services and Chatbots. Valuable for enterprises, developers, and designers. All live on stage in just minutes and with almost no code.
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Codemotion
A key challenge we face at Pacmed is quickly calibrating and deploying our tools for clinical decision support in different hospitals, where data formats may vary greatly. Using Intensive Care Units as a case study, I’ll delve into our scalable Python pipeline, which leverages Pandas’ split-apply-combine approach to perform complex feature engineering and automatic quality checks on large time-varying data, e.g. vital signs. I’ll show how we use the resulting flexible and interpretable dataframes to quickly (re)train our models to predict mortality, discharge, and medical complications.
Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Codemotion
Coolblue is a proud Dutch company, with a large internal development department; one that truly takes CI/CD to heart. Empowerment through automation is at the heart of these development teams, and with more than 1000 deployments a day, we think it's working out quite well. In this session, Pat Hermens (a Development Manager) will step you through what enables us to move so quickly, which tools we use, and most importantly, the mindset that is required to enable development teams to deliver at such a rapid pace.
James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...Codemotion
Quantum computers can use all of the possible pathways generated by quantum decisions to solve problems that will forever remain intractable to classical compute power. As the mega players vie for quantum supremacy and Rigetti announces its $1M "quantum advantage" prize, we live in exciting times. IBM-Q and Microsoft Q# are two ways you can learn to program quantum computers so that you're ready when the quantum revolution comes. I'll demonstrate some quantum solutions to problems that will forever be out of reach of classical, including organic chemistry and large number factorisation.
Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Codemotion
Chinese food exploded across America in the early 20th century, rapidly adapting to local tastes while also spreading like wildfire. How was it able to spread so fast? The GY6 is a family of scooter engines that has achieved near total ubiquity in Europe. It is reliable and cheap to manufacture, and it's made in factories across China. How are these factories able to remain afloat? Chinese-American food and the GY6 are both riveting studies in product-market fit, and both are the product of a distributed open source-like development model. What lessons can we learn for open source software?
Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Codemotion
The design space has exploded in size within the last few years and Sketch is one of the most important milestones to represent the phenomenon. But behind the scenes of this growing reality there is a remote team that revolutionizes the design space all without leaving the home office. This talk will present how Sketch has grown to become a modern, product designer's tool.
Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Codemotion
Would you fly in a plane designed by a craftsman or would you prefer your aircraft to be designed by engineers? We are learning that science and empiricism work in software development; maybe now is the time to redefine what “Software Engineering” really means. Software isn't bridge-building, it is not car or aircraft development either, but then neither is Chemical Engineering. Engineering is different in different disciplines. Maybe it is time for us to begin thinking about retrieving the term "Software Engineering"; maybe it is time to define what our "Engineering" discipline should be.
Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Codemotion
What is the job of a CTO and how does it change as a startup grows in size and scale? As a CTO, where should you spend your focus? As an engineer aspiring to be a CTO, what skills should you pursue? In this inspiring and personal talk, I describe my journey from early Red Hat engineer to CTO at Bloomon. I will share my view on what it means to be a CTO, and ultimately answer the question: Should the CTO be coding?
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT, I was wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and offer you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss which cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
18. The Elastic Journey of an Event
[Diagram: Beats (Log Files, Metrics, Wire Data, your{beat}); Elasticsearch (Master Nodes (3), Ingest Nodes (X), Data Nodes Hot (X), Data Nodes Warm (X), X-Pack); Kibana (Instances (X), X-Pack); Events]
19. The Elastic Journey of an Event
[Diagram: same components as slide 18, plus OOTB Dashboards]
20. The Elastic Journey of an Event
[Diagram: Beats, Elasticsearch and Kibana as on slide 18, plus Logstash (Nodes (X), X-Pack)]
21. The Elastic Journey of an Event
[Diagram: same components as slide 20, plus further sources: Data Store, Web APIs, Social, Sensors]
22. The Elastic Journey of an Event
[Diagram: same components as slide 21, plus Notification, Queues, Storage, Metrics]
23. The Elastic Journey of an Event
[Diagram: same components as slide 22, plus Persistent Queues]
24. The Elastic Journey of an Event
[Diagram: same components as slide 23, plus Kafka]
29. Shard: the basic working unit
• Each shard is a Lucene index
• Shards are not free
• Each shard adds some overhead
30. Not all shards are created equal
[Diagram: Node A and Node B; the primary of index twitter holds d1, d2 and d6]
Example: index twitter (primary: 1 / rep. factor: 1)
31. Not all shards are created equal
[Diagram: Node A and Node B; the primary and the replica of index twitter each hold d1, d2 and d6]
Example: index twitter (primary: 1 / rep. factor: 1)
32. Not all shards are created equal
[Diagram: same primary and replica layout as on slide 31, with a Write operation shown]
39. Not all shards are created equal
• You can change the # of replicas at any time
PUT /twitter/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}
40. Not all shards are created equal
• You can change the # of replicas at any time
• You can’t do exactly the same with primaries
PUT /twitter/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}
# Not allowed on an existing index: the number of primaries is fixed at index creation
PUT /twitter/_settings
{
  "index" : {
    "number_of_shards" : 2
  }
}
48. Shard Size
• Generally depends on many different factors
‒ document size, mapping, hardware, use case, kinds of queries being executed, desired response time, peak indexing rate, budget…
• Rules of thumb (logging use case only):
‒ shards have overhead: avoid ending up with a gazillion small (~KB, MB) shards
‒ average shard size on the order of gigabytes
‒ max ~30-40 GB per shard
49. Sizing exercise
• ~1000 events per second
• 60s * 60m * 24h * 1000 events => ~86.4M events per day
• 1 KB per event => ~82 GB per day
• 3 months (~90 days) => ~7 TB
• Simplification: actual indexed data will take more space
51. Sizing exercise
• Cluster my_cluster: 3 months of logs
• Data size: ~7 TB
• Shard size: ~10 GB
• Total primary shards: ~716
• Replica factor: 1 -> 1432 shards in total
• Total store size: ~14 TB
• Assuming 16 GB heap per node and ~15 shards per GB of heap
• 1432 / (16 GB x 15 shards) ≈ 5.97
• Total servers: ~6 (data nodes)
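The arithmetic of the sizing exercise can be replayed in a few lines. This is only a sketch under the slides' own simplifications: 1024-based units, "3 months" taken as 90 days (an assumption), the retained volume rounded down to 7 TB before sharding, and the "15 shards" figure read as shards per GB of heap. The function and parameter names are hypothetical, not part of any Elastic tooling.

```python
import math


def retained_volume(events_per_second=1000, event_kb=1, retention_days=90):
    """Return (events per day, GB per day, TB retained) using 1024-based units."""
    events_per_day = events_per_second * 60 * 60 * 24
    gb_per_day = events_per_day * event_kb / 1024 ** 2
    tb_retained = gb_per_day * retention_days / 1024
    return events_per_day, gb_per_day, tb_retained


def plan_data_nodes(data_tb=7, shard_gb=10, replicas=1,
                    heap_gb=16, shards_per_heap_gb=15):
    """Return (primary shards, total shards, data nodes) for the given targets."""
    primaries = (data_tb * 1024) // shard_gb        # whole shards only
    total_shards = primaries * (1 + replicas)       # primaries plus their replicas
    shards_per_node = heap_gb * shards_per_heap_gb  # assumed per-node shard budget
    nodes = math.ceil(total_shards / shards_per_node)
    return primaries, total_shards, nodes


events_per_day, gb_per_day, tb_retained = retained_volume()
print(f"{events_per_day:,} events/day, {gb_per_day:.1f} GB/day, {tb_retained:.2f} TB retained")
print(plan_data_nodes())
```

Changing `shard_gb` or `shards_per_heap_gb` shows how sensitive the node count is to the rules of thumb; real indexed data will also take more space than the raw event size, as the slides note.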
52. More about shard sizing
• https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
• https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
53. Time-Based Data
• Logs, social media streams, time-based events
• Timestamp + Data
• Do not change
• Typically search for recent events
• Older documents become less important
• Hard to predict the data size
• How do we handle all of this in terms of indices?
57. Templates
• Every newly created index whose name starts with 'logs-' will have
‒ 2 shards
‒ 1 replica (for each primary shard)
‒ a 60-second refresh interval
PUT _template/logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "refresh_interval": "60s"
  }
}
More on that later
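To make the effect of the template concrete, here is a small sketch of how a wildcard pattern decides which new indices pick up the settings. It uses Python's `fnmatchcase` as a stand-in for Elasticsearch's own pattern matching (an assumption: the two agree for a simple trailing `*`, but `fnmatch` also understands `?` and `[]`), and the index names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Mirror of the template on the slide; the dict layout itself is illustrative.
LOGS_TEMPLATE = {
    "pattern": "logs-*",
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1,
        "refresh_interval": "60s",
    },
}


def settings_for(index_name, template=LOGS_TEMPLATE):
    """Return the template settings if the index name matches, else an empty dict."""
    if fnmatchcase(index_name, template["pattern"]):
        return dict(template["settings"])
    return {}


for name in ("logs-2018.11.23", "logs-000001", "metrics-2018.11.23"):
    print(name, settings_for(name))
```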
61. Rollover API
• Create a new index when a condition is met
‒ document count
‒ index age OR size
PUT /logs-000001
{
  "aliases": {
    "logs_write": {}
  }
}
# Add > 1000 documents to logs-000001
POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000,
    "max_size": "5gb"
  }
}
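The rollover decision above can be sketched as a simple "any condition met" check, together with the way the numeric suffix of the index name advances. This is an illustration only: the function names are hypothetical, the thresholds mirror the request on the slide, and treating each limit as "greater than or equal" is a simplifying assumption.

```python
import re


def should_rollover(age_days, doc_count, size_gb,
                    max_age_days=7, max_docs=1000, max_size_gb=5):
    """Roll over as soon as any one of the conditions is met."""
    return (age_days >= max_age_days
            or doc_count >= max_docs
            or size_gb >= max_size_gb)


def next_index_name(current):
    """Increment the zero-padded numeric suffix, e.g. logs-000001 -> logs-000002."""
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError(f"index name has no numeric suffix: {current!r}")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1:0{len(number)}d}"


if should_rollover(age_days=2, doc_count=1500, size_gb=0.3):
    print("roll over to", next_index_name("logs-000001"))
```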
62. Rollover API
• Today it can be automated through Curator
• Tomorrow it will be part of Index Lifecycle Management
63. Do not Overshard
• 3 different logs
• 1 index per day each
• 1 GB each
• 5 shards (default)
• 6 months retention
• ~900 shards for just 180 GB of data per log type (~2,700 shards for ~540 GB across all three)
[Diagram: cluster my_cluster with daily indices access-..., application-... and mysql-..., each split into many small shards]
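A quick way to see the oversharding problem is to multiply the numbers out. The sketch below assumes 180 days for "6 months" and counts primary shards only; the function name is hypothetical.

```python
def shard_footprint(log_types, retention_days, shards_per_index, gb_per_index):
    """Return (primary shards, total GB, average MB per shard) for daily indices."""
    indices = log_types * retention_days
    shards = indices * shards_per_index
    total_gb = indices * gb_per_index
    avg_shard_mb = total_gb * 1024 / shards
    return shards, total_gb, avg_shard_mb


# One log type: ~900 shards for 180 GB, i.e. roughly 205 MB per shard.
print(shard_footprint(log_types=1, retention_days=180, shards_per_index=5, gb_per_index=1))
# All three log types together.
print(shard_footprint(log_types=3, retention_days=180, shards_per_index=5, gb_per_index=1))
```

Either way the average shard is a few hundred megabytes, far below the gigabyte-scale shards suggested by the rules of thumb.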
71. Shrink API
• Shrink an existing index into a new one with fewer primaries
• Index must be marked as read-only
• A copy of every source index shard (primary or replica) must be on the same node
PUT /my_source_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "my_node",
    "index.blocks.write": true
  }
}
72. Shrink API
• Target index # of primaries must be a factor of source # of primaries
• 15 primaries? Shrink to 3, 5 or 1
• Example shrinking down to 1 primary with replica factor 1
POST my_source_index/_shrink/my_target_index
{
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  }
}
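The "must be a factor" rule is easy to check before issuing a shrink request. A minimal sketch, with a hypothetical helper name, that lists the primary counts a source index could be shrunk to:

```python
def valid_shrink_targets(source_primaries):
    """Return every primary count smaller than the source that divides it evenly."""
    return [n for n in range(1, source_primaries)
            if source_primaries % n == 0]


print(valid_shrink_targets(15))  # the slide's example: 1, 3 or 5
print(valid_shrink_targets(8))
```

An index with a prime number of primaries can only be shrunk to a single shard, which is one reason to pick a primary count with several factors when a later shrink is planned.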
73. Undersharded?
• Remember we write only to primaries
[Diagram: cluster my_cluster with three master nodes (servers 1-3) and four data nodes (servers 4-7); d5 and d1 appear on data nodes 1 and 2, while data nodes 3 and 4 show none]
75. Split API
• The inverse operation compared to the Shrink API
• Follows similar requirements
76. Post Splitting
• Remember we write only to primaries
[Diagram: cluster my_cluster with three dedicated master nodes (servers 1-3) and four data nodes (servers 4-7); the d5/d1 pair now appears four times instead of twice]
84. It depends...
• on your application (language, libraries, ...)
• document size (100b, 1kb, 100kb, 1mb, ...)
• number of nodes
• node size
• number of shards
• shards distribution
91. Clients
• Most client APIs implement round robin
‒ you specify a seed list
‒ the client sniffs the cluster
‒ the client implements different selectors
• Logstash allows an array
‒ it can sniff the cluster
• Beats allows an array, with no sniffing
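Round-robin selection over a seed list can be sketched in a few lines. This is not how any particular Elasticsearch client is implemented; the class name and the host addresses are hypothetical, and sniffing (discovering further nodes from the cluster) is left out.

```python
from itertools import cycle


class RoundRobinSelector:
    """Hand out hosts from a fixed seed list in rotation."""

    def __init__(self, hosts):
        hosts = list(hosts)
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(hosts)

    def next_host(self):
        return next(self._hosts)


selector = RoundRobinSelector(["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"])
for _ in range(4):
    print(selector.next_host())
```

Spreading requests this way keeps a single node from becoming the coordinating bottleneck during an indexing peak.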
97. Increasing write throughput - Knobs to turn
• Don’t need data immediately searchable?
‒ increase refresh_interval to 30s or 60s
‒ defaults to 1s
• Heavy indexing on a (hot) data node?
‒ consider increasing the indexing buffer (divided among all active shards)
‒ defaults to 10% of total heap
‒ increase index.translog.flush_threshold_size (defaults to 512mb)
• Can afford data loss on node hardware failure?
‒ set index.translog.durability to async (defaults to request)
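As a rough illustration of these knobs, the sketch below assembles an index settings body and estimates how much indexing buffer each active shard gets. The even split across shards is a simplification of the slide's "divided by all active shards", the `1gb` flush threshold is an arbitrary example value rather than a recommendation, and the function name is hypothetical.

```python
import json


def per_shard_index_buffer_mb(heap_gb, active_shards, buffer_fraction=0.10):
    """Estimate the indexing buffer per active shard, assuming an even split."""
    return heap_gb * 1024 * buffer_fraction / active_shards


# Index-level settings named on the slide, tuned for write throughput.
write_heavy_settings = {
    "index": {
        "refresh_interval": "30s",              # default is 1s
        "translog": {
            "durability": "async",              # default is request
            "flush_threshold_size": "1gb",      # example value; default is 512mb
        },
    }
}

print(json.dumps(write_heavy_settings, indent=2))
print(f"{per_shard_index_buffer_mb(heap_gb=16, active_shards=20):.2f} MB per shard")
```

With a 16 GB heap and the default 10% buffer, 20 actively indexing shards share about 1.6 GB, which is why fewer active shards per hot node leave more buffer for each one.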
98. We are hiring
• Work with a disruptive technology
• Engineering not an afterthought
• Diverse, inclusive and thriving environment
• High level of independence
• Work from anywhere (yes)
elastic.co/careers