By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing data for control
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
SignalFx engineer Rajiv Kurian's presentation on why we wrote our own Kafka consumer, the performance goals, and the performance gains achieved.
Download the slides to see animations showing hardware details. These slides were converged from Keynote to Powerpoint, so there may be some oddness with slide transitions!
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
SignalFx ingests, processes runs analytics against, (and ultimately stores) massive numbers of time series streaming in parallel into our service which provides an analytics-based monitoring platform for modern applications.
We've chose to build our time series database (TSDB) on Cassandra for it's read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
SignalFx Elasticsearch Metrics Monitoring and AlertingSignalFx
From our Feb 25, 2016 webcast on operating Elasticsearch at scale, the metrics to monitor, and how to create low-noise meaningful alerts on Elasticsearch performance.
Performance optimization 101 - Erlang Factory SF 2014lpgauth
In order to scale AdGear's real-time bidding (RTB) platform, we've been optimizing system and application performance. In this talk, I'll share all my findings, from basic tips to advanced topics. I'll cover coding patterns, metric collection, tracing and much more!
Talk objectives:
- Share basic tips, common pitfalls, efficient coding patterns
- Introduce tooling for metric collection (statsderl, vmstats, system-stats, riak_sysmon)
- Highlight erlang trace (fprof, flame graphs)
- Expose advanced topics (VM tuning, lock-counter, systemtap, cpuset)
Target audience:
- Anyone looking to improve the performance of their Erlang application.
SignalFx engineer Rajiv Kurian's presentation on why we wrote our own Kafka consumer, the performance goals, and the performance gains achieved.
Download the slides to see animations showing hardware details. These slides were converged from Keynote to Powerpoint, so there may be some oddness with slide transitions!
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
SignalFx ingests, processes runs analytics against, (and ultimately stores) massive numbers of time series streaming in parallel into our service which provides an analytics-based monitoring platform for modern applications.
We've chose to build our time series database (TSDB) on Cassandra for it's read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
SignalFx Elasticsearch Metrics Monitoring and AlertingSignalFx
From our Feb 25, 2016 webcast on operating Elasticsearch at scale, the metrics to monitor, and how to create low-noise meaningful alerts on Elasticsearch performance.
Performance optimization 101 - Erlang Factory SF 2014lpgauth
In order to scale AdGear's real-time bidding (RTB) platform, we've been optimizing system and application performance. In this talk, I'll share all my findings, from basic tips to advanced topics. I'll cover coding patterns, metric collection, tracing and much more!
Talk objectives:
- Share basic tips, common pitfalls, efficient coding patterns
- Introduce tooling for metric collection (statsderl, vmstats, system-stats, riak_sysmon)
- Highlight erlang trace (fprof, flame graphs)
- Expose advanced topics (VM tuning, lock-counter, systemtap, cpuset)
Target audience:
- Anyone looking to improve the performance of their Erlang application.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...InfluxData
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Telegraf
Network to Code, LLC is a network automation solution provider that helps companies transform the way their networks are deployed, managed, and consumed on a day-to-day basis by leveraging network automation, software development, and DevOps technologies and principles. They provide highly sought-after training and consulting services that integrate and deploy network automation technology solutions to improve reliability, security, efficiency, time to market, and customer satisfaction while reducing operational costs.
In this session Josh VanDeraa and David Flores from Network to Code will present how to monitor your network devices with Telegraf using both the SNMP and the gNMI input plugins. They will also present what the challenges are with ingesting the same type of data from different sources and how to remediate that by normalizing the data in Telegraf using processors.
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward
This session will introduce a new open-source project - Flink TensorFlow - that enables Flink programs to operate on data using TensorFlow machine learning models. Applications include real-time image processing, NLP, and anomaly detection. The session will: - Introduce TensorFlow and describe its component model which allows for model reuse across environments - Demonstrate how to use TensorFlow models in Flink ML and Flink Streaming environments - Present a roadmap and provide opportunities to contribute
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginInfluxData
Dylan Ferreira from FuseMail will share how they use the Syslog Telegraf plugin to help them troubleshoot their systems faster and with more success. Dylan will go over how to set up Rsyslog and Telegraf to filter logs then configure Kapacitor to help you look for interesting things in your raw logs to trigger alerts to your team. He will then bring this all together in a dashboard for your teams to use.
Development and Applications of Distributed IoT Sensors for Intermittent Conn...InfluxData
What do electric power sensing IoT devices, large area electric field surveys and an array with hundreds of data channels have in common? They’re all built using an IoT stack fueled by InfluxDB and designed to run in environments of intermittent network connectivity.
In the operational environments where U.S. Soldiers operate, network connectivity is not ensured due to jamming, intermittent 4G signals, or paperwork. To address these issues, the United States Army Research Laboratory runs InfluxDB in both the cloud and on the IoT device. When connectivity is available, the most recent data are replicated to the cloud with historical data replicated as possible. This allows them to design products that can leverage the cloud, but aren’t tied to it. As a result, they have been able to develop electric power monitors for installations and microgrids, strap sensors to vehicles for large area surveys, and combine sensors into arrays.
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
Let’s be honest: Running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second while being highly available and correct (in an exactly-once sense) does not work without any planning, configuration and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective: - Capacity and resource planning: Understand the theoretical limits. - Memory and CPU configuration: Distribute resources according to your needs. - Setting up High Availability: Planning for failures. - Checkpointing and State Backends: Ensure correctness and fast recovery For each of the listed topics, we will introduce the concepts of Flink and provide some best practices we have learned over the past years supporting Flink users in production.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
Virtual training Intro to InfluxDB & TelegrafInfluxData
How to setup InfluxDB & Telgraf to pull metrics into your InfluxDB. An introduction to querying data with InfluxQL. Learn more and download the open source version of Telegraf now: https://www.influxdata.com/time-series-platform/telegraf/
Strategies and techniques to optimize Kafka brokers and producers to minimize data loss under huge traffic volume, limited configuration options, less ideal and constant changing environment and balance against cost.
Using eBPF to Measure the k8s Cluster HealthScyllaDB
As a k8s cluster-admin your app teams have a certain expectation of your cluster to be available to deploy services at any time without problems. While there is no shortage on metrics in k8s its important to have the right metrics to alert on issues and giving you enough data to react to potential availability issues. Prometheus has become a standard and sheds light on the inner behaviour of Kubernetes clusters and workloads. Lots of KPIs (CPU, IO, network. Etc) in our On-Premise environment are less precise when we start to work in a Cloud environment. Ebpf is the perfect technology that fulfills that requirement as it gives us information down to the kernel level. In 2018 Cloudflare shared an opensource project to expose custom ebpf metrics in Prometheus. Join this session and learn about: • What is ebpf? • What type of metrics we can collect? • How to expose those metrics in a K8s environment. This session will try to deliver a step-by-step guide on how to take advantage of the ebpf exporter.
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpins the data layer of the stack providing capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters. In this presentation, we will reveal how we architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.
This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
Modern data systems don't just process massive amounts of data, they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework and even talk about deployment options.
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15SignalFx
SignalFx engineer Paul Ingram presented these slides at Cassandra Summit 2015.
SignalFx ingests, processes runs analytics against, (and ultimately stores) massive numbers of time series streaming in parallel into our service which provides an analytics-based monitoring platform for modern applications.
We've chose to build our time series database (TSDB) on Cassandra for it's read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
Read more: http://blog.signalfx.com/making-cassandra-perform-as-a-tsdb
AWS Loft Talk: Behind the Scenes with SignalFxSignalFx
Slides from SignalFx CTO Phillip Liu's presentation at the AWS Loft in SF after DockerCon: Behind the Scenes with SignalFx.
Phil discussed how SignalFx deploys, runs, and operates a completely Dockerized microservices architecture for a production SaaS application dealing with large volumes of high resolution customer data.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te...InfluxData
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Telegraf
Network to Code, LLC is a network automation solution provider that helps companies transform the way their networks are deployed, managed, and consumed on a day-to-day basis by leveraging network automation, software development, and DevOps technologies and principles. They provide highly sought-after training and consulting services that integrate and deploy network automation technology solutions to improve reliability, security, efficiency, time to market, and customer satisfaction while reducing operational costs.
In this session Josh VanDeraa and David Flores from Network to Code will present how to monitor your network devices with Telegraf using both the SNMP and the gNMI input plugins. They will also present what the challenges are with ingesting the same type of data from different sources and how to remediate that by normalizing the data in Telegraf using processors.
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward
This session will introduce a new open-source project - Flink TensorFlow - that enables Flink programs to operate on data using TensorFlow machine learning models. Applications include real-time image processing, NLP, and anomaly detection. The session will: - Introduce TensorFlow and describe its component model which allows for model reuse across environments - Demonstrate how to use TensorFlow models in Flink ML and Flink Streaming environments - Present a roadmap and provide opportunities to contribute
Finding OOMS in Legacy Systems with the Syslog Telegraf PluginInfluxData
Dylan Ferreira from FuseMail will share how they use the Syslog Telegraf plugin to help them troubleshoot their systems faster and with more success. Dylan will go over how to set up Rsyslog and Telegraf to filter logs then configure Kapacitor to help you look for interesting things in your raw logs to trigger alerts to your team. He will then bring this all together in a dashboard for your teams to use.
Development and Applications of Distributed IoT Sensors for Intermittent Conn...InfluxData
What do electric power sensing IoT devices, large area electric field surveys and an array with hundreds of data channels have in common? They’re all built using an IoT stack fueled by InfluxDB and designed to run in environments of intermittent network connectivity.
In the operational environments where U.S. Soldiers operate, network connectivity is not ensured due to jamming, intermittent 4G signals, or paperwork. To address these issues, the United States Army Research Laboratory runs InfluxDB in both the cloud and on the IoT device. When connectivity is available, the most recent data are replicated to the cloud with historical data replicated as possible. This allows them to design products that can leverage the cloud, but aren’t tied to it. As a result, they have been able to develop electric power monitors for installations and microgrids, strap sensors to vehicles for large area surveys, and combine sensors into arrays.
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
Let’s be honest: Running a distributed stateful stream processor that is able to handle terabytes of state and tens of gigabytes of data per second while being highly available and correct (in an exactly-once sense) does not work without any planning, configuration and monitoring. While the Flink developer community tries to make everything as simple as possible, it is still important to be aware of all the requirements and implications In this talk, we will provide some insights into the greatest operations mysteries of Flink from a high-level perspective: - Capacity and resource planning: Understand the theoretical limits. - Memory and CPU configuration: Distribute resources according to your needs. - Setting up High Availability: Planning for failures. - Checkpointing and State Backends: Ensure correctness and fast recovery For each of the listed topics, we will introduce the concepts of Flink and provide some best practices we have learned over the past years supporting Flink users in production.
LINCX is an OpenFlow switch written in Erlang and running on LING (Erlang on Xen). It shows some remarkable performance. The presentation discusses various speed-related optimizations.
Virtual training Intro to InfluxDB & TelegrafInfluxData
How to setup InfluxDB & Telgraf to pull metrics into your InfluxDB. An introduction to querying data with InfluxQL. Learn more and download the open source version of Telegraf now: https://www.influxdata.com/time-series-platform/telegraf/
Strategies and techniques to optimize Kafka brokers and producers to minimize data loss under huge traffic volume, limited configuration options, less ideal and constant changing environment and balance against cost.
Using eBPF to Measure the k8s Cluster HealthScyllaDB
As a k8s cluster-admin your app teams have a certain expectation of your cluster to be available to deploy services at any time without problems. While there is no shortage on metrics in k8s its important to have the right metrics to alert on issues and giving you enough data to react to potential availability issues. Prometheus has become a standard and sheds light on the inner behaviour of Kubernetes clusters and workloads. Lots of KPIs (CPU, IO, network. Etc) in our On-Premise environment are less precise when we start to work in a Cloud environment. Ebpf is the perfect technology that fulfills that requirement as it gives us information down to the kernel level. In 2018 Cloudflare shared an opensource project to expose custom ebpf metrics in Prometheus. Join this session and learn about: • What is ebpf? • What type of metrics we can collect? • How to expose those metrics in a K8s environment. This session will try to deliver a step-by-step guide on how to take advantage of the ebpf exporter.
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpins the data layer of the stack providing capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters. In this presentation, we will reveal how we architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amount of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.
This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping of Kubernetes containers to physical nodes to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
Modern data systems don't just process massive amounts of data, they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework and even talk about deployment options.
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15SignalFx
SignalFx engineer Paul Ingram presented these slides at Cassandra Summit 2015.
SignalFx ingests, processes runs analytics against, (and ultimately stores) massive numbers of time series streaming in parallel into our service which provides an analytics-based monitoring platform for modern applications.
We've chose to build our time series database (TSDB) on Cassandra for it's read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date and some steps we'll be taking in the future.
Read more: http://blog.signalfx.com/making-cassandra-perform-as-a-tsdb
AWS Loft Talk: Behind the Scenes with SignalFxSignalFx
Slides from SignalFx CTO Phillip Liu's presentation at the AWS Loft in SF after DockerCon: Behind the Scenes with SignalFx.
Phil discussed how SignalFx deploys, runs, and operates a completely Dockerized microservices architecture for a production SaaS application dealing with large volumes of high resolution customer data.
Maxime Petazzoni, Software Engineer at SignalFx, presents how we use Docker and how we monitor containers in production.
SignalFx has been using using Docker since November 2013. We have running Docker in prod ever since we’ve had a “prod” and back when Docker’s README said “DO NOT RUN IN PRODUCTION”.
Microservices and Devs in Charge: Why Monitoring is an Analytics ProblemSignalFx
Presented at GlueCon 2015.
This presentation discusses SignalFx CTO and co-founder Phillip Liu's experience operating infrastructure and apps at massive scale and what drove the realization that monitoring is fundamentally an analytics problem now. Following on the heels of Adrian Cockroft's keynote that morning, Monitoring Microservices and Containers, this presentation went over real world examples of how modern monitoring for microservices wroks.
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...SignalFx
Zenefits principal engineer Venkat Thiruvengadam and SignalFx engineer Maxime Petazzoni discuss operationalizing Docker at scale. Learn about the transition to a well-conceived microservices approach, the tools chosen to support these services, and the lessons learned from monitoring containers in production in a high-performance environment.
Go debugging and troubleshooting tips - from real life lessons at SignalFxSignalFx
Exploring tips and advice on writing production Go systems that are easy to debug and troubleshoot. Jack Lindamood from SignalFx presents patterns that facilitate this process.
Jack addresses tools built into Go you can take advantage of, build process techniques they've learned over time, and open source tools and libraries you can use that help troubleshoot your production code when things go wrong.
Read more here: http://blog.signalfx.com/a-pattern-for-optimizing-go
Storing time series data with Apache CassandraPatrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice now you can learn how to do it. We'll look at possible data models and the the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at AlibabaMichael Stack
Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Emerging technologies /frameworks in Big DataRahul Jain
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
By leveraging memory-mapped files, Speedment and the Chronicle Engine supports large Java maps that easily can exceed the size of your server’s RAM.Because the Java maps are mapped onto files, these maps can be shared instantly between several microservice JVMs and new microservice instances can be added, removed, or restarted very quickly. Data can be retrieved with predictable ultralow latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive.” The mapped files can be tens of terabytes, which has been done in real-world deployment cases, and a large number of micro services can share these maps simultaneously. Learn more in this session.
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Speedment, Inc.
By leveraging memory-mapped files, Speedment and the Chronicle Engine supports large Java maps that easily can exceed the size of your server’s RAM.Because the Java maps are mapped onto files, these maps can be shared instantly between several microservice JVMs and new microservice instances can be added, removed, or restarted very quickly. Data can be retrieved with predictable ultralow latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive.” The mapped files can be tens of terabytes, which has been done in real-world deployment cases, and a large number of micro services can share these maps simultaneously. Learn more in this session.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
InfluxEnterprise Architecture Patterns by Tim Hall & Sam DillardInfluxData
In this InfluxDays NYC 2019 presentation, InfluxData VP of Products Tim Hall and Sales Engineer Sam Dillard discuss architecture patterns with InfluxEnterprise time series platform. They cover an overview of InfluxEnterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice. Presentation highlights include InfluxEnterprise cluster architecture and how to determine if you're ready for adopting InfluxEnterprise.
Scaling HDFS to Manage Billions of FilesHaohui Mai
In this talk, we share our experience on designing and implementing a next-generation, scale-out architecture for HDFS. Particularly, we implement the namespace on top of a key-value store. Key-value stores are highly scalable and can be scaled out on demand. Our current prototype shows that under the new architecture the namespace of HDFS scales 10x better than the current generation with no performance loss, demonstrating that HDFS is capable of storing billions of files using the current hardware.
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
Dean discusses architecture patterns with InfluxDB Enterprise, covering an overview of InfluxDB Enterprise, features, ingestion and query rates, deployment examples, replication patterns, and general advice.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent it comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
NYJavaSIG - Big Data Microservices w/ SpeedmentSpeedment, Inc.
JAVA MICROSERVICES FOR BIG DATA WITH LOW LATENCY - Per-Ake Minborg, CTO Speedment
By leveraging on memory mapped files (eg. Hazelcast, ChronicleMaps etc.), Speedment supports large Java Maps that easily can exceed the size of your server’s RAM. Because the Java Maps are mapped onto files, these maps can be shared instantly between several microservice JVMs and new microservice instances can be added, removed or restarted very quickly. Data can be retrieved with predictable ultra-low latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive”. The mapped files can be terabytes which has been done in real world deployment cases and there can be a large number of microservices that shares these maps simultaneously.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question whether Hadoop 3.0 will be an architectural revolution like Hadoop 2 was with YARN & Co. or will it be more of an evolution adapting to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
3. Agenda
1. Why we need to scale ingest
2. Basic properties and limitations of modern
hardware
3. Optimization techniques inspired by HPC
4. Results!
5. Q&A (hopefully!)
5. • High resolution:
• Up to 1 sec
• Streaming analytics:
• Charts/analytics update @1sec
• Real time
• Multidimensional metrics:
• Dimensions : representing customer, server etc
• Filter, aggregate : 99th-pct-latency-by-service,customer
SignalFx is an advanced monitoring platform for modern applications
7. SignalFx ingest library
Raw data in Rollup data out
TimeSeries 0 rollup
TimeSeries 1 rollup
TimeSeries 2 rollup
TimeSeries 3 rollup
TimeSeries 4 rollup
TimeSeries 5 rollup
TimeSeries 6 rollup
TimeSeries 7 rollup
TimeSeries 8 rollup
8. Issues identified (before applying HPC techniques)
• Expensive - too many servers
• Exhibits parallel slow down
• More threads = worse performance
• What did the profile say?
• Death by a thousand cuts
• The core library = 35% of profile
11. Cache Lines
• Data is transferred between memory and cache in blocks of
fixed size, called cache lines. Usually 64 bytes
• When the processor needs to read or write a location in
main memory, it first checks for a corresponding entry in the
cache. In the case of:
• a cache hit, the processor immediately reads or writes
the data in the cache line
• a cache miss, the cache allocates a new entry and
copies in data from main memory, then the request (read
or write) is fulfilled from the contents of the cache
• The memory subsystem makes two kinds of bets to help us:
• Temporal locality
• Spatial locality
12. Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
L1 Cache 0.5ns
Branch mispredict 5 ns
L2 Cache 7 ns 14x L1 Cache
Mutex lock/unlock 25 ns
Main memory 100 ns 20x L2 Cache, 200x L1 Cache
Compress 1K bytes (Zippy) 3,000 ns
Send 1K bytes over 1Gbps 10,000 ns 0.01 ms
Read 4K randomly from SSD 150,000 ns 0.15 ms
Read 1MB sequentially from memory 250,000 ns 0.25 ms
Round trip within same DC 500,000 ns 0.5 ms
Read 1MB sequentially from SSD 1,000,000 ns 1 ms 4x memory
Disk seek 10,000,000 ns 10 ms 20x DC roundtrip
Read 1MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20x SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
19. SignalFx library benchmark
Rollup data out
Key Value
ID 0 TimeSeries rollup 0
ID 1 TimeSeries rollup 1
ID 2 TimeSeries rollup 2
ID 3 TimeSeries rollup 3
ID 4 TimeSeries rollup 4
….. …..
….. …..
Key 1M TimeSeries rollup 1M
Raw data in,
in random order,
one per Time Series.
50x
20. SignalFx library benchmark
Rollup data out
Key Value
ID 0 TimeSeries rollup 0
ID 1 TimeSeries rollup 1
ID 2 TimeSeries rollup 2
ID 3 TimeSeries rollup 3
ID 4 TimeSeries rollup 4
….. …..
….. …..
Key 1M TimeSeries rollup 1M
Raw data in,
in random order,
one per Time Series.
50x
35% of the profile of
the entire application
21. SignalFx
Techniques inspired by HPC that have
improved our pipeline
Single threaded, event based architectures:
parallelize by running multiple copies of
single threaded code
22. Single threaded event based architectures
• Threads work on their own private
data (as much as possible)
• Communicate with other threads
using events/messages
23. SignalFx
local data
Network In thread Processor thread(s) Network out thread
Receive data
Process data
Write batched
data
Events
Events
Key Value
key 1 value 1
key 2 value 2
key 3 value 3
key 4 value 4
24. SignalFx
Network In thread Processor thread(s) Network out thread
Receive data
Process data Write batched
data
local data
Ring
Buffer
Ring
Buffer
Key Value
key 1 value 1
key 2 value 2
key 3 value 3
key 4 value 4
25. Single threaded event based architectures advantages
• It enables many other optimal choices like
• Compact array based data structures
• Buffer/object re-use
• Loosely coupled - easy to test
• Run multiple copies for parallelism
26. SignalFx
Ring
Buffer
Network In thread Worker thread(s) Network out thread
Receive data
Process data Write batched
data
local data
1
2
3
4
Ring
Buffer
Ring
Buffer
Key Value
key 1 value 1
key 2 value 2
key 3 value 3
key 4 value 4
27. SignalFx
Worker thread
Receive data using
Async IO
Process data
synchronously
Write data using
Async IO
Receive data using
Async IO
Process data
synchronously
Write data using
Async IO
Receive data using
Async IO
Process data
synchronously
Write data using
Async IO
Worker thread Worker thread
local data local data local data
Key Value
key 5 value 5
key 6 value 6
key 7 value 7
key 8 value 8
Key Value
key 1 value 1
key 2 value 2
key 3 value 3
key 4 value 4
Key Value
key 9 value 9
key 10 value 10
key 11 value 11
key 12 value 12
28. SignalFx
Ring
Buffer
Network thread Processor thread(s) Async IO thread
Receive data
Process data
Batched
IO calls
local data
1
2
3
4
Ring
Buffer
Ring
Buffer
5
Ring
Buffer
6
7
Key Value
key 1 value 1
key 2 value 2
key 3 value 3
key 4 value 4
29. Advice for threaded applications
• Threads should ideally reflect the actual
parallelism of the system.
• Avoid gratuitous over subscribing
• Exception: IO threads?
• DO NOT communicate unless you have to
30. SignalFx
Techniques inspired by HPC that have
improved our pipeline
Use compact, cache-conscious, array based
data structures with minimal indirection
32. Basic principles
• Strive for smaller data structures
• Extra computation is ok
• E.g. Compressing network data
• Design data structures that facilitate
processing multiple entries—big
arrays!
• Layout should reflect access patterns
33. Hash maps
• Hash maps look ups are NOT free!
• A lookup in a well implemented hash
map is by definition a cache miss
• Popular implementations like
java.util.HashMap can cause multiple
cache misses
41. Array of co-located key/value
Key Value
Key 0 Value 0
Key 1 Value 1
Key 2 Value 2
Key 3 Value 3
Key 4 Value 4
Key 5 Value 5
Key 6 Value 6
Key 7 Value 7
42. Cache misses with no collision
Key Value
Key 0 Value 0
Key 1 Value 1
Key 2 Value 2
Key 3 Value 3
Key 4 Value 4
Key 5 Value 5
Key 6 Value 6
Key 7 Value 7
1
43. Cache misses with collisions
Key Value
Key 0 Value 0
Key 1 Value 1
Key 2 Value 2
Key 3 Value 3
Key 4 Value 4
Key 5 Value 5
Key 6 Value 6
Key 7 Value 7
1
2
44. Hash map of key to index to an array of structs
Value 0
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Key Index
Key 0 1
Key 1 6
Key 2 4
Key 3 8
45. Cache misses with collision
Value 0
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Key Index
Key 0 1
Key 1 6
Key 2 4
Key 3 8
1
46. Cache misses with collision
Value 0
Value 1
Value 2
Value 3
Value 4
Value 5
Value 6
Value 7
Value 8
Key Index
Key 0 1
Key 1 6
Key 2 4
Key 3 8
1 2
47. New library memory layout
TimeSeries rollup 0
TimeSeries rollup 1
TimeSeries rollup 2
TimeSeries rollup 3
TimeSeries rollup 4
TimeSeries rollup 5
TimeSeries rollup 6
TimeSeries rollup 7
TimeSeries rollup 8
ID Index
ID 0 1
ID 1 6
ID 2 4
ID 3 8
Raw data in Rollup out
48. Changing hash map implementations
• java.util.HashMap (uses separate chaining and boxes
primitives) to make a long -> int lookup
• Allocations galore
• net.openhft.koloboke primitive open hash map
• 45% improvement
For the JVM use libraries like https://github.com/
OpenHFT/Koloboke.
For C++ try https://github.com/preshing/
CompareIntegerMaps or similar.
49. Access patterns
Hot data
Cold data
object 0
object 1
object 2
object 3
Field 0 Field 1 Field 2 Field 3 Field 4
Field 0 Field 1 Field 2 Field 3 Field 4
Field 0 Field 1 Field 2 Field 3 Field 4
Field 0 Field 1 Field 2 Field 3 Field 4
50. Group fields accessed together
Hot fields
Cold fields
object 1
Field 0 Field 1 Field 2
Field 0 Field 1 Field 2
Field 0 Field 1 Field 2
Field 0 Field 1 Field 2
Field 3 Field 4
Field 3 Field 4
Field 3 Field 4
Field 3 Field 4
51. Results of separating hot and cold data
A hot loop run about once every 500 ms
• Old - Hot and cold data kept together
• 5 cache lines per time series
• Took anywhere between 62-70 ms
• New - Hot and cold data kept separate
• 3 cache lines of hot data per time series
• Took anywhere between 40-45 ms
• 35% improvement
53. Old vs New
• Concurrent -> single threaded
• Locks gone
• Array based data structures
• Zero allocations
• Extensive batching and hardware prefetching
• Multiple hash maps -> a single hash map look up
62. 35% of the profile but 3.4x improvement?
• Amdahl’s law
• Max 1.54x improvement if 35% => 0%
• Why 3.4x ?
• When you use less cache, you leave more for
others - thus speeding up other code too
• Lesson
• A profiler is a necessary tool, but not a substitute for
informed design
77. Potential layout after GC
int
int
int
int
Object B
Object B1 Object B2
B1
B2
B (header)
B1*
B2*
Other data
B1 (header)
int
int
Other data
B2 (header
int
int
80. What the control and data planes do
In networking terminology:
• Data plane - Defines the part that
decides what to do with packets
arriving on an inbound interface—
Frequent
• Control plane - Defines the part that is
concerned with drawing the network
map or routing table—Infrequent
81. The goal of control and data plane separation
DO NOT slow the frequent path because
of the infrequent path
82. Runtime configuration variables
Worker threadConfiguration variables
(volatile/atomic)
Setter thread
while (1) {
process_data_using_configuration_variables();
}
Flag 0
Flag 1
Flag 2
Flag 3
83. Flag 0
Flag 1
Flag 2
Flag 3
Runtime configuration variables
Worker threadConfiguration variables
(volatile/atomic)
Setter thread
Flag 0
Flag 1
Flag 2
Flag 3
while (1) {
cache_configuration_variables();
process_a_ton_of_stuff();
}
Cached configuration
variables
84. Volatile/atomic flag vs cached local flag
• All run time flags (used on every data point) are
volatile/atomic loads
• All run time flags are cached and refreshed on each
run loop
• About 8% improvement in datapoint/second. Others
might see more or less