MySQL performance monitoring using Statsd and Graphite (PLUK2013)
Note: this is a placeholder for the presentation next Tuesday at Percona Live London.
This document provides an introduction and overview of StatsD, including:
- A brief history of StatsD, originally conceived at Flickr and later implemented and popularized by Etsy.
- An overview of the StatsD architecture which involves sending metrics from applications over UDP to the StatsD server, which then sends the data to Carbon over TCP.
- An explanation of the different metric types StatsD supports - counters, gauges, sets, and timings - and examples of common use cases.
- Instructions for installing and running a StatsD server as well as examples of using StatsD clients in Node.js and Java applications.
Initially presented at OpenWest 2014 conference.
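The four metric types listed above share one tiny wire format, which is what makes StatsD clients so cheap to write. Below is a minimal sketch of such a client in Python; the helper names are invented, and 8125 is StatsD's conventional default port.

```python
import socket

# StatsD's wire format is "<bucket>:<value>|<type>[|@<sample_rate>]",
# sent as a single UDP datagram so a down collector never blocks the app.
def statsd_packet(bucket, value, metric_type, sample_rate=None):
    packet = f"{bucket}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"
    return packet

def send(packet, host="127.0.0.1", port=8125):
    # Fire-and-forget: UDP offers no delivery guarantee, by design.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(packet.encode("ascii"), (host, port))
    sock.close()

# The four metric types from the overview:
send(statsd_packet("logins", 1, "c"))            # counter
send(statsd_packet("queue.depth", 42, "g"))      # gauge
send(statsd_packet("user.ids", 1337, "s"))       # set (unique values)
send(statsd_packet("db.query_time", 12, "ms"))   # timing, milliseconds
```

Because the packets are plain text, any language with a UDP socket can emit them, which is why client libraries exist for Node.js, Java, and most other ecosystems.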
Graphite and StatsD gather time series data and offer a robust set of APIs to access that data. While the tools are robust, the dashboards are straight from 1992 and alerting off the data is nonexistent. Nark, an open-source project, solves both of these problems. It provides easy-to-use dashboards and readily available alerts and notifications to users. It has been used in production at Lucid Software for almost a year. Related to Nark are the tools required to make Graphite highly available.
(CMP310) Data Processing Pipelines Using Containers & Spot Instances (Amazon Web Services)
It's difficult to find off-the-shelf, open-source solutions for creating lean, simple, and language-agnostic data-processing pipelines for machine learning (ML). This session shows you how to use Amazon S3, Docker, Amazon EC2, Auto Scaling, and a number of open source libraries as cornerstones to build one. We also share our experience creating elastically scalable and robust ML infrastructure leveraging the Spot instance market.
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand... (Altinity Ltd)
The document summarizes a ClickHouse meetup that took place on October 2, 2018 in Paris. It includes an agenda with presentations on ClickHouse and case studies from ContentSquare, Storetail, and Pragma Analytics. The document also provides an introduction to ClickHouse including that it is an open source, real-time, column-oriented database management system developed by Yandex in 2012-2015. It highlights ClickHouse's speed, flexibility, and cost advantages compared to other analytical database systems.
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the necessary pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learned through expensive mistakes.
How to measure everything - a million metrics per second with minimal develop... (Jos Boumans)
Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux will get one or more as well. We grew from zero traffic to several billion requests per day in the span of 2 years, and we did so exclusively in AWS. To make the right decisions in such a volatile environment, we knew that data is everything; without it, you can't possibly make informed decisions. However, collecting it efficiently, at scale, at minimal cost and without burdening developers is a tremendous challenge.
Join me in this session to learn how we overcame this challenge at Krux; I will share with you the details of how we set up our global infrastructure, entirely managed by Puppet, to capture over a million data points every second on virtually every part of the system, including inside the web server, user apps and Puppet itself, for under $2000/month using off-the-shelf open source software and some code we've released as open source ourselves. In addition, I'll show you how you can take (a subset of) these metrics and send them to advanced analytics and alerting tools like Circonus or Zabbix.
This content will be applicable for anyone collecting, or desiring to collect, vast amounts of metrics in a cloud or datacenter setting and making sense of them.
Apache Gearpump is a lightweight, real-time streaming engine that was conceived at Intel in 2014 and became an Apache incubating project in 2016. It uses a message-driven architecture based on Akka actors to provide out-of-order processing, exactly-once semantics, flow control, and fault tolerance. Gearpump supports dynamic DAGs, local and distributed deployments, and compatibility with Storm APIs.
Vladislav Supalov introduces data pipeline architecture and workflow engines like Luigi. He discusses how custom scripts are problematic for maintaining data pipelines and recommends using workflow engines instead. Luigi is presented as a Python-based workflow engine that was created at Spotify to manage thousands of daily Hadoop jobs. It provides features like parameterization, email alerts, dependency resolution, and task scheduling through a central scheduler. Luigi aims to minimize boilerplate code and make pipelines testable, versioning-friendly, and collaborative.
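The scheduling core of an engine like Luigi, resolving declared dependencies into a safe execution order, can be sketched with nothing but the standard library; the task names below are made up for illustration.

```python
from graphlib import TopologicalSorter

# A Luigi-style pipeline as a dependency graph: each task lists the
# tasks whose output it requires, and the scheduler derives the order.
pipeline = {
    "extract_logs": set(),
    "sessionize": {"extract_logs"},
    "train_model": {"sessionize"},
    "report": {"sessionize", "train_model"},
}

def run_order(graph):
    # static_order() raises CycleError on circular dependencies,
    # which is exactly the check hand-rolled cron scripts lack.
    return list(TopologicalSorter(graph).static_order())

print(run_order(pipeline))
```

Real workflow engines layer parameterization, retries, and a central scheduler on top of this, but the dependency resolution itself is just a topological sort like the one shown.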
This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.
This case study describes an approach to building quasi real-time OLAP cubes in Microsoft SQL Server Analysis Services to enable daily comparisons of production forecasts and outcomes. The cubes are partitioned by time to allow independent and frequent updates. Initial attempts failed due to deadlocks from simultaneous partition updates. The working solution takes advantage of a 6 day work week by switching partition dates on Saturdays only and reprocessing partitions then. This allows real-time and historical partition updates without gaps or overlaps in the data.
Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013 (Nick Galbreath)
This document discusses the care and feeding of large scale Graphite installations. It begins with introductions and then discusses Graphite components like carbon-cache, carbon-aggregator, carbon-relay and StatsD. It covers Graphite storage, installation, documentation, middleware, backups, monitoring and the web UI. It provides tips on tuning, debugging and visualizing metrics in Graphite.
This document discusses InfluxDB, an open-source time series database. It stores time stamped numeric data in structures called time series. The document provides an overview of time series data, describes how to install and use InfluxDB, and discusses features like its HTTP API, client libraries, Grafana integration for visualization, and benchmark results showing it has better performance for time series data than other databases.
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App... (InfluxData)
InfluxDB IOx Tech Talks - December 2020
A Rusty Introduction to Apache Arrow and How it Applies to a Time Series Database
This session will start with a tech talk from an InfluxDB IOx team member. This is your chance to interact directly with Influxers who are available to answer your questions about all things InfluxDB IOx and time series — including Paul Dix, Founder and CTO of InfluxData. This event will last about an hour and there will be time for live Q&A.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
A primer on building real time data-driven productsLars Albertsson
This document provides an overview of building real-time data products using stream processing. It discusses why stream processing is useful for providing low-latency reactions to data from 1 second to 1 hour. Key aspects covered include using a unified log to decouple producers and consumers, common stream processing building blocks like filtering and joining, and technologies like Spark Streaming, Kafka Streams, and Flink. The document also addresses challenges like out-of-order events and software bugs, and architectural patterns for handling imperfections in streams.
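Two of the building blocks mentioned above, filtering and keyed joining, can be sketched over plain Python iterables. The function names and the unbounded buffering are illustrative; real stream engines bound this state with windows or TTLs.

```python
from collections import defaultdict

# Stateless building block: drop events that fail a predicate.
def stream_filter(events, predicate):
    for event in events:
        if predicate(event):
            yield event

# Stateful building block: join two streams on a key, emitting a pair
# once both sides for that key have arrived.
def keyed_join(left, right, key):
    waiting = defaultdict(lambda: [None, None])
    for side, event in [(0, e) for e in left] + [(1, e) for e in right]:
        slot = waiting[key(event)]
        slot[side] = event
        if None not in slot:
            yield tuple(slot)

clicks = [{"user": 1, "page": "/a"}, {"user": 2, "page": "/b"}]
buys = [{"user": 2, "amount": 10}]
print(list(keyed_join(clicks, buys, key=lambda e: e["user"])))
# one joined pair, for user 2 only
```

Out-of-order events are exactly why the buffering above cannot stay this naive in production: an event may arrive long after its join partner, so engines attach watermarks and windows to decide when state can be dropped.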
Graphite is a Python-based monitoring tool with a flexible data storage and retrieval system. It stores numeric time-series data in Whisper files with configurable precision. Data can be inserted and retrieved via its web interface or API. While not as full-featured as some competitors out of the box, it provides powerful aggregation, calculation, and visualization of metrics across many servers through its storage and processing architecture.
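Part of what makes inserting data into Graphite so approachable is Carbon's plaintext protocol: one metric per line over TCP. The sketch below assumes the conventional plaintext listener port (2003) and uses a placeholder hostname.

```python
import socket
import time

# Carbon plaintext protocol: "<dotted.path> <value> <unix_timestamp>\n"
def graphite_line(path, value, timestamp=None):
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_to_carbon(lines, host="graphite.example.com", port=2003):
    # One TCP connection can carry many metric lines at once.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))

line = graphite_line("servers.db1.mysql.threads_running", 7, 1380000000)
```

The dotted path doubles as the metric's location in Whisper's on-disk hierarchy, which is how Graphite groups and aggregates metrics across many servers.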
This document discusses techniques for scalable real-time processing and counting of streaming data. It outlines several approaches for counting distinct items and top items in a stream in real-time, including using hashes, bitmaps, Bloom filters, HyperLogLog counters, and Count-Min sketches. It also discusses using these techniques to power features like recommendations by analyzing item co-occurrence matrices from user activity streams.
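As one concrete instance of those sketches, a minimal Bloom filter (no false negatives, tunable false-positive rate) fits in a few lines; the sizes `m` and `k` below are illustrative, not tuned.

```python
import hashlib

class BloomFilter:
    """Space-bounded set membership: members are always found;
    non-members are occasionally reported present (false positives)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k independent bit positions from salted hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
for user in ["alice", "bob", "carol"]:
    bf.add(user)
print("bob" in bf)  # True: no false negatives, ever
```

HyperLogLog and Count-Min sketches follow the same spirit, trading exactness for constant memory, which is what makes them viable on unbounded streams.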
Spark Summit EU talk by Miha Pelko and Til Piffl (Spark Summit)
1) NorCom is an IT company that provides consulting services for big data and information management, with customers in automotive, public, media, and finance sectors.
2) Time series analysis of sensor data from vehicles is important for automotive R&D, but requires parallel processing due to the large volumes of data.
3) NorCom developed a Spark API called DaSense to simplify working with multi-sensor time series data at scale, and explored parallelizing state machine analysis on vehicle data using Spark.
Airflow - An Open Source Platform to Author and Monitor Data Pipelines (DataWorks Summit)
Airflow is an open source platform for authoring and monitoring data pipelines. It was developed at Airbnb to address challenges like opaque data lineage, steep learning curves as ecosystems grow, duplicated code, and scattered operational metadata. Airflow uses a Python-based DAG (directed acyclic graph) definition to programmatically author pipelines. It has a rich CLI and web UI and uses technologies like Python, Celery, Flask, SQLAlchemy, and Jinja. Operators allow running tasks like SQL queries, transfers, and sensors. Airflow has been scaled to process thousands of tasks daily across many teams and companies.
InfluxDB is an open source time series database that is written in Go. It is designed for storing large amounts of time series data and providing rapid query results. Data is stored in measurements, which contain tags, fields, and a timestamp. Queries use a SQL-like language to retrieve and aggregate time series data. Continuous queries allow data to be resampled and written to a different measurement on a periodic basis.
Building Better Data Pipelines using Apache Airflow (Sid Anand)
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
This document compares the graph processing frameworks Dato and Spark GraphX. It details experiments run on four datasets from the Stanford Network Collection to test the frameworks' performance on triangle counting, PageRank, and connected components algorithms. The results show that Dato has faster execution times than GraphX for processing large graphs, though GraphX is free while Dato charges fees. Further experiments could compare the frameworks' abilities to combine graph and non-graph computations.
This document discusses metrics and the Graphite monitoring system. It describes the main components of Graphite including Carbon, which persists metrics to disk and supports replication and sharding. It also describes Whisper for data storage and aggregation, and the web interface for rendering graphs. The document provides an overview of how these components work together and tips for optimizing performance such as aggregating metrics before ingestion and controlling the metrics that can be sent. It also briefly mentions alternative time-series databases like InfluxDB and Cassandra that could be used in the future.
InfluxDB is an open source time series database written in Go that stores metric data and performs real-time analytics. It has no external dependencies. InfluxDB stores data as time series with measurements, tags, and fields. Data is written using a line protocol and can be visualized using Grafana, an open source metrics dashboard.
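A hedged sketch of building one line-protocol string follows; the measurement, tag, and field names are invented, and production clients additionally escape special characters and mark integer fields explicitly.

```python
# InfluxDB line protocol shape: measurement,tag=value field=value timestamp
# Tags are indexed metadata; fields carry the data; timestamp is optional.
def influx_line(measurement, tags, fields, timestamp=None):
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp is not None:
        line += f" {timestamp}"
    return line

influx_line("cpu", {"host": "db1", "region": "eu"},
            {"usage_idle": 93.2}, 1434055562000000000)
```

Because tags are indexed and fields are not, which keys go where determines query performance, the same design tension Graphite resolves with its dotted metric paths.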
Quick introduction about Apache Spark and how it fits in the cognitive world, how can we use it to help cognitive solutions as well as create distributed algorithms to predict and perform other machine learning tasks.
This document describes Artmosphere, a data analytics platform for artwork data. It contains information on Artmosphere's data sources, which include 26,000 artworks and 45,000 artists from Artsy.net. It also describes Artmosphere's data pipeline, cluster setup, and how it performs tasks like searching artwork by title using Spark and Elasticsearch and tracking trends in real-time using Spark Streaming and Cassandra. The document discusses challenges faced and provides background on the author.
This document discusses monitoring MySQL performance using StatsD and Graphite. It provides an overview of the tools and how they are used. StatsD collects metrics from applications and services and sends them to Graphite for storage and visualization. The document describes how a custom MySQL StatsD daemon was created to gather MySQL metrics and send them to StatsD in real-time for high granularity monitoring and graphing in Graphite.
MySQL performance monitoring using Statsd and Graphite (DB-Art)
This session will explain how you can leverage the MySQL-StatsD collector, StatsD and Graphite to monitor your database performance with metrics sent every second. In the past few years Graphite has become the de facto standard for monitoring large and scalable infrastructures.
This session will cover the architecture, functional basics and dashboard creation using Grafana. MySQL-StatsD is really easy to set up and configure. It will allow you to fetch your most important metrics from MySQL, run your own custom queries against your production data and, if necessary, transform this data into something that can be used as a metric. Having this data at a fine granularity allows you to correlate your production data and system metrics with your MySQL performance metrics.
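The fetch-and-transform step can be sketched as follows. The samples are hard-coded stand-ins for what a collector would read from MySQL's `SHOW GLOBAL STATUS`, and emitting the rates as StatsD gauges is one possible encoding, not necessarily the one MySQL-StatsD uses.

```python
# MySQL exposes most counters as ever-increasing totals; to graph them
# per second we diff successive samples taken at a known interval.
def per_second_rates(previous, current, interval_seconds=1):
    rates = {}
    for name, value in current.items():
        if name in previous:
            rates[name] = (value - previous[name]) / interval_seconds
    return rates

def to_statsd(rates, prefix="mysql"):
    # Encode each rate as a StatsD gauge line: "mysql.<counter>:<rate>|g"
    return [f"{prefix}.{k}:{v}|g" for k, v in sorted(rates.items())]

sample_t0 = {"Com_select": 500, "Questions": 1000}
sample_t1 = {"Com_select": 800, "Questions": 1450}
print(to_statsd(per_second_rates(sample_t0, sample_t1)))
# ['mysql.Com_select:300.0|g', 'mysql.Questions:450.0|g']
```

Polling every second and diffing like this is what gives the one-second granularity the session describes, without MySQL itself having to keep any rate state.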
MySQL-StatsD is a daemon written in Python that was created during one of the hackdays at my previous employer (Spil Games) to solve the problem of fetching data from MySQL with a lightweight client and sending metrics to StatsD. I currently maintain this open-source project on GitHub; as the creator of the project, it is my duty to look after it.
On June 11 Thomas Dinsmore gave a nice outline of the tools and technologies out there for handling analytics in Hadoop. It is a must watch for anyone looking into what advanced analytics Hadoop could deliver.
Please find video and slides below.
Synopsis
What is the state of play for advanced analytics in Hadoop? A year ago, options included "roll your own" and little else; today there are a number of serious open source and commercial options available, with new capabilities announced daily.
In this presentation, we begin with a brief overview of use cases for advanced analytics and a discussion of what types of analytics must run in Hadoop. We continue with an overview of available architectures. The presentation concludes with a hype-free survey of available open source and commercial software for advanced analytics in Hadoop.
Bio
Thomas W. Dinsmore is Director of Product Management for Revolution Analytics, a company that provides commercial support and services for open source R. In this role, Mr. Dinsmore closely tracks the market for commercial and open source software on all platforms, including Hadoop. Prior to joining Revolution Analytics, Mr. Dinsmore served as an Analytics Solution Architect for IBM Big Data, and as a Principal Consultant for Razorfish and SAS.
Mr. Dinsmore has hands-on experience with leading commercial and open source tools for advanced analytics, including SAS, SPSS, R, Oracle Data Mining across a range of platforms, including Hadoop, Netezza, Teradata and Oracle. He is certified in SAS 9.
In his career, Mr. Dinsmore has worked with more than 500 enterprises in the United States, Canada, Mexico, Venezuela, Chile, Brazil, the United Kingdom, Belgium, Italy, Turkey, Israel, Malaysia and Singapore.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune its performance.
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened-up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.
In this session we’ll focus on technical foundations around SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work along with more specialised underlying storage that each now work best with – and we’ll take a look to the future to see how SQL querying, data integration and analytics are likely to come together in the next five years to make Hadoop the default platform running mixed old-world/new-world analytics workloads.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevAltinity Ltd
The document outlines an agenda for a ClickHouse event, including presentations on ClickHouse features and usage. ClickHouse is introduced as a fast, flexible, and free open source columnar database management system. It is described as being very fast for analytical queries, with the ability to ingest and aggregate data at low latency and perform distributed computations and fault-tolerant data warehousing at large scale. Successful production deployments are listed in domains like analytics, advertising, operations, and security. The growth and adoption of ClickHouse is discussed.
Pivotal OSS meetup - MADlib and PivotalRgo-pivotal
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib along with the ease of use of R.
This document summarizes and compares several open source monitoring tools: Nagios, Graphite, StatsD, Logstash, and Sensu. Nagios is introduced as a commonly used tool that some love and some find frustrating. Graphite is described as a tool for storing and graphing time-series data. StatsD aggregates counters and timers and sends them to backend services like Graphite. Logstash is a tool for managing logs and events that can input, filter, and output data. Sensu is a monitoring router that connects check scripts to handler scripts to alert on or process monitoring data. Examples are given for each tool, along with what types of metrics to collect.
This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
This document discusses pandas, a popular Python library for data analysis, and its limitations. It introduces Badger, a new project from DataPad that aims to address some of pandas' shortcomings like slow performance on large datasets and lack of tight database integration. The creator describes Badger as using compressed columnar storage, immutable data structures, and C kernels to perform analytics queries much faster than pandas or databases on benchmark tests of a multi-million row dataset. He envisions Badger becoming a distributed, multicore analytics platform that can also be used for ETL jobs.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk, we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system using tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll mention all of the aspects that you should take into consideration when monitoring a distributed system once you’re using tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
We’ll cover the simplest solution with your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
This document provides lessons learned from building the Dutch public broadcasting company's website omroep.nl. Key points include using Ruby on Rails, BDD with RSpec and Cucumber, caching everything possible, rescuing errors, testing extensively, and handling large amounts of external data from various XML/RSS feeds and APIs. Performance was optimized through techniques like moving static assets to a front proxy, page caching, fragment caching, and using Memcache. The team of 6 people built the CMS from scratch over 6 months.
This document summarizes lessons learned from building the Dutch public broadcasting company's website omroep.nl. Key points include:
- The site was built using Ruby on Rails with 6 developers over 6 months to handle 30,000-40,000 daily pageviews and traffic spikes.
- Extensive testing was done including over 2,000 RSpec tests and 410 Cucumber scenarios to help ensure quality.
- Caching was heavily used to improve performance including caching pages, fragments, and external data from feeds.
- Resilience was important given the large amounts of external data from various sources, and errors were rescued and logged.
- Ongoing monitoring and optimization was needed to
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
Christian Coté is an ETL architect and developer with experience using tools like DTS/SSIS, Hummungbird Genio, Informatica, and Datastage. He has worked in domains including pharmaceuticals, finance, insurance, and manufacturing. He specializes in data warehousing and business intelligence and is a Microsoft MVP for SQL Server.
Christian Coté is an ETL architect and developer with experience using tools like DTS/SSIS, Hummungbird Genio, Informatica, and Datastage. He has worked in domains including pharmaceuticals, finance, insurance, and manufacturing. He specializes in data warehousing and business intelligence and is a Microsoft MVP for SQL Server. He is also the co-leader of the Montreal SQL Pass chapter.
Similar to MySQL performance monitoring using Statsd and Graphite (PLUK2013) (20)
Percona Live London 2014: Serve out any page with an HA Sphinx environmentspil-engineering
Sphinx is a full-text search engine that Spil Games uses to provide fast and complex search across their databases and indexes. Some key ways Spil Games uses Sphinx include searching for games by title or URL, finding friends across their networks, and filtering search results based on browser capabilities. To ensure high availability, Spil Games implements distributed and mirrored Sphinx indexes across multiple nodes and uses load balancers. Benchmarking shows Sphinx significantly outperforms MySQL for certain search queries.
This document discusses Spil Games' use of Galera Replicator to improve the high availability and scalability of their MySQL databases. Some key points:
- Spil Games is migrating legacy master-master MySQL databases to Galera for synchronous replication and high availability.
- Their SSP (Storage Platform) is being moved from asynchronous master-master to Galera to avoid single points of failure.
- Challenges included backup procedures, flow control during replication, and schema changes during upgrades.
- Future plans include using Galera with OpenStack for database-as-a-service and exploring WAN replication.
Retaining globally distributed high availabilityspil-engineering
This document summarizes a presentation about how Spil Games achieves high availability for their globally distributed databases. It discusses using master-slave replication, multi-master replication, database clustering, and geographic redundancy. It also covers scaling out through horizontal partitioning and federated partitioning. The key points are abstracting the storage layer using a platform built with Erlang, utilizing MySQL and other databases, and sharding data using a bucket model.
Outgrowing an internet startup: database administration in a fast growing com...spil-engineering
Spil Games is a social gaming company that was founded in 2001 and has grown to 350 employees with 170 million monthly active users. As the company has grown from a startup to a large organization, its database infrastructure has struggled to scale effectively. The presentation outlines Spil Games' journey to professionalize its database engineering practices and implement a new "Spil Storage Platform" to support continued growth through a global ID and bucket-based data model with functional and geographic sharding. This will allow horizontal scaling through a master-master configuration across multiple database clusters.
This document provides an overview of a disco workshop on parallel computing and MapReduce. The workshop covers an introduction to parallel computing including algorithms, programming models, and applications. It then introduces MapReduce, covering its history, examples, and execution overview. The workshop teaches how to write MapReduce jobs with Disco and includes an example of CDN log processing. It aims to provide attendees with the skills needed to get started with Disco for large-scale data processing.
This document discusses total cost of ownership (TCO) for database infrastructure. It begins by introducing Spil Games and defining TCO. The main costs that drive TCO are then outlined, including capital expenses (CAPEX) like hardware purchases, licensing fees, and high availability solutions, as well as operating expenses (OPEX) such as hosting, support, and power. Specific examples of these costs are provided for a cluster of 6 database nodes over 5 years. The document concludes by discussing potential improvements like moving from hard disk drives to solid state drives to reduce costs over the long term.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
4. Facts
• Company founded in 2001
• 350+ employees worldwide
• 180M+ unique visitors per month
• Over 60M registered users
• 45 portals in 19 languages
• Casual games
• Social games
• Real-time multiplayer games
• Mobile games
• 35+ MySQL clusters
• 60k queries per second (3.5 billion qpd)
8. Existing monitoring systems we use(d)
• Opsview/Nagios (mainly availability)
• Cacti (using Baron Schwartz/Percona templates)
• MONYog
• Good ol’ RRD
9. Challenges
• Problems with existing systems
• Stats gathering through polling
• Data gets averaged out
• (Host) checks are run serially
• A slowdown in a run means less or no data
• Setting up an SSH connection is slow
• Low granularity (1 to 5 minutes)
• Hardly scalable
• Difficult to correlate metrics
10. Difficult to add a new metric
host065
bash-3.2# netstat -s | grep "listen queue"
26 times the listen queue of a socket overflowed
host066
bash-3.2# netstat -s | grep "listen queue"
33 times the listen queue of a socket overflowed
12. What is Collectd?
• Unix daemon that gathers system statistics
• Over 90 (input/output) plugins
• Plugin to send metrics to Graphite/Carbon
• Very useful for system metrics
14. What is StatsD?
• Front-end proxy for Graphite/Carbon (by Etsy)
• NodeJS daemon (implementations also exist in other languages)
• Receives metrics over UDP (on localhost)
• Buffers metrics locally
• Periodically flushes data to Graphite/Carbon (TCP)
• Client libraries available in about any language
• Send any metric you like!
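A StatsD packet is just `name:value|type` in a UDP datagram, which is why client libraries exist for about any language. Below is a minimal Python sketch of such a client; the metric names are illustrative and 8125 is the default StatsD port:

```python
import socket

def send_stat(name, value, metric_type, host="localhost", port=8125):
    """Send one metric to StatsD over UDP: fire-and-forget, non-blocking.

    metric_type is "c" (counter), "g" (gauge), "ms" (timing) or "s" (set).
    """
    packet = "%s:%s|%s" % (name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No connection setup; if nothing is listening, the datagram is simply lost
        sock.sendto(packet.encode("utf-8"), (host, port))
    finally:
        sock.close()
    return packet

# Illustrative metric names:
send_stat("prd.mysql.db01.status.connections", 1, "c")       # counter
send_stat("prd.mysql.db01.status.threads_running", 12, "g")  # gauge
```

Because UDP never blocks on the receiver, the sending application pays almost nothing even when the StatsD daemon is down.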
18. What is Graphite?
• Highly scalable real-time graphing system
• Collects numeric time-series
• Backend daemon Carbon
• Carbon-cache: receives data
• Carbon-aggregator: aggregates data
• Carbon-relay: replication and sharding
• RRD or Whisper database
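When StatsD flushes to carbon-cache it uses Graphite's plaintext protocol: one `metric value timestamp` line per datapoint over TCP (port 2003 by default). A small sketch; the sending half assumes a reachable carbon-cache:

```python
import socket
import time

def carbon_lines(metrics, timestamp=None):
    """Format datapoints as Graphite plaintext: 'metric value timestamp' lines."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "".join("%s %s %d\n" % (name, value, ts)
                   for name, value in sorted(metrics.items()))

def send_to_carbon(metrics, host="localhost", port=2003):
    # One TCP connection per flush; carbon-cache persists the points to Whisper
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(carbon_lines(metrics).encode("utf-8"))
```

For example, `carbon_lines({"prd.mysql.db01.status.connections": 42}, timestamp=1380000000)` yields the line `prd.mysql.db01.status.connections 42 1380000000`.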
19. Graphite’s capabilities
• Each metric is in its own bucket
• Periods make folders
• prod.syseng.mmm.<hostname>.admin_offline
• Metric types
• Counters
• Gauges
• Retention can be set using a regex
• [mysql]
• pattern = ^prod.syseng.mysql..*$
• retentions = 2s:1d,1m:3d,5m:7d,1h:5y
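A retention line like `2s:1d,1m:3d,5m:7d,1h:5y` packs several precision:duration archive pairs; Whisper expands the s/m/h/d/y suffixes into seconds. A simplified sketch of that expansion (not Whisper's actual parser, which also accepts plain integers):

```python
# Seconds per unit suffix, mirroring Whisper's retention notation
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_retention(spec):
    """Expand '2s:1d,1m:3d,...' into (precision_seconds, duration_seconds) pairs."""
    def seconds(token):
        return int(token[:-1]) * UNITS[token[-1]]
    return [tuple(seconds(part) for part in archive.split(":"))
            for archive in spec.split(",")]

parse_retention("2s:1d,1m:3d,5m:7d,1h:5y")
# -> [(2, 86400), (60, 259200), (300, 604800), (3600, 157680000)]
```

So the example above keeps 2-second resolution for a day, then progressively coarser archives out to five years.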
24. Why use StatsD over Collectd?
• MySQL plugin for Collectd
• Sends SHOW STATUS
• No INNODB STATUS
• Plugin not flexible
• DBI plugin for Collectd
• Metrics based on columns
• Different granularity needed
• Separate daemon (with persistent connection)
• StatsD is easy as ABC
25. MySQL StatsD daemon
• Written in Python
• Rewritten and open sourced during a hackday
• Gathers data every 0.5 seconds
• Sends to StatsD (localhost) after every run
• Easy configuration
• Persistent connection
• Baron Schwartz’s InnoDB status parser (Cacti poller)
• Other interesting metrics and counters
• Information Schema
• Performance Schema
• MariaDB specific
• Galera specific
• If you can query it, you can use it as a metric!
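The gather-and-flush cycle above boils down to: query MySQL, turn each (name, value) row into a StatsD packet, push it over UDP, sleep, repeat. A condensed sketch of that loop, not the actual mysql-statsd code (the real daemon keeps a persistent MySQL connection and runs the InnoDB status parser over SHOW ENGINE INNODB STATUS):

```python
import socket
import time

def rows_to_packets(rows, prefix="prd.mysql", hostname="db01", metric_type="g"):
    """Turn SHOW GLOBAL STATUS style (name, value) rows into StatsD packets."""
    return ["%s.%s.status.%s:%s|%s" % (prefix, hostname, name.lower(), value, metric_type)
            for name, value in rows]

def run_forever(fetch_rows, addr=("localhost", 8125), sleep_interval=0.5):
    """fetch_rows is a callable that queries MySQL, e.g. SHOW GLOBAL STATUS."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        for packet in rows_to_packets(fetch_rows()):
            sock.sendto(packet.encode("utf-8"), addr)
        time.sleep(sleep_interval)  # 0.5 s between runs, as above
```

The UDP socket is created once and reused, so each half-second run costs only a handful of sendto calls.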
27. Example configuration
[daemon]
logfile = /var/log/mysql_statsd/daemon.log
pidfile = /var/run/mysql_statsd.pid
[statsd]
host = localhost
port = 8125
prefix = prd.mysql
include_hostname = true
[mysql]
host = localhost
username = mysqlstatsd
password = ub3rs3cr3tp@ss!
stats_types = status,variables,innodb,commit
query_variables = SHOW GLOBAL VARIABLES
interval_variables = 10000
query_status = SHOW GLOBAL STATUS
interval_status = 500
query_innodb = SHOW ENGINE INNODB STATUS
interval_innodb = 10000
query_commit = COMMIT
interval_commit = 5000
sleep_interval = 500
[metrics]
variables.max_connections = g
status.max_used_connections = g
status.connections = c
innodb.spin_waits = c
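The configuration is plain INI, so Python's standard library can read it; each key under [metrics] maps a metric name to the StatsD type it is sent as ("g" = gauge, "c" = counter). A small sketch using `configparser` (the embedded string just repeats part of the example above):

```python
import configparser

EXAMPLE = """\
[statsd]
host = localhost
port = 8125
prefix = prd.mysql

[metrics]
variables.max_connections = g
status.connections = c
innodb.spin_waits = c
"""

cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE)

# Each [metrics] entry decides how the daemon tags the value for StatsD:
# gauges report the current value, counters report a rate per flush interval.
metric_types = dict(cfg.items("metrics"))
```

Picking "g" versus "c" matters: max_used_connections is a high-water mark (gauge), while a monotonically increasing counter like Connections only makes sense as a rate.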
28. MySQL Multi Master patch
• Perl (Net::Statsd)
• Sends any status change to StatsD (localhost)
• Non-blocking (thanks to UDP)
• Draw as infinite in Graphite
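The "draw as infinite" trick: each state change fires a single counter increment, and in Graphite the series is wrapped in the drawAsInfinite() render function so every nonzero point becomes a full-height vertical line, which makes failovers easy to spot overlaid on any graph. A Python sketch of the sending side (the metric path is illustrative; the patch itself uses Perl's Net::Statsd):

```python
import socket

def report_state_change(metric, statsd_addr=("127.0.0.1", 8125)):
    # One counter tick per master/slave state change; UDP keeps it
    # non-blocking, so the database server never waits on monitoring.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(("%s:1|c" % metric).encode("ascii"), statsd_addr)

# Corresponding Graphite render target, e.g.:
#   drawAsInfinite(prd.mysql.db1.replication.state_change)
```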
31. What is important for you?
• Identify your KPIs
• Don’t graph everything
• More graphs == less overview
• Combine metrics
• Stack clusters
32. Correlate!
• Include other metrics into your graphs
• Deployments
• Failover(s)
• Combine application metrics with your database
• Other influences
• Launch of a new game
• Apple keynotes
33. Graphing
• Graphite Graphing Engine
• DIY
• Giraffe
• Readily available dashboards/tools
• Graph Explorer (vimeo)
• Team Dashboard
• Skyline (Etsy)
• Dashing (Shopify)
47. What challenges do we have?
• Improve MySQL-statsd (extensive issue list)
• No zoom in on graphs
• Get Skyline to work and not cry wolf
• Machine learning
• Eternal hunger for more metrics
• Abuse of the system
• Hitting limits of SSD write performance
• Virident? Fusion-IO?
• Carbon OpenTSDB Graphite-web?
48. What lessons have we learned?
• Persistent connections + REPEATABLE READ
• InnoDB history list length skyrocketed
• More hackdays are needed!
• Too many metrics slows down graphing
• Too many metrics can kill a host
• EstatsD for Erlang
51. Thank you!
• Presentation can be found at:
http://spil.com/pluk2013
• MySQL Statsd can be found at:
http://spil.com/mysqlstatsd
http://github.com/spilgames/mysql-statsd
• If you wish to contact me:
art@spilgames.com