The revolt against SQL continues at a steady, though considerably slower, pace. Bespoke database software seems to crop up daily in the name of performance or functionality. This talk will examine the ever-growing field of monitoring systems and their respective databases, and look in depth at how Postgres can be used in a number of these roles. Systems of this nature are typically tasked with collecting and storing metrics from your infrastructure, drawing pretty graphs, and nagging you when things break.
The forms of data stored by these systems are nothing to be afraid of; they often include:
- Time series metrics - the history of a measurement over time, e.g. temperatures
- Logs - unstructured text emitted by applications, operating systems and hardware
- Events - schema-less but well structured notifications
An assertion of this talk is that for the majority of use cases, Postgres is more than capable of storing all of this data. We will attempt to replace numerous well-known pieces of software with just one Postgres database. Of course we are told to use the right tool for the job, but having to learn and operate only a single tool is a huge operational advantage.
We’ll get quite technical in this talk, taking a look at the data models and access patterns required and how these can be fitted into the general-purpose environment of Postgres. It is also constructive to look at what can be problematic, not just the positives, and at why many turn to bespoke solutions instead.
From docker to kubernetes: running Apache Hadoop in a cloud native way - DataWorks Summit
Creating containers for an application is easy (even if it’s a good old distributed application like Apache Hadoop): just a few steps of packaging.
The hard part isn't packaging: it's deploying
How can we run the containers together? How do we configure them? How do the services in the containers find and talk to each other? How do you deploy and manage clusters with hundreds of nodes?
Modern cloud-native tools like Kubernetes or Consul/Nomad can help a lot, but they can be used in different ways.
In this presentation I will demonstrate multiple solutions for managing containerized clusters with different cloud-native tools, including kubernetes and docker-swarm/compose.
No matter which tools you use, the same questions of service discovery and configuration management arise. This talk will show the key elements needed to make that containerized cluster work.
Tools:
kubernetes, docker-swarm, docker-compose, consul, consul-template, nomad
together with: Hadoop, Yarn, Spark, Kafka, Zookeeper, Storm….
References:
https://github.com/flokkr
Speaker
Marton Elek, Lead Software Engineer, Hortonworks
TL;DR: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and dozens of Spark job optimisations, including the most spectacular: from 2500 gigs of RAM to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the amount of memory is increased. If processing is too slow, the number of executors is increased. This works for a while, but sooner or later you end up with a whole cluster fully utilized in an inefficient way.
During the presentation, we will share our lessons learned and performance improvements on Spark jobs, including the most spectacular: from 2500 gigs of RAM to 240. We will also answer questions like:
- How do pySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?
and many more.
Expand data analysis tool at scale with Zeppelin - DataWorks Summit
Apache Zeppelin is one of the tools that help users and developers enrich their analysis with beautiful visualization without any additional work. Recently, however, teams and corporations have started to use it as a team and corporate tool, and they are suffering. It should therefore be improved to support multiple teams and corporate use, beyond being an individual tool.
I will explain how to configure Apache Zeppelin and its useful interpreters, including Spark and JDBC, to help multiple users and teams use it simultaneously, and how to adopt LDAP and Kerberos to authenticate and authorize valid users. The presentation also includes a specific example from LINE's case, what had to be developed to realize these use cases, and the feature roadmap for making it a more powerful tool at a production level. For a long time, Apache Zeppelin has focused on making results beautiful, but now it should work on becoming a more convenient tool by hiding some sophisticated settings and providing easier configuration. JONGYOUL LEE, Software Development Engineer, LINE
HBaseCon 2012 | HBase for the World's Libraries - OCLC - Cloudera, Inc.
WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently in the process of transitioning WorldCat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for managing HBase in an environment that is equal parts end-user facing and batch.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... - Sumeet Singh
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting the problems users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ... - Gruter
Apache Tajo is an open source big data warehouse system on Hadoop. These slides show two major efforts to improve performance in the Tajo project. The first is query optimization, including cost-based join ordering and progressive optimization. The second is JIT-based vectorized processing.
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting, Cloudera - Cloudera, Inc.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and the like.
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop - Gruter
Apache Tajo is an open source big data warehouse system on Hadoop. These slides were presented at Big Data Camp LA 2014 and give an introduction to Apache Tajo and the current status of the project, including cost-based optimization and the currently supported SQL feature set.
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python - Miklos Christine
Apache Spark is the next big data processing tool for data scientists. As seen in a recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's architecture: what's out now and what's in Spark 2.0. Spark APIs: the most common APIs used by Spark. Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset. SparkSQL analytics + visualizations of the dataset using Matplotlib. Sentiment analysis on Reddit comments.
Procella: A fast versatile SQL query engine powering data at YouTube - DataWorks Summit
Procella is a distributed SQL query engine built for flexible workloads within YouTube. Procella is highly scalable and is designed primarily to serve high volumes of queries at low latencies while ingesting realtime data. It is used to serve video/channel statistics for users watching videos as well as OLAP style queries for video analytics (youtube.com/analytics) and public dashboards (artists.youtube.com). Procella also supports complex SQL operations over structured data and is used by YouTube analysts for ad-hoc analysis.
Procella runs on the Google distributed computing stack, working directly on data residing in accessible columnar formats on Colossus, the Google distributed file system. The underlying data is thus producible and directly consumable by other tools such as MapReduce and Dremel. The compute runs directly on shared machines in Borg clusters and does not need dedicated virtual (or physical) machines. These features allow Procella to fit nicely into the Google ecosystem, scale compute and storage independently, and gracefully handle evictions and machine failures without compromising availability or performance.
Procella has been in production for over two years and is currently serving billions of SQL queries per day across various workloads at YouTube and several other Google product areas.
Speaker
Aniket Mokashi, Google, Senior Software Engineer
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... - Databricks
Effectively leveraging fast networking and storage hardware (e.g., RDMA, NVMe, etc.) in Apache Spark remains challenging. Current ways to integrate the hardware at the operating system level fall short, as the hardware performance advantages are shadowed by higher layer software overheads. This session will show how to integrate RDMA and NVMe hardware in Spark in a way that allows applications to bypass both the operating system and the Java virtual machine during I/O operations. With such an approach, the hardware performance advantages become visible at the application level, and eventually translate into workload runtime improvements. Stuedi will demonstrate how to run various Spark workloads (e.g, SQL, Graph, etc.) effectively on 100Gbit/s networks and NVMe flash.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes, and chaining of incremental processing in Hadoop.
PostgreSQL worst practices, version FOSDEM PGDay 2017 by Ilya Kosmodemiansky - PostgreSQL-Consulting
This talk is prepared as a deck of slides, where each slide describes a really bad way people can screw up their PostgreSQL database, together with a weight: how frequently I have seen that kind of problem. Right before the talk I will reshuffle the deck, draw ten random slides, and explain why such practices are bad and how to avoid running into them.
Linux internals for Database administrators at Linux Piter 2016 - PostgreSQL-Consulting
Input/output performance problems have been on the daily agenda for DBAs for as long as databases have existed. The volume of data grows rapidly, and you need to get your data from the disk fast and, moreover, to the disk fast. For most databases there is a more or less easy-to-find checklist of recommended Linux settings to maximize IO throughput. In most cases that checklist is good enough, but it is always better to understand how things work, especially if you run into corner cases. This talk is about how IO in Linux works, how database pages travel from the disk to the database's own shared memory and back, and what mechanisms exist to control this. We will discuss memory structures, swap and page-out daemons, filesystems, schedulers and IO methods. Some fundamental differences in IO approaches between PostgreSQL, Oracle and MySQL will be covered.
The ninja elephant, scaling the analytics database in Transferwise - Federico Campoli
Business intelligence and analytics are at the core of any great company, and Transferwise is no exception.
The talk will start with a brief history of the legacy analytics implemented with MySQL and how we scaled up performance using PostgreSQL. In order to get fresh data from the core MySQL databases in real time, we used a modified version of pg_chameleon which also obfuscated the PII data.
The talk will also cover the challenges and the lessons learned by the developers and analysts when bridging MySQL with PostgreSQL.
An investigation of how PostgreSQL and its latest capabilities (JSONB data type, GIN indexes, full-text search) can be used to store, index and perform queries on structured bibliographic data such as MARC21/MARCXML, breaking the dependence on proprietary, arcane or obsolete software products.
Talk presented at FOSDEM 2016 in Brussels on 31/01/2016. This is a very practical & hands-on presentation with example code which is certainly not optimal ;)
My three-essay dissertation investigates recent practices of participatory infrastructure monitoring and their implications for the governance of urban infrastructure services. By introducing the concept of infrastructure legibility, the three essays explore ways to make waste systems and their governance more legible: their formal structure, their informal practices, and the interactions between the user and the provider, the individual and the system.
Part 1: Putting matter in place – monitoring waste transportation
Part 2: Tacit arrangements – data reporting challenges for recycling cooperatives in Brazil
Part 3: Infrastructure Legibility – a comparative analysis of open311-based citizen feedback systems
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
Presentation of an investigation into how Python's RDFLib and SQLAlchemy can be used to leverage PostgreSQL's capabilities to provide a persistent storage back-end for Graphs, and become the elusive practical RDF triple store for the Semantic Web (or simply help you export your data to someone who's expecting RDF)!
Talk presented at FOSDEM 2017 in Brussels on 04-05/02/2017. Practical & hands-on presentation with example code which is certainly not optimal ;)
Video:
MP4: http://video.fosdem.org/2017/H.1309/postgresql_semantic_web.mp4
WebM/VP8: http://ftp.osuosl.org/pub/fosdem/2017/H.1309/postgresql_semantic_web.vp8.webm
IT Executive Survey: Strategies for Monitoring IT Infrastructure & Services - CA Technologies
This slide deck summarizes results from a survey to uncover the key challenges IT executives face in monitoring IT infrastructure & services. 47% of the respondents use more than 5 IT monitoring tools to monitor their IT infrastructure, while scalability and the pain of jumping from one monitoring tool to another were among the key challenges mentioned. Learn more about CA Nimsoft Monitor Snap at http://www.ca.com/snap
How PostgreSQL's interaction with the disk works, what performance problems can arise, and how to solve them by choosing suitable hardware and tuning the operating system and PostgreSQL.
Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at Li... - Grier Johnson
A high level overview of the pieces you need to have a useful metrics and alerting infrastructure and some key lessons learned when building these pieces out at LinkedIn.
Microsoft Infrastructure Monitoring using OpManager - ManageEngine
Microsoft is a well-known vendor in the enterprise IT market. This presentation explains how to monitor Microsoft products using ManageEngine OpManager, and covers the basics of Microsoft infrastructure monitoring.
The open source market is getting overcrowded with different network monitoring solutions, and not without reason: monitoring your infrastructure becomes more important each day, and you have to know what's going on for your boss, your customers and yourself. Nagios started the evolution, but today OpenNMS, Zabbix, Zenoss, GroundWork, Hyperic and various others are showing up in the market. Do you want lightweight or feature-full? How far do you want to go with your monitoring: just the OS level, or do you want to dig into your applications? Do you want to know how many queries per second your MySQL database is serving, or the internal state of your JBoss, or to be alerted if the OOM killer will start working soon? This presentation will guide the audience through the different alternatives, based on our experiences in the field. We will look at both alerting and trending, and at how easy or difficult it is to deploy such an environment.
Riga dev day: Lambda architecture at AWS - Antons Kranga
My recent talk at Riga DevDay about Lambda architecture at AWS. It illustrates a few design simplifications that we gain when we implement the Lambda Architecture in a cloud-native way.
Don’t Forget About Your Past - Optimizing Apache Druid Performance With Neil Buesing - HostedbyConfluent | Current 2022
Businesses need to react to results immediately; to achieve this, real-time processing is becoming a requirement in many analytic verticals. But sometimes, the move from batch to real-time can leave you in a pinch. How do you handle and correct mistakes in your data? How do you migrate a new system to real-time along with historical data?
Let’s start with how to run Apache Druid locally in your containerized development environment. While real-time events stream from Kafka into Druid, an S3-compliant store captures messages via Kafka Connect for historical processing. We then explore the performance implications when the real-time stream of events contains historical data, and the techniques to prevent those issues, leaving a high-performance analytic platform that supports both real-time and historical processing.
You’ll leave with the tools for doing real-time analytic processing and historical batch processing from a single source of truth. Your Druid cluster will have better rollups (pre-computed aggregates) and fewer segments, which reduces cost and improves query performance.
Solr Power FTW: Powering NoSQL the World Over - Alex Pinkin
Solr is an open source, Lucene-based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub, which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large-scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.
(BDT303) Running Spark and Presto on the Netflix Big Data Platform - Amazon Web Services
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Chris Larsen (Yahoo!)
This year we'll talk about the joys of the HBase Fuzzy Row Filter, new TSDB filters, expression support, Graphite functions and running OpenTSDB on top of Google’s hosted Bigtable. AsyncHBase now includes per-RPC timeouts, append support, Kerberos auth, and a beta implementation in Go.
Scaling up Uber's real-time data analytics - Xiang Fu
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
OSA Con 2022: Switching Jaeger Distributed Tracing to ClickHouse to Enable Advanced Performance Management - Altinity Ltd
Satbir Chahal - OpsVerse
Our team switched our Jaeger (open source project used for distributed tracing) storage backend to ClickHouse (from Cassandra), which opened the door to a world of advanced analytics that we can run and provide our users. This talk will describe the journey from the switch, the learning curve, the challenges, and the eventual wins.
MySQL performance monitoring using StatsD and Graphite - DB-Art
This session will explain how you can leverage the MySQL-StatsD collector, StatsD and Graphite to monitor your database performance with metrics sent every second. In the past few years Graphite has become the de facto standard for monitoring large and scalable infrastructures.
This session will cover the architecture, functional basics and dashboard creation using Grafana. MySQL-StatsD is really easy to set up and configure. It will allow you to fetch your most important metrics from MySQL, run your own custom queries to parse your production data and, if necessary, transform this data into something different that can be used as a metric. Having this data at a fine granularity allows you to correlate your production data and system metrics with your MySQL performance metrics.
MySQL-StatsD is a daemon written in Python that was created during one of the hackdays at my previous employer (Spil Games) to solve the issue of fetching data from MySQL using a lightweight client and sending metrics to StatsD. I currently maintain this open source project on GitHub, as it is my duty as creator of the project to look after it.
Transforming Mobile Push Notifications with Big Data - Plumbee
How we at Plumbee collect and process data at scale and how this data is used to send relevant mobile push notifications to our players to keep them engaged.
Presented as part of a Tech Talk: http://engineering.plumbee.com/blog/2014/11/07/tech-talk-push-notifications-big-data/
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
James Sweeney presents on "PuppetDB: A Single Source for Storing Your Puppet Data" at Puppet User Group NYC.
Video: http://www.youtube.com/watch?v=HTr4b02aU7A
Puppet NYC: http://www.meetup.com/puppetnyc-meetings/
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night - ScyllaDB
Opera chose Scylla over Cassandra to sync the data of millions of browsers to a back-end data repository. The results of the migration, and further optimizations they made in their stack, helped Opera gain better latency/throughput and lower resource usage, beyond their expectations.
Attend this session to learn how to:
- Migrate your data in a sane way, without any downtime
- Connect a Python+Django web app to Scylla
- Use intranode sharding to improve your application
12. Monitoring Requirements
Fault finding and alerting
Notify me when a server or service is
unavailable, a disk needs replacing, ...
Fault post-mortem, pre-emption
Why did the outage occur, and what can we
do to prevent it next time?
13. Monitoring Requirements
Utilisation and efficiency analysis
Is all the hardware we own being used?
Is it being used efficiently?
Performance monitoring and profiling
How long do my web/database requests take?
14. Monitoring Requirements
Auditing (security, billing)
Tracking users’ use of the system
Auditing access to systems or resources
Decision making, future planning
What is the expected growth in data or users?
Which parts of the current system are most used?
16. Existing Tools
Checking and Alerting
Agents check on machines or services
Report centrally, notify users via dashboard
Store history of events in database
39. Existing Tools
Commendable “right tool for the job” attitude, but…
How about Postgres?
Fewer points of failure
Fewer places to backup
Fewer redundancy protocols
One set of consistent data semantics
Re-use existing operational knowledge
51. Postgres for Metrics
Combine Series
average over all
for {series=cpu}
[time range/interval]
Read Series
for each {type}
for {series=cpu}
[time range/interval]
54. Postgres for Metrics
"metric": {
"timestamp": 1232141412,
"name": "cpu.percent",
"value": 42,
"dimensions": { "hostname": "dev-01" },
"value_meta": { … }
}
JSON Ingest Format
Known, well defined structure
Varying set of dimensions key/values
55. Postgres for Metrics
CREATE TABLE measurements (
timestamp TIMESTAMPTZ,
name VARCHAR,
value FLOAT8,
dimensions JSONB,
value_meta JSON
);
Basic Denormalised Schema
Straightforward mapping onto input data
Data model for all schemas
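As a minimal sketch (not from the original slides), the slide 54 JSON metric maps onto this table like so, with TO_TIMESTAMP converting the epoch seconds:
INSERT INTO measurements (timestamp, name, value, dimensions, value_meta)
VALUES (
TO_TIMESTAMP(1232141412), -- epoch seconds from "timestamp"
'cpu.percent', -- "name"
42, -- "value"
'{"hostname": "dev-01"}'::JSONB, -- "dimensions"
'{}'::JSON -- placeholder for "value_meta"
);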
56. Postgres for Metrics
SELECT
TIME_ROUND(timestamp, 60) AS timestamp,
AVG(value) AS avg
FROM
measurements
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T01:00:00Z'
AND name = 'cpu.percent'
AND dimensions @> '{"hostname": "dev-01"}'::JSONB
GROUP BY
1 -- the rounded timestamp; a bare "timestamp" would bind to the raw input column
Single Series Query
One hour window | Single hostname
Measurements every 60 second interval
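Note that TIME_ROUND is not a Postgres built-in, and its definition does not appear in this transcript; a minimal sketch of such a helper, rounding a timestamp down to an interval given in seconds, could be:
CREATE FUNCTION TIME_ROUND (ts TIMESTAMPTZ, seconds INT)
RETURNS TIMESTAMPTZ LANGUAGE SQL IMMUTABLE AS $_$
-- Round down to the start of the enclosing interval
SELECT TO_TIMESTAMP(FLOOR(EXTRACT(EPOCH FROM ts) / seconds) * seconds);
$_$;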
57. Postgres for Metrics
SELECT
TIME_ROUND(timestamp, 60) AS timestamp,
AVG(value) AS avg,
dimensions ->> 'hostname' AS hostname
FROM
measurements
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T01:00:00Z'
AND name = 'cpu.percent'
GROUP BY
1, hostname -- 1 = the rounded timestamp
Group Multi-Series Query
One hour window | Every hostname
Measurements every 60 second interval
58. Postgres for Metrics
SELECT
TIME_ROUND(timestamp, 60) AS timestamp,
AVG(value) AS avg
FROM
measurements
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T01:00:00Z'
AND name = 'cpu.percent'
GROUP BY
1 -- the rounded timestamp
All Multi-Series Query
One hour window | Every hostname
Measurements every 60 second interval
60. Postgres for Metrics
SELECT DISTINCT
JSONB_OBJECT_KEYS(dimensions)
AS d_name
FROM
measurements
WHERE
name = 'cpu.percent'
Dimension Name List Query
(for specific metric)
61. Postgres for Metrics
SELECT DISTINCT
dimensions ->> 'hostname'
AS d_value
FROM
measurements
WHERE
name = 'cpu.percent'
AND dimensions ? 'hostname'
Dimension Value List Query
(for specific metric and dimension)
63. Postgres for Metrics
CREATE TABLE measurements (
timestamp TIMESTAMPTZ,
name VARCHAR,
value FLOAT8,
dimensions JSONB,
value_meta JSON
);
CREATE INDEX ON measurements
(name, timestamp);
CREATE INDEX ON measurements USING GIN
(dimensions);
Indexes
Cover all necessary query terms
Using a single GIN index saves space, but is slower
64. Postgres for Metrics
● Series Queries
● All, Group, Specific
● Varying Time Window/Interval
5m|15s, 1h|15s, 1h|300s, 6h|300s, 24h|300s
● Listing Queries
● Metric Names, Dimension Names & Values
● All, Partial
65. Postgres for Metrics
[Bar chart: duration (ms) of "Denormalised" series queries for Single, Group and All queries, across 5m (15s), 1h (15s), 1h (300s), 6h (300s) and 24h (300s) windows; x-axis 0-12000 ms]
66. Postgres for Metrics
[Bar chart: duration (ms) of "Denormalised" series queries for Single, Group and All queries, across the same windows; x-axis 0-2500 ms]
69. Postgres for Metrics
CREATE TABLE measurement_values (
timestamp TIMESTAMPTZ,
metric_id INT,
value FLOAT8,
value_meta JSON
);
CREATE TABLE metrics (
id SERIAL,
name VARCHAR,
dimensions JSONB
);
Normalised Schema
Reduces duplication of data
Pre-built set of distinct metric definitions
70. Postgres for Metrics
CREATE FUNCTION get_metric_id (in_name VARCHAR, in_dims JSONB)
RETURNS INT LANGUAGE plpgsql AS $_$
DECLARE
out_id INT;
BEGIN
SELECT id INTO out_id FROM metrics AS m
WHERE m.name = in_name AND m.dimensions = in_dims;
IF NOT FOUND THEN
INSERT INTO metrics ("name", "dimensions") VALUES
(in_name, in_dims) RETURNING id INTO out_id;
END IF;
RETURN out_id;
END; $_$;
Normalised Schema
Function to use at insert time
Finds existing metric_id or allocates new
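A hypothetical ingest statement using this function (assumed; the slides do not show the insert path):
INSERT INTO measurement_values (timestamp, metric_id, value, value_meta)
VALUES (
TO_TIMESTAMP(1232141412), -- epoch seconds from the JSON input
get_metric_id('cpu.percent', '{"hostname": "dev-01"}'::JSONB),
42,
'{}'::JSON
);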
71. Postgres for Metrics
CREATE VIEW measurements AS
SELECT *
FROM measurement_values
INNER JOIN
metrics ON (metric_id = id);
CREATE INDEX metrics_idx ON
metrics (name, dimensions);
CREATE INDEX measurements_idx ON
measurement_values (metric_id, timestamp);
Normalised Schema
Same queries, use view to join
Extra index to help normalisation step
77. Postgres for Metrics
CREATE TABLE summary_values_5m (
timestamp TIMESTAMPTZ,
metric_id INT,
value_sum FLOAT8,
value_count FLOAT8,
value_min FLOAT8,
value_max FLOAT8,
UNIQUE (metric_id, timestamp)
);
Summarised Schema
Pre-compute every 5m (300s) interval
Functions to be applied must be known
78. Postgres for Metrics
CREATE FUNCTION update_summarise () RETURNS TRIGGER
LANGUAGE plpgsql AS $_$
BEGIN
INSERT INTO summary_values_5m VALUES (
TIME_ROUND(NEW.timestamp, 300), NEW.metric_id,
NEW.value, 1, NEW.value, NEW.value)
ON CONFLICT (metric_id, timestamp)
DO UPDATE SET
value_sum = value_sum + EXCLUDED.value_sum,
value_count = value_count + EXCLUDED.value_count,
value_min = LEAST (value_min, EXCLUDED.value_min),
value_max = GREATEST(value_max, EXCLUDED.value_max);
RETURN NULL;
END; $_$;
Summarised Schema
Entry for each metric/rounded time period
Update existing entries by aggregating
79. Postgres for Metrics
CREATE TRIGGER update_summarise_trigger
AFTER INSERT ON measurement_values
FOR EACH ROW
EXECUTE PROCEDURE update_summarise ();
CREATE VIEW summary_5m AS
SELECT *
FROM
summary_values_5m INNER JOIN metrics
ON (metric_id = id);
Summarised Schema
Trigger applies row to summary table
View mainly for convenience when querying
80. Postgres for Metrics
SELECT
TIME_ROUND(timestamp, 300) AS timestamp,
AVG(value) AS avg
FROM
measurements
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T06:00:00Z'
AND name = 'cpu.percent'
GROUP BY
1 -- the rounded timestamp
Combined Series Query
Six hour window | Every hostname
Measurements every 300 second interval
81. Postgres for Metrics
SELECT
TIME_ROUND(timestamp, 300) AS timestamp,
SUM(value_sum) / SUM(value_count) AS avg
FROM
summary_5m
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T06:00:00Z'
AND name = 'cpu.percent'
GROUP BY
1 -- the rounded timestamp
Combined Series Query
Use pre-aggregated summary table
Mostly the same; extra fiddling for AVG
87. Postgres for Metrics
● Need coarser summaries for wider
queries (e.g. 30m summaries)
● Need to partition data by day to:
● Retain ingest rate due to indexes
● Optimise dropping old data
● Much better ways to produce summaries
to optimise ingest, specifically:
● Process rows in batches of interval size
● Process asynchronously to the ingest transaction
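As a sketch of the partitioning point, assuming declarative partitioning (PostgreSQL 10+; the talk predates it and may have used inheritance-based partitions instead):
CREATE TABLE measurement_values (
timestamp TIMESTAMPTZ,
metric_id INT,
value FLOAT8,
value_meta JSON
) PARTITION BY RANGE (timestamp);
CREATE TABLE measurement_values_20150101
PARTITION OF measurement_values
FOR VALUES FROM ('2015-01-01') TO ('2015-01-02');
-- Dropping a day of old data becomes a cheap metadata operation:
DROP TABLE measurement_values_20150101;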
90. Postgres for Log Searching
Requirements
● Central log storage
● Trivially searchable
● Time bounded
● Filter ‘dimensions’
● Interactive query
times (<100ms)
91. Postgres for Log Searching
"log": {
"timestamp": 1232141412,
"message":
"Connect from 172.16.8.1:52690 to 172.16.8.10:5000 (keystone/HTTP)",
"dimensions": {
"severity": 6,
"facility": 16,
"pid": "39762",
"program": "haproxy"
"hostname": "dev-controller-0"
},
}
Log Ingest Format
Typically sourced from rsyslog
Varying set of dimensions key/values
92. Postgres for Log Searching
CREATE TABLE logs (
timestamp TIMESTAMPTZ,
message VARCHAR,
dimensions JSONB
);
Basic Schema
Straightforward mapping of source data
Allow for maximum dimension flexibility
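A sketch (not from the slides) of how the slide 91 log entry lands in this table:
INSERT INTO logs (timestamp, message, dimensions)
VALUES (
TO_TIMESTAMP(1232141412),
'Connect from 172.16.8.1:52690 to 172.16.8.10:5000 (keystone/HTTP)',
'{"severity": 6, "facility": 16, "pid": "39762",
"program": "haproxy", "hostname": "dev-controller-0"}'::JSONB
);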
93. Postgres for Log Searching
connection AND program:haproxy
Query Example
Kibana/Elastic style using PG-FTS
SELECT *
FROM logs
WHERE
TO_TSVECTOR('english', message)
@@ TO_TSQUERY('connection')
AND dimensions @> '{"program":"haproxy"}';
94. Postgres for Log Searching
CREATE INDEX ON logs
USING GIN
(TO_TSVECTOR('english', message));
CREATE INDEX ON logs
USING GIN
(dimensions);
Indexes
Enables fast text search on ‘message’
& Fast filtering based on ‘dimensions’
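The requirements also call for time-bounded search; a sketch combining all three filters (an index or partitioning on timestamp, not shown on these slides, would also be needed):
SELECT *
FROM logs
WHERE
timestamp BETWEEN '2015-01-01T00:00:00Z'
AND '2015-01-01T01:00:00Z'
AND TO_TSVECTOR('english', message)
@@ TO_TSQUERY('connection')
AND dimensions @> '{"program": "haproxy"}';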
103. Postgres for Log Parsing
CREATE TABLE logs (
timestamp TIMESTAMPTZ,
message VARCHAR,
dimensions JSONB
);
Log Schema – Goals:
Parse message against set of patterns
Add extracted information as dimensions
104. Postgres for Log Parsing
Patterns Table
Store pattern to match and field names
CREATE TABLE patterns (
regex VARCHAR,
field_names VARCHAR[]
);
INSERT INTO patterns (regex, field_names) VALUES (
'Connect from '
|| '(\d+\.\d+\.\d+\.\d+):(\d+) to (\d+\.\d+\.\d+\.\d+):(\d+)'
|| ' \((\w+)/(\w+)\)',
'{src_ip,src_port,dest_ip,dest_port,service,protocol}'
);
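A quick sanity check of the pattern against the slide 91 message (a sketch, not from the slides):
SELECT REGEXP_MATCHES(
'Connect from 172.16.8.1:52690 to 172.16.8.10:5000 (keystone/HTTP)',
regex)
FROM patterns;
-- returns {172.16.8.1,52690,172.16.8.10,5000,keystone,HTTP}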
105. Postgres for Log Parsing
Log Processing
Apply all configured patterns to new rows
CREATE FUNCTION process_log () RETURNS TRIGGER
LANGUAGE PLPGSQL AS $_$
DECLARE
m JSONB; p RECORD;
BEGIN
FOR p IN SELECT * FROM patterns LOOP
m := JSONB_OBJECT(p.field_names,
REGEXP_MATCHES(NEW.message, p.regex));
IF m IS NOT NULL THEN
NEW.dimensions := NEW.dimensions || m;
END IF;
END LOOP;
RETURN NEW;
END; $_$;
106. Postgres for Log Parsing
CREATE TRIGGER process_log_trigger
BEFORE INSERT ON logs
FOR EACH ROW
EXECUTE PROCEDURE process_log ();
Log Processing Trigger
Apply patterns to messages and extend
dimensions as they are inserted into the logs table
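With the example pattern above, inserting the slide 91 haproxy line would leave dimensions extended roughly as follows (illustrative, not from the slides):
-- dimensions after the trigger fires:
-- {"severity": 6, "facility": 16, "pid": "39762",
--  "program": "haproxy", "hostname": "dev-controller-0",
--  "src_ip": "172.16.8.1", "src_port": "52690",
--  "dest_ip": "172.16.8.10", "dest_port": "5000",
--  "service": "keystone", "protocol": "HTTP"}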
109. Requirements
● Offload data burden
from producers
● Persist as soon as
possible to avoid loss
● Handle high velocity
burst loads
● Data does not need
to be queryable
Postgres for Queueing
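The implementation slides for this section are not in the transcript; a common Postgres queueing pattern that fits these requirements (a sketch, not necessarily the talk's approach) uses SKIP LOCKED so consumers can claim batches without blocking one another:
-- Producers persist events as soon as possible:
CREATE TABLE queue (
id BIGSERIAL PRIMARY KEY,
payload JSONB
);
INSERT INTO queue (payload) VALUES ('{"event": "..."}'::JSONB);
-- Consumers claim and remove a batch; SKIP LOCKED (PostgreSQL 9.5+)
-- lets concurrent consumers proceed without waiting on each other:
DELETE FROM queue
WHERE id IN (
SELECT id FROM queue
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED
)
RETURNING payload;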
116. Conclusion… ?
● I view Postgres as a very flexible
“data persistence toolbox”
● ...which happens to use SQL
● Batteries not always included
● That doesn’t mean it’s hard
● Operational advantages of using
general purpose tools can be huge
● Use & deploy what you know & trust