Dealing with large databases is always a challenge. Backup and HA procedures must evolve as the database installation grows over time.
The talk will cover the problems solved by the DBA in four years of working with large databases, whose size increased from a 1.7 TB single cluster up to 40 TB in a multi-shard environment.
The talk will cover both disaster recovery with pg_dump and high availability with log shipping/streaming replication.
PostgreSQL - backup and recovery with large databases (Federico Campoli)
Life on a rollercoaster, backup and recovery with large databases
The presentation is based on a real story. The names have been changed to protect the innocent.
Slides from the Brighton PostgreSQL meetup presentation. An all-around PostgreSQL exploration: the rocky physical layer, the treacherous MVCC swamp and the buffer manager’s garden.
The ninja elephant, scaling the analytics database in Transferwise (Federico Campoli)
Business intelligence and analytics are at the core of any great company, and Transferwise is no exception.
The talk will start with a brief history of the legacy analytics implemented with MySQL and how we scaled up performance using PostgreSQL. In order to get fresh data from the core MySQL databases in real time, we used a modified version of pg_chameleon which also obfuscated the PII data.
The talk will also cover the challenges and the lessons learned by the developers and analysts when bridging MySQL with PostgreSQL.
The document discusses PostgreSQL's internal architecture and components. It describes the data area, which stores data files on disk, and key directories like pg_xlog for write-ahead logs. It explains the buffer cache and clock sweep algorithm for managing memory, and covers the multi-version concurrency control (MVCC) which allows simultaneous transactions. TOAST storage is also summarized, which stores large data values externally.
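As a concrete illustration of the clock-sweep idea summarised above, here is a minimal Python sketch; it models only the usage-count/second-chance mechanics, not PostgreSQL's actual C implementation with buffer pins and locks.

```python
# Minimal sketch of a clock-sweep buffer eviction policy.
# Illustration only; the real buffer manager is C code with
# pins, locks and a capped usage_count.

class Buffer:
    def __init__(self, page):
        self.page = page
        self.usage_count = 1   # bumped on every access

class ClockSweep:
    def __init__(self, size):
        self.buffers = [None] * size
        self.hand = 0

    def evict(self):
        """Advance the clock hand until a slot with usage_count == 0 is found."""
        while True:
            victim = self.buffers[self.hand]
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.buffers)
            if victim is None or victim.usage_count == 0:
                return slot
            victim.usage_count -= 1   # give the buffer a "second chance"

    def access(self, page):
        # A hit bumps the usage count; a miss evicts a victim slot.
        for buf in self.buffers:
            if buf is not None and buf.page == page:
                buf.usage_count += 1
                return buf
        slot = self.evict()
        self.buffers[slot] = Buffer(page)
        return self.buffers[slot]

cache = ClockSweep(size=4)
for p in [1, 2, 3, 1, 4, 5, 1]:
    cache.access(p)
print([(b.page, b.usage_count) for b in cache.buffers if b])
```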
The document discusses PostgreSQL and its capabilities. It describes how PostgreSQL was created in 1982 and became open source in 1996. It discusses PostgreSQL's support for large databases, high-performance transactions using MVCC, ACID compliance, and its ability to run on most operating systems. The document also covers PostgreSQL's JSON and NoSQL capabilities and provides performance comparisons of JSON, JSONB and text fields.
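To make the JSON/JSONB comparison above tangible, a small psycopg2 sketch follows; the connection string and table are placeholders. jsonb is parsed once at write time, which is where its query-speed advantage over json/text comes from.

```python
# Sketch of PostgreSQL JSON/JSONB usage via psycopg2.
# Connection parameters are placeholders; adjust for your environment.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id serial PRIMARY KEY,
        body_json  json,
        body_jsonb jsonb
    )
""")
doc = {"name": "postgres", "born": 1996, "tags": ["acid", "mvcc"]}
cur.execute(
    "INSERT INTO docs (body_json, body_jsonb) VALUES (%s, %s)",
    (Json(doc), Json(doc)),
)
# The ->> operator extracts a field as text; on jsonb this avoids
# re-parsing the whole document on every access.
cur.execute("SELECT body_jsonb ->> 'name' FROM docs")
print(cur.fetchone()[0])
conn.commit()
```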
This document is an introduction to PostgreSQL presented by Federico Campoli to the Brighton PostgreSQL Users Group. It covers the history and development of PostgreSQL, its features including data types, JSON/JSONB support, and performance comparisons. The presentation includes sections on the history of PostgreSQL, its features and capabilities, NOSQL support using JSON/JSONB, and concludes with a wrap-up on PostgreSQL and related projects.
The document discusses backup and recovery strategies in PostgreSQL. It describes logical backups using pg_dump, which takes a snapshot of the database and outputs SQL scripts or custom files. It also describes physical backups using write-ahead logging (WAL) archiving and point-in-time recovery (PITR). With WAL archiving enabled, PostgreSQL archives WAL files, allowing recovery to any point between backups by restoring the backup files and replaying the WAL logs. The document provides steps for performing PITR backups, including starting the backup, copying files, stopping the backup, and recovery by restoring files and using a recovery.conf file.
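As a concrete sketch of the logical-backup path described above, the following Python snippet drives pg_dump and pg_restore; the database name and paths are illustrative only.

```python
# Sketch: logical backup and restore with pg_dump/pg_restore.
# Database names and paths are examples only.
import subprocess

# Custom-format dump (-Fc): compressed, and restorable object-by-object.
subprocess.run(
    ["pg_dump", "-Fc", "-f", "/backup/salesdb.dmp", "salesdb"],
    check=True,
)

# Restore into a freshly created database, using 4 parallel jobs.
subprocess.run(["createdb", "salesdb_restored"], check=True)
subprocess.run(
    ["pg_restore", "-j", "4", "-d", "salesdb_restored", "/backup/salesdb.dmp"],
    check=True,
)
```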
The document discusses PostgreSQL's physical storage structure. It describes the various directories within the PGDATA directory that stores the database, including the global directory containing shared objects and the critical pg_control file, the base directory containing numeric files for each database, the pg_tblspc directory containing symbolic links to tablespaces, and the pg_xlog directory which contains write-ahead log (WAL) segments that are critical for database writes and recovery. It notes that tablespaces allow spreading database objects across different storage devices to optimize performance.
pg_chameleon is a lightweight replication system written in Python. The tool connects to the MySQL replication protocol and replicates the data in PostgreSQL.
The talk will cover the history, the logic behind the available functions, and give an interactive usage example.
The document discusses PostgreSQL query planning and tuning. It covers the key stages of query execution including syntax validation, query tree generation, plan estimation, and execution. It describes different plan nodes like sequential scans, index scans, joins, and sorts. It emphasizes using EXPLAIN to view and analyze the execution plan for a query, which can help identify performance issues and opportunities for optimization. EXPLAIN shows the estimated plan while EXPLAIN ANALYZE shows the actual plan after executing the query.
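The EXPLAIN workflow the summary describes is easy to script; here is a minimal psycopg2 sketch (the table and predicate are invented for the example):

```python
# Sketch: inspecting a query plan with EXPLAIN / EXPLAIN ANALYZE.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

query = "SELECT * FROM orders WHERE customer_id = 42"

# Estimated plan only: the query is NOT executed.
cur.execute("EXPLAIN " + query)
for (line,) in cur.fetchall():
    print(line)

# Actual plan: the query IS executed, so timings and row counts are real.
# Only run EXPLAIN ANALYZE on destructive statements inside a transaction
# you are prepared to roll back.
cur.execute("EXPLAIN ANALYZE " + query)
for (line,) in cur.fetchall():
    print(line)
```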
pg_chameleon is a lightweight replication system written in Python. The tool connects to the MySQL replication protocol and replicates the data in PostgreSQL.
The history, the logic and the future of the tool.
This document discusses PostgreSQL point-in-time recovery (PITR). It explains that to enable PITR, the archive_mode must be enabled, WAL archiving must occur, and backups of the data directory and WAL archives are needed. During recovery, the data directory is restored, a recovery.conf file is created to set the restore_command and recovery target, and WAL files are replayed to recover to the desired point in time.
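To make the recipe concrete, here is a sketch that writes a minimal recovery.conf (valid for PostgreSQL versions before 12; the archive path and recovery target are placeholders):

```python
# Sketch: writing a minimal recovery.conf for point-in-time recovery
# (PostgreSQL < 12; newer versions use postgresql.conf + recovery.signal).
# The archive path and the recovery target below are placeholders.
from pathlib import Path

pgdata = Path("/var/lib/postgresql/9.6/main")
recovery_conf = """\
# Fetch archived WAL segments; %f is the file name, %p the target path.
restore_command = 'cp /mnt/wal_archive/%f "%p"'
# Stop replaying WAL at this point in time.
recovery_target_time = '2017-03-01 12:00:00'
"""
(pgdata / "recovery.conf").write_text(recovery_conf)
# After restoring the base backup into PGDATA and writing this file,
# starting the server replays the archived WAL up to the target.
```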
The document outlines an introduction to databases presentation using PostgreSQL. It includes an introduction to databases concepts, an overview of PostgreSQL, demonstrations of SQL commands like CREATE TABLE, INSERT, SELECT and JOIN in psql, and discussions of database administration and GUI tools. Exercises are provided for attendees to practice the concepts covered.
pg_chameleon MySQL to PostgreSQL replica made easy (Federico Campoli)
pg_chameleon is a lightweight replication system written in Python. The tool can connect to the MySQL replication protocol and replicate the data changes in PostgreSQL.
Whether the user needs to set up a permanent replica between MySQL and PostgreSQL or perform an engine migration, pg_chameleon is the perfect tool for the job.
The talk will cover the history, the current implementation and the future releases.
The audience will learn how to set up a replica from MySQL to PostgreSQL in a few easy steps. There will also be coverage of the lessons learned during the tool’s development cycle.
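The core mechanism behind such a replica (streaming the MySQL binlog and replaying row events in PostgreSQL) can be outlined with the python-mysql-replication library that pg_chameleon builds on. This is a conceptual sketch, not pg_chameleon's actual code; all connection settings are placeholders.

```python
# Conceptual sketch of MySQL -> PostgreSQL change streaming, in the
# spirit of pg_chameleon. NOT the tool's actual code.
import psycopg2
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

mysql_settings = {"host": "127.0.0.1", "port": 3306,
                  "user": "replica", "passwd": "secret"}

pg = psycopg2.connect("dbname=replica_db user=postgres")
cur = pg.cursor()

# server_id must be unique among the MySQL replicas.
stream = BinLogStreamReader(connection_settings=mysql_settings,
                            server_id=100,
                            only_events=[WriteRowsEvent],
                            blocking=True)

for event in stream:
    for row in event.rows:
        values = row["values"]            # dict: column -> value
        cols = ", ".join(values)
        params = ", ".join(["%s"] * len(values))
        # Naive apply step; the real tool batches changes, maps types
        # and obfuscates PII before writing.
        cur.execute(
            f"INSERT INTO {event.schema}.{event.table} ({cols}) VALUES ({params})",
            list(values.values()),
        )
        pg.commit()
```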
Using Elasticsearch as the Primary Data Store (Volkan Yazıcı)
The biggest e-commerce company in the Netherlands and Belgium, bol.com, set out on a 4-year journey to rethink and rebuild their entire ETL (Extract, Transform, Load) pipeline, that has been cooking up the data used by its search engine since the dawn of time. This more than a decade old white-bearded giant, breathing in the dungeons of shady Oracle PL/SQL hacks, was in a state of decay, causing ever-increasing hiccups on production. A rewrite was inevitable. After drafting many blueprints, we went for a Java service backed by Elasticsearch as the primary storage! This idea brought shivers to even the most senior Elasticsearch consultants hired, so to ease your mind I’ll walk you through why we took such a radical approach and how we managed to escape our legacy.
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
Deep Postgres Extensions in Rust | PGCon 2019 | Jeff Davis (Citus Data)
Postgres relies heavily on an extension ecosystem, but that is almost 100% dependent on C, which cuts out developers, libraries, and ideas from the world of Postgres. postgres-extension.rs changes that by supporting development of extensions in Rust. Rust is a memory-safe language that integrates nicely in any environment, has powerful libraries, a vibrant ecosystem, and a prolific developer community.
Rust is a unique language because it supports high-level features but all the magic happens at compile-time, and the resulting code is not dependent on an intrusive or bulky runtime. That makes it ideal for integrating with postgres, which has a lot of its own runtime, like memory contexts and signal handlers. postgres-extension.rs offers this integration, allowing the development of extensions in rust, even if deeply-integrated into the postgres internals, and helping handle tricky issues like error handling. This is done through a collection of Rust function declarations, macros, and utility functions that allow rust code to call into postgres, and safely handle resulting errors.
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak... (Citus Data)
One of the strongest features of any database is its extensibility, and PostgreSQL comes with a rich extension API. It allows you to define new functions, types, and operators. It even allows you to modify some of its core parts like the planner, executor or storage engine. You read it right: you can even change the behavior of the PostgreSQL planner. How cool is that?
Such freedom in extensibility created a strong extension community around PostgreSQL and made way for a vast number of extensions such as pg_stat_statements, citus, postgresql-hll and many more.
In this tutorial, we will look at how you can create your own PostgreSQL extension. We will start with more common stuff like defining new functions and types but gradually explore lesser-known parts of PostgreSQL's extension API, like C-level hooks, which let you change the behavior of the planner, executor and other core parts of PostgreSQL. We will see how to code, debug, compile and test our extension. After that, we will also look into how to package and distribute our extension for other people to use.
To get the most out of the tutorial, C and SQL knowledge would be beneficial. Some knowledge of PostgreSQL internals would also be useful, but we will cover the necessary details, so it is not required.
Data Modeling, Normalization, and De-Normalization | PostgresOpen 2019 | Dimi... (Citus Data)
As a developer using PostgreSQL one of the most important tasks you have to deal with is modeling the database schema for your application. In order to achieve a solid design, it’s important to understand how the schema is then going to be used as well as the trade-offs it involves.
As Fred Brooks said: “Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”
In this talk we're going to see practical normalisation examples and their benefits, and also review some anti-patterns and their typical PostgreSQL solutions, including denormalisation techniques thanks to advanced data types.
London Spark Meetup Project Tungsten Oct 12 2015 (Chris Fregly)
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten, which includes many CPU and memory optimizations.
Softshake 2013: Introduction to NoSQL with Couchbase (Tugdual Grall)
This presentation was delivered during Softshake 2013. Learn why RDBMS are not enough and why NoSQL helps developers to scale their applications and provide agility.
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi (Databricks)
The document discusses using CNTK (Microsoft Cognitive Toolkit) for natural language processing and deep learning within Spark pipelines. It provides information on mmlspark, which allows embedding CNTK models into Spark. It also discusses using CNTK to analyze data from GitHub commits and relate code changes to natural language comments through sequence-to-sequence models.
Compare and contrast RDF triple stores and NoSQL: are triple stores NoSQL or not?
Talk given 2011-09-08 to the BigData/NoSQL meetup at Bristol University.
Pig is a useful tool for exploring data sets and describing data flows easily through scripts. However, it has some weaknesses like inconsistent syntax, lack of testing support, and an aging code base. While Pig works well for many companies, its suitability may decline as companies' data needs grow more complex with distributed teams of data scientists and engineers. For Pig to remain useful, it will need improvements in areas like types, performance, UDF support, and testing to better support "big data" teams in the future.
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database (Jimmy Angelakos)
Presentation of an investigation into how Python's RDFLib and SQLAlchemy can be used to leverage PostgreSQL's capabilities to provide a persistent storage back-end for Graphs, and become the elusive practical RDF triple store for the Semantic Web (or simply help you export your data to someone who's expecting RDF)!
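A rough sketch of the approach, assuming the rdflib-sqlalchemy store plugin is installed; the graph identifier and connection URI are placeholders.

```python
# Sketch: PostgreSQL as an RDF triple store via rdflib + rdflib-sqlalchemy.
# Assumes the rdflib-sqlalchemy plugin is installed; URI is a placeholder.
from rdflib import Graph, Literal, URIRef, plugin
from rdflib.namespace import FOAF
from rdflib.store import Store

ident = URIRef("urn:example:my_graph")
store = plugin.get("SQLAlchemy", Store)(identifier=ident)
g = Graph(store, identifier=ident)
g.open("postgresql://user:secret@localhost/rdfdb", create=True)

person = URIRef("http://example.org/alice")
g.add((person, FOAF.name, Literal("Alice")))
g.commit()

# Triples are now persisted in PostgreSQL tables managed by the store.
for s, p, o in g.triples((None, FOAF.name, None)):
    print(s, p, o)
g.close()
```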
Talk presented at FOSDEM 2017 in Brussels on 04-05/02/2017. Practical & hands-on presentation with example code which is certainly not optimal ;)
Video:
MP4: http://video.fosdem.org/2017/H.1309/postgresql_semantic_web.mp4
WebM/VP8: http://ftp.osuosl.org/pub/fosdem/2017/H.1309/postgresql_semantic_web.vp8.webm
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
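A minimal PySpark sketch of the CSV-to-Parquet workflow described above; paths are placeholders.

```python
# Sketch: converting CSV to Parquet with PySpark; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read a schema-on-read CSV...
df = spark.read.csv("/data/weather.csv", header=True, inferSchema=True)

# ...and write it as columnar, compressed Parquet. Column pruning and
# predicate pushdown then make typical analytical queries much faster.
df.write.mode("overwrite").parquet("/data/weather.parquet")

# Queries read only the columns and row groups they need.
spark.read.parquet("/data/weather.parquet") \
     .select("station", "temperature") \
     .where("temperature > 30") \
     .show()
```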
The paperback version is available on lulu.com: http://goo.gl/fraa8o
This is the first volume of the PostgreSQL database administration book. The book covers the steps for installing, configuring and administering PostgreSQL 9.3 on Debian GNU/Linux. It covers the logical and physical aspects of PostgreSQL. Two chapters are dedicated to the backup/restore topic.
Modern SQL in Open Source and Commercial Databases (Markus Winand)
SQL has gone out of fashion lately—partly due to the NoSQL movement, but mostly because SQL is often still used as it was 20 years ago. As a matter of fact, the SQL standard has continued to evolve during the past decades, resulting in the current release of 2016. In this session, we will go through the most important additions since the widely known SQL-92. We will cover common table expressions and window functions in detail and have a very short look at the temporal features of SQL:2011 and row pattern matching from SQL:2016.
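As a small taste of the post-SQL-92 features mentioned, here is a sketch combining a common table expression with a window function, run from Python; the payments table and its columns are invented for the example.

```python
# Sketch: a common table expression plus a window function (SQL:2003+),
# run from Python. Table and columns are invented for the example.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("""
    WITH monthly AS (
        SELECT date_trunc('month', paid_at) AS month,
               sum(amount)                  AS revenue
          FROM payments
         GROUP BY 1
    )
    SELECT month,
           revenue,
           -- running total: a window function, not a self-join
           sum(revenue) OVER (ORDER BY month) AS running_total
      FROM monthly
     ORDER BY month
""")
for row in cur.fetchall():
    print(row)
```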
Links:
http://modern-sql.com/
http://winand.at/
http://sql-performance-explained.com/
1. The document describes the production process of the paper industry, including raw materials, sub-processes, water pollution, effluent treatment and applicable environmental regulations. 2. The main sources of water pollution are the fibrous raw materials and the chemical additives used, which generate compounds such as organochlorines and dioxins. 3. Effluent treatment includes filters, settling tanks, aeration and anaerobic treatment to reduce parameters such as BOD,
Quite often "new" people are only "new" to Postgres. This is my summary of do's and don'ts when it comes to teaching Postgres and what to take note of, with an emphasis on teaching.
Our games at InnoGames are organised as many isolated worlds for limited numbers of players. All those run on their separate virtual machines. As the only database administrator, I am partially responsible for thousands of databases. Any manual operation on PostgreSQL servers is not an option for us. I am going to focus on automating database administration tasks like configuration, user management, version upgrades, backup and recovery, replication... I will give examples from Puppet, but the same methods can be applied to other configuration management systems.
PostgreSQL, performance for queries with grouping (Alexey Bashtanov)
The talk will cover PostgreSQL grouping and aggregation facilities and best practices for using them in a fast and efficient manner.
In 40 minutes the audience will learn several techniques to optimise queries containing GROUP BY, DISTINCT or DISTINCT ON keywords.
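One classic technique covered in talks like this is answering "latest row per key" with DISTINCT ON instead of a heavyweight GROUP BY; a sketch follows (the readings schema is invented for the example).

```python
# Sketch: "latest row per key" with PostgreSQL's DISTINCT ON.
# Schema and data are invented for the example.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# With an index on (sensor_id, recorded_at DESC) this can avoid a full
# GROUP BY aggregation over the whole table.
cur.execute("""
    SELECT DISTINCT ON (sensor_id)
           sensor_id, recorded_at, value
      FROM readings
     ORDER BY sensor_id, recorded_at DESC
""")
for sensor_id, recorded_at, value in cur.fetchall():
    print(sensor_id, recorded_at, value)
```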
The document presents information about the Instituto Cristo del Picacho, including its mission, vision, programs, enrollment requirements, facilities and its Bilingual Call Center Technician program. The program seeks to train young people in English, customer service and computing so that they can work in call centers. Students receive intensive English classes and academic support, and pay no tuition.
This document discusses technological advantages and disadvantages. Among the advantages are that technology enables communication, increases economic productivity, and fosters new pedagogical models. The disadvantages include security problems for society, addiction to technological tools, and environmental pollution.
Eddie Cairns has over 26 years of experience in installing, maintaining, and designing security systems. He currently works as a Security Systems Engineer for JBA Engineering, where he is responsible for project managing electronic security systems including CCTV, access control, fire, and intruder alarms. Previously, he held roles as a Service Engineer for Kings Security and ADT Fire and Security, where he installed and serviced security systems. Cairns has various qualifications in electrical engineering, business, and confined space entry.
This document provides a history of animation in Ireland from its origins to the present day. It discusses:
1) The pioneering early work of James Horgan in the 1900s, though his efforts proved to be a false dawn for animation in Ireland.
2) The establishment of animation studios in Ireland in the 1960s and 1980s, most notably Sullivan Bluth which became a major producer of animated features in the 1980s.
3) The contemporary Irish animation scene, which ranges from large studios to independent artists, and draws influence from both American and European styles.
4) Government support for animation through organizations like the Irish Film Board, which provides funding for shorts and features, helping build the industry.
Maphuti Mongatane has over 10 years of experience as an office manager. She is currently the office manager at Media 24 Lifestyle Department where she manages diaries, makes travel arrangements, files documentation, compiles presentations, processes invoices and payments, and more. Previously, she held office manager roles at Africa Extrabold/Ogilvy Advertising, Naledi Media24, and MCD Group. She has a secretarial diploma from Birnam Business College and an executive personal assistant certificate from Damelin.
Truong Nhat Ha Duyen is seeking an assistant director position. She has over 5 years of experience in roles including assistant director, English teacher, translator, and interpreter. She is proficient in English, Russian, Microsoft Office, and has strong communication and collaboration skills.
This document describes several versions of Microsoft's Windows operating system. It describes Windows XP features such as support for larger partitions, easy device recognition and Remote Desktop. It then summarizes Windows 7 features such as improvements in handwriting recognition, support for virtual hard disks and better performance. Finally, it summarizes that Windows 8 is designed for touch devices and has a new start menu and interface.
FSLogix BriForum 2015 - Ending the Folder Redirection Debate (FSLogix)
FSLogix Apps 2.0 "Profile Containers" eliminate the need for legacy profile management and folder redirection. This new architectural approach can improve user logon time by 90%+, reduce server queue loading and network traffic, and allow OSTs to be put back in the profile without any lag time or email performance issues.
Circumnavigating the Antarctic with Python and Django during ACE 2016 (Carles Pina Estany)
What we did and learnt, and how we built the data science system with Python and Django during the Antarctic Circumnavigation Expedition in 2016. Presented at PyCon UK 2017.
This document discusses various ways that Microsoft Azure storage and services can be used to enhance SQL Server deployments and provide additional capabilities. It begins with an introduction to Azure Blob storage and how it can be used for backups and disaster recovery. It then covers managed backups in Azure, using Azure storage for database files, configuring a cloud witness for failover clusters, hybrid partitioning to archive old data to Azure, stretch databases as an archival solution, and creating availability group replicas in Azure. The presentation provides examples and demos of configuring each of these capabilities with Azure and SQL Server.
DMU is the new tool introduced by Oracle for database conversion to the Unicode character set. Besides briefly introducing the tool, this session will focus on a real database conversion scenario faced by a customer, the problems encountered and the solutions.
Scale-Out Using Spark in Serverless Herd Mode! (Databricks)
Spark is a beast of a technology and can do amazing things, especially with large datasets. But some big data pipelines require processing the data in small chunks and running them through a large Spark cluster can be inefficient and expensive.
Online classified web site Leboncoin.fr is one of the success stories of the French Web. 1/3 of the total internet population in France uses the site each month. The growth has been spectacular and swift, and was made possible by a robust and performant software platform. At the heart of the platform is a large PostgreSQL infrastructure, part of it running on some of the largest PC-class hardware available. In this presentation, we will show how we have grown our infrastructure. In particular, the amazing vertical scalability of PG will be showcased with hard numbers (IOPS, transactions/seconds, etc). We will also cover some of the hard lessons we have learned along the way, including near-disasters. Finally, we will look into how innovative features from the PostgreSQL ecosystem enable new approaches to our scalability challenge.
Storing data in windows server 2012 ss (Kamil Bączyk)
This document provides an overview of file and storage services in Windows Server 2012 including Storage Spaces, Resilient File System (ReFS), and Data Deduplication. The presenter's name is Kamil Bączyk and he will demonstrate and discuss these features, their benefits and limitations, and new capabilities in Windows Server 2012 R2. The presentation will include demonstrations and allow time for questions.
This document discusses challenges with scaling applications and analyzing large volumes of data. It describes how problems have remained the same over 30 years, such as parsing, filtering, and analyzing large amounts of data, despite hardware advances. The document advocates using binary representations and serialization instead of standard Java objects to improve performance for tasks like data processing, distributed computing and analytics. It provides examples showing how this approach can significantly reduce latency and improve throughput.
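The serialization argument is easy to demonstrate in miniature; the following Python comparison contrasts a generic object serializer with a fixed binary layout (exact sizes vary by platform and library version).

```python
# Sketch: fixed-layout binary encoding vs. generic object serialization.
import pickle
import struct

record = {"user_id": 123456, "score": 0.87, "flags": 3}

# Generic object serialization carries type and structure metadata.
blob_pickle = pickle.dumps(record)

# A fixed binary layout stores just the values: one 8-byte integer,
# one 8-byte double, one 2-byte unsigned short.
blob_struct = struct.pack("=qdH", record["user_id"], record["score"],
                          record["flags"])

print(len(blob_pickle), len(blob_struct))   # e.g. roughly 60+ bytes vs. 18
uid, score, flags = struct.unpack("=qdH", blob_struct)
```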
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of few minutes and chaining of incremental processing in hadoop
Spanner: Google's Globally Distributed Database (Ahmedmchayaa)
Spanner is Google's globally distributed database that provides synchronous replication across data centers for strong consistency. It uses TrueTime to synchronize clocks across data centers and provide a consistent view of data to users. The architecture of Spanner involves splitting tables into shards called "splits" that are replicated across multiple zones for high availability. Transactions in Spanner are globally consistent yet remain highly available and partition tolerant, making Spanner a CA (Consistent and Available) system according to the CAP theorem.
PGDAY FR 2014: presentation of PostgreSQL at leboncoin.fr (jlb666)
This document discusses the use of PostgreSQL in Schibsted Classified Media's platform. Some key points:
- SCM uses PostgreSQL across 30+ countries, with over 100 servers storing 8TB of data and handling 50 million classified ads.
- Leboncoin.fr, the French classifieds site, is powered by this PostgreSQL-based platform. It receives 250 million page views per day from 5 million unique visitors.
- The database infrastructure includes high-performance servers with 2TB of RAM storing the 6TB production database. The read workload is offloaded to multiple read-only slaves.
- Despite caching, the master database still handles 600 transactions per second. Future scalability improvements may include sharding
OSDC 2018 | The Computer science behind a modern distributed data store by Ma... (NETWAYS)
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Most applications need a stateful layer which holds the data. There are at least three necessary ingredients, which are anything but trivial to combine, and of course even more challenging when heading for acceptable performance. Over the past years there has been significant progress in both the science and the practical implementation of such data stores. In his talk Max Neunhoeffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches of distributed open-source data stores.
Topics are:
– Challenges in developing a distributed, resilient data store
– Consensus, distributed transactions, distributed query optimization and execution
– The inner workings of ArangoDB, Cassandra, Cockroach and RethinkDB
The talk will touch on complex and difficult computer science, but will at the same time be accessible to and enjoyable by a wide range of developers.
by Chris Proto, DevOps Engineer, Craftsy
Craftsy is the leading online destination for passionate makers to learn, create, and share. With online classes, popular supplies and indie patterns, over ten million creative enthusiasts are taking their skills to new heights. By working with AWS and using data transfer services, Craftsy was able to bounce back from a massive storage outage that impacted numerous teams. By using AWS storage services, the company was able to minimize the outage and speed up data restore three-fold. Learn more by attending this session.
NHK Challenges for Preserving UHD Materials (HIRAKAZU), FIAT/IFTA
The document discusses challenges for preserving ultra-high definition (UHD) video materials at NHK, including 8K video. It notes the large file sizes of raw and compressed 8K video formats. NHK's current process for 8K production involves creating 2K proxy files for editing and higher resolution files for archiving. However, scaling archiving to store increasing amounts of 4K/8K content raises capacity issues. NHK is considering duplicating archives to cloud storage and testing migrating servers to the cloud to address these challenges.
Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra.
This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.
Database Configuration for Maximum SharePoint 2010 Performance (Edwin M Sarmiento)
Database configuration has a direct impact on how SharePoint 2010 performs. This presentation looks at the SQL Server database and what configuration changes can be made to maximize performance for your SharePoint 2010 farms
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy (Stuart Pook)
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real time bidding for online advertising.
Two clusters do not necessarily mean a redundant system, so Criteo must plan for any of the disasters that can destroy a cluster.
This talk describes how Criteo built its second cluster in a new datacenter and how to do it better next time. How a small team is able to run and expand these clusters is explained. More importantly the talk describes how a redundant data and compute solution at this scale must function, what Criteo has already done to create this solution and what remains undone.
Spark Pipelines in the Cloud with Alluxio with Gene Pang (Spark Summit)
Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines, where there is a series of processing stages, each stage performs a particular function, and the output of one stage is the input of the next. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages.

It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times.

Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this results in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance benefits, and how Alluxio improves completion times and reduces performance variability for Spark pipelines in the cloud.
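The data-sharing pattern described above amounts to pointing each pipeline stage at an alluxio:// URI instead of the cloud store. A sketch follows, assuming the Alluxio client is configured on the Spark classpath; the master address and paths are placeholders.

```python
# Sketch: sharing intermediate pipeline data through Alluxio from PySpark.
# Assumes the Alluxio client jar is on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-stage").getOrCreate()

# Stage 1: read raw data from cloud storage, write the intermediate
# result to Alluxio so it can stay in memory for the next stage.
raw = spark.read.json("s3a://my-bucket/logs/2017-02-01/")
cleaned = raw.filter(raw.status == 200).select("user_id", "url", "ts")
cleaned.write.parquet("alluxio://alluxio-master:19998/pipeline/cleaned")

# Stage 2 (possibly a separate job): read the shared data back at
# memory speed instead of re-fetching it from the cloud store.
cleaned2 = spark.read.parquet("alluxio://alluxio-master:19998/pipeline/cleaned")
print(cleaned2.count())
```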
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
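For a flavour of that Python binding, here is a minimal sketch assuming the pypowsybl package is installed; the calls follow the project's published API, but treat them as illustrative.

```python
# Sketch: loading and studying a network with PowSyBl's Python binding.
# Assumes the pypowsybl package is installed.
import pypowsybl as pp

# Start from a bundled example network (the IEEE 14-bus test case).
network = pp.network.create_ieee14()

# Run an AC power flow and inspect the result of the main component.
results = pp.loadflow.run_ac(network)
print(results[0].status)

# Network elements are exposed as pandas DataFrames.
print(network.get_buses().head())
print(network.get_lines()[["p1", "q1", "p2", "q2"]].head())
```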
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Life on a rollercoaster
1. Life on a rollercoaster
Scaling the PostgreSQL backup and recovery
Federico Campoli
Transferwise
2 November 2016
2. Table of contents
1 2012 - Who am I? What am I doing?
2 2013 - On thin ice
3 2014 - The battle of five armies
4 2015 - THIS IS SPARTA!
5 2016 - Breathe
6 Wrap up
3. Warning!
The story you are about to hear is true.
Only the names have been changed to protect the innocent.
4. Dramatis personae
A brilliant startup ACME
5. Dramatis personae
The clueless engineers CE
6. Dramatis personae
An elephant on steroids PG
7. Dramatis personae
The big cheese HW
8. Dramatis personae
The real hero DBA
9. In the beginning
Our story starts in the year 2012. The world was young and our DBA started a
brilliant new career at ACME.
After the usual onboarding period, our DBA was handed the production servers.
11. 2012 - Who am I? What am I doing?
12. Size does matter
PG, our powerful and friendly elephant, was used for storing the data in a multi
shard configuration.
Not really big actually, but very troubled indeed!
A small logger database - 50 GB
A larger configuration and auth database - 200 GB
Two archive databases - 4 TB each
One database for the business intelligence - 2 TB
Each database had a hot standby counterpart hosted on less powerful HW.
Our story tells the life of the BI database.
13. The carnival of monsters
In early 2013 our brave DBA addressed the several problems found in the
backup and recovery configuration.
14. Lagging standby
Suboptimal schema:
Churn on large tables and a high WAL generation rate
The slave lagged simply because autovacuum was running
rsync used in the archive command:
The WAL segments were archived over the network using rsync+ssh
The *.ready files piling up in pg_xlog increased the risk of a cluster crash.
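For illustration, the archive command probably looked something like this; the
standby host name and the archive path are assumptions, only the rsync+ssh
mechanism comes from the talk.
# postgresql.conf on the master (illustrative)
archive_command = 'rsync -a -e ssh %p standby:/var/lib/postgresql/wal_archive/%f'
# %p is the path of the completed WAL segment, %f its file name. A non-zero
# exit code makes PostgreSQL retry later, which is exactly how the *.ready
# files pile up in pg_xlog/archive_status when the network or the standby
# cannot keep up.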
15. Base backup
Rudimentary init standby script:
Just a pg_start_backup call followed by a rsync between the master and the
slave
The several tablespaces were synced using a single rsync process with the
--delete option
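A minimal sketch of that script, assuming the data area and all the tablespaces
live under /data and the standby is reachable as standby (both assumptions):
#!/bin/bash
# Rudimentary init standby: one hot backup, one rsync stream for everything.
psql -c "SELECT pg_start_backup('init_standby');"
# A single rsync process syncing the data area and every tablespace,
# --delete included: all the I/O is serialised in one stream.
rsync -a --delete /data/ standby:/data/
psql -c "SELECT pg_stop_backup();"
With multiple tablespaces on separate devices this single stream leaves most
of the disks idle, which is what the parallel version fixes later.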
16. Slow dump
Remote pg_dump:
Each cluster was dumped remotely to a separate server using the custom
format.
The backup server had limited memory and CPU
Dump time ranged between 3 hours and 2 days depending on the database size
The BI database was dumped on a daily basis, taking 14-18 hours.
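In other words, the nightly job boiled down to the one-liner below, executed
on the backup server; host and database names are made up.
pg_dump -h bi_db_master -Fc -f /backup/bi_db.dmp bi_db
# -Fc selects the custom format: the rows stream over the network in COPY
# format and the compression runs here, on the underpowered backup server.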
18. 2013 - On thin ice
19. Parallel rsync
Our DBA took the baby-step approach. He started fixing one issue at a time
without affecting ACME's activity.
The first enhancement was the init backup script (see the sketch after this list).
Two bash arrays listed the origin and destination tablespaces
An empty third bash array stored the rsync pids
The script issued the pg_start_backup call
For each tablespace an rsync process was spawned and its pid stored in
the third array
A loop checked that the pids were still present in the process list
When all the rsync processes had finished, pg_stop_backup was executed
An email was sent telling the DBA to start the slave
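A sketch of the parallel script, under the same path and host assumptions as
before; the polling loop over the process list is rendered here with the
equivalent wait built-in.
#!/bin/bash
# Parallel init standby: one rsync per tablespace.
SRC_TBS=(/data/main /data/ts_archive /data/ts_index)
DST_TBS=(standby:/data/main standby:/data/ts_archive standby:/data/ts_index)
PIDS=()
psql -c "SELECT pg_start_backup('init_standby');"
for i in "${!SRC_TBS[@]}"
do
    # spawn one rsync per tablespace and remember its pid
    rsync -a --delete "${SRC_TBS[$i]}/" "${DST_TBS[$i]}/" &
    PIDS+=($!)
done
for pid in "${PIDS[@]}"
do
    wait "$pid"   # returns when the pid leaves the process list
done
psql -c "SELECT pg_stop_backup();"
echo "all tablespaces synced" | mail -s "init standby done, start the slave" dba@acme.example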
20. Local archive
The high rate of WAL generation required a different archive strategy.
The archive command changed to a local copy
A simple rsync script copied the archived segments to the slave every minute
The script queried the slave remotely for its last restartpoint
The restartpoint was passed to pg_archivecleanup on the master
Implementing this solution solved the *.ready files problem, but autovacuum
still caused high lag.
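A minimal sketch of the one-minute job; the talk does not say how the
restartpoint was read, so querying pg_controldata on the slave over ssh is an
assumption.
#!/bin/bash
# Ship the locally archived WAL to the slave, then trim the local archive.
ARCHIVE=/var/lib/postgresql/wal_archive
rsync -a "${ARCHIVE}/" slave:/var/lib/postgresql/wal_archive/
# The REDO WAL file of the slave's latest restartpoint is the oldest
# segment the slave can still ask for.
KEEP=$(ssh slave "pg_controldata /var/lib/postgresql/9.2/main" \
       | awk '/REDO WAL file/ {print $NF}')
pg_archivecleanup "${ARCHIVE}" "${KEEP}"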
21. Autovacuum tune down
The DBA investigated the autovacuum issue and finally addressed the cause.
The high lag on the slave appeared when autovacuum (or a manual vacuum) hit a
table which was concurrently updated. This behaviour is normal and comes from
the design of the standby code.
With large denormalised tables which are updated constantly, the only workaround
possible was to increase the autovacuum cost delay to a large value (1 second or
more).
When the autovacuum process reached the configured cost during execution,
it slept for a minute before the activity resumed.
The lag on the standbys disappeared, at the cost of longer autovacuum runs.
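The exact settings used at ACME are not in the talk; per-table storage
parameters are one way to throttle only the troublesome tables, for example:
# hypothetical table name; the delay is in milliseconds
psql -d bi_db -c "ALTER TABLE big_denormalised_table SET
  (autovacuum_vacuum_cost_delay = 100, autovacuum_vacuum_cost_limit = 200);"
# every time the worker accumulates the cost limit it sleeps for the delay,
# trading vacuum speed for a quieter WAL stream and a happier standby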
22. In the meanwhile...
The CE decided to shard the business intelligence database using the hot
standby copy
The three new databases initially held the same amount of data, which was
slowly cleaned up later
But even with one third of the data on each shard, the daily dump was so slow
that it overlapped the 24 hour window
23. A slowish dump
pg_dump connects to the running cluster like any other backend; it pulls the
data out using the COPY format
With the custom format the compression happens on the server where
pg_dump runs
The backup server was hammered on both network and CPU
24. You got speed!
Our DBA wrote a bash script performing the following steps (a sketch follows
the list):
Dump the database locally in custom format
Generate the file's md5 checksum
Ship the file to the backup server via rsync
Verify the remote file's md5
Send a success or failure message to nagios
The backup time per cluster dropped dramatically to just 5 hours, including
the copy and the checksum verification.
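A sketch of that script; md5sum for the checksum and send_nsca for the nagios
notification are plausible choices, not details from the talk.
#!/bin/bash
set -e
DUMP=/backup/bi_db_$(date +%Y%m%d).dmp
pg_dump -Fc -f "${DUMP}" bi_db                    # 1. dump locally, custom format
md5sum "${DUMP}" > "${DUMP}.md5"                  # 2. local checksum
rsync -a "${DUMP}" "${DUMP}.md5" backup:/backup/  # 3. ship both files
# 4. verify the remote copy and 5. tell nagios how it went
if ssh backup "cd /backup && md5sum -c $(basename "${DUMP}").md5"
then RC=0; MSG="bi_db backup OK"
else RC=2; MSG="bi_db backup FAILED"
fi
printf '%s\tbi_backup\t%s\t%s\n' "$(hostname)" "${RC}" "${MSG}" \
    | send_nsca -H nagios.acme.example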
25. Growing pains
Despite the business growth, the CE ignored the problems caused by the poor
schema design.
Speed was achieved by brute force, using expensive SSD storage
The amount of data stored in the BI db kept increasing
The only accepted solution was to create new shards over and over again
By the end of 2013 the BI databases' total size was 15 TB
26. In the meanwhile...
Our DBA upgraded all the PG clusters to version 9.2 with pg_upgrade.
THANKS BRUCE!!!!!
28. 2014 - The battle of five armies
29. Bloating data
Q1 2014 opened with another backup performance issue:
The dump size increased over time
The database CPU usage grew constantly for no apparent reason
Most of the shards had tablespace usage at 90%
32. Mostly harmless
Against all odds, our DBA tracked down the issue.
The table used by the BI database was causing the bloat
The table's design was technically a materialised view
The table was partitioned, after a fashion
The table had a harmless hstore field
Where everybody added new keys just by changing the app code
And nobody did any housekeeping of their data
The row length jumped from 200 bytes to 1,200 bytes in a few months
Each BI shard contained up to 2 billion rows...
33. Just a little trick
Despite the impending doom and the CE's resistance, the DBA succeeded in
converting the hstore field into conventional columns (SORRY OLEG!).
The storage usage dropped by 30%
The CPU usage dropped by 60%
The speed of ACME's product got a boost
ACME saved $BIG_BUNCH_OF_MONEY in new HW otherwise required to
shard the dying databases again
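The migration itself is not shown in the talk; in its simplest form, promoting
one hstore key to a real column looks like the hypothetical snippet below
(table, key and type are made up, and on tables this size the UPDATE would
have to run in batches).
psql -d bi_db <<'SQL'
ALTER TABLE bi_fact ADD COLUMN conversion_rate numeric;
UPDATE bi_fact SET conversion_rate = (attributes -> 'conversion_rate')::numeric;
-- once every key has been promoted, the hstore column can go
ALTER TABLE bi_fact DROP COLUMN attributes;
SQL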
34. In the meanwhile...
The DBA knew the fix was just a workaround
He asked the CE to help him with the schema redesign
He told them things would become problematic again within just one year
Nobody listened
36. 2015 - THIS IS SPARTA!
37. I hate to say that, but I told you so
As predicted by our DBA, the time required for backing up the BI databases
increased again, dangerously approaching the 24 hour mark.
38. Parallel is da way!
Back in 2013, PG 9.3 added the parallel dump. But, to the DBA's great
disappointment, version 9.3 was initially cursed by bugs causing data
corruption, so the parallel dump could not be used.
However...
The parallel dump takes advantage of the snapshot export introduced in
PG 9.2
The Debian packaging allows different PG major versions on the same
machine
The DBA installed the 9.3 client and used its pg_dump to dump the 9.2
clusters in parallel (see the sketch below)
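On Debian the trick boils down to calling the versioned binary directly; the
database name is an assumption, and the -j value matches the one the DBA
eventually settled on.
# pg_dump 9.3 can run a parallel dump against a 9.2 server because the
# server already supports exported snapshots
/usr/lib/postgresql/9.3/bin/pg_dump -h bi_db_master -Fd -j 4 \
    -f /backup/bi_db.dir bi_db
# -Fd: directory format, mandatory for parallel dumps; -j: number of jobs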
39. It worked very well...
The wrapper script required some adjustments:
Accept the -j parameter
Check whether a 9.3+ client is installed
Override the format to directory when the parallel dump is possible
Adapt the checksum procedure to check the files in the dump directory
40. ...with just a little infinitesimal catch
All fine, right?
Not exactly.
The restore test complained about the unknown parameter lock_timeout
The backup hit its best speed since 2013
The schema was still the same as in 2013
The databases' performance was massively affected with 6 parallel jobs
The DBA found that with just 4 parallel jobs the databases worked with minimal
disruption
41. In the meanwhile...
Our DBA upgraded PG to the latest version, 9.4.
THANK YOU AGAIN BRUCE!!!!!
No more errors from the restore test.
43. 2016 - Breathe
44. A new hope
The upgrade to PG 9.4 eased the performance issues and the DBA had some time
to breathe.
The script shipping the archived WAL was improved to support multiple slaves
in a cascading replica setup
Each slave had a dedicated rsync process, configurable with compression and
protocol (rsync or rsync+ssh)
The script automatically determined the farthest slave by querying the remote
control files, and cleaned the local archive accordingly (a sketch follows)
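A sketch of the multi-slave shipper, extending the earlier one; hosts, paths
and the use of pg_controldata are assumptions.
#!/bin/bash
ARCHIVE=/var/lib/postgresql/wal_archive
SLAVES=(slave1 slave2)
# one dedicated rsync per slave; -z and the plain rsync transport shown
# here stand in for the per-slave compression/protocol configuration
for host in "${SLAVES[@]}"
do
    rsync -az "${ARCHIVE}/" "${host}:/var/lib/postgresql/wal_archive/" &
done
wait
# WAL file names sort lexicographically, so the smallest restartpoint REDO
# file among the slaves belongs to the farthest one
KEEP=""
for host in "${SLAVES[@]}"
do
    WAL=$(ssh "${host}" "pg_controldata /var/lib/postgresql/9.4/main" \
          | awk '/REDO WAL file/ {print $NF}')
    [[ -z "${KEEP}" || "${WAL}" < "${KEEP}" ]] && KEEP="${WAL}"
done
pg_archivecleanup "${ARCHIVE}" "${KEEP}"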
45. A new hope
The init standby script switched to the rsync protocol
The automated restore script used the ALTER SYSTEM command, added in PG 9.4,
to switch between the restore and the production configuration (see the sketch
below)
As a result the restore time improved to at most 9 hours for the largest BI
database (4.5 TB)
Working with BOFH JR, the DBA wrapped the backup script in the
$BACKUP_MANAGER pre and post execution hooks
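For illustration, the switch could look like the lines below; the actual
parameters toggled by the restore script are not named in the talk.
# restore mode: favour raw load speed
psql -c "ALTER SYSTEM SET maintenance_work_mem = '2GB';"
psql -c "ALTER SYSTEM SET autovacuum = 'off';"
pg_ctlcluster 9.4 main reload
# ... run the restore, then put the production defaults back
psql -c "ALTER SYSTEM SET maintenance_work_mem TO DEFAULT;"
psql -c "ALTER SYSTEM SET autovacuum TO DEFAULT;"
pg_ctlcluster 9.4 main reload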
46. The rise of the machines
In 2016 Q2 the DBA finally completed the configuration for $DEVOP_TOOL and
deployed the several scripts to the 17 BI databases with minimal effort.
48. BI database at a glance
Year | N. databases | Average size | Total size | Version
2012 | 1            | 2 TB         | 2 TB       | 9.1
2013 | 5            | 3 TB         | 15 TB      | 9.2
2014 | 9            | 2.2 TB       | 19 TB      | 9.2
2015 | 13           | 2.7 TB       | 32 TB      | 9.4
2016 | 16           | 2.5 TB       | 40 TB      | 9.4
49. A few words of wisdom
Reading the product's source code is always a good practice.
Bad design can lead to disasters, in particular if the business is successful.
It's never too early to book the CE onto an SQL training course.
"One bad programmer can easily create two new jobs a year." – David Parnas
If in doubt, ask your DBA for advice.
If you don't have a DBA, get one hired ASAP!
50. Did you say hire?
WE ARE HIRING!
https://transferwise.com/jobs/
52. Boring legal stuff
LAPD badge - source wikicommons
Montparnasse derailment - source wikipedia
Base jumper - copyright Chris McNaught
Disaster girl - source memegenerator
Blue elephant - source memecenter
Commodore 64 - source memecenter
Deadpool - source memegenerator
Thin ice - source Boating on Lake Winnebago
Boromir - source memegenerator
Sparta birds - source memestorage
Darth Vader - source memegenerator
Angry old man - source memegenerator
53. Contacts and license
Twitter: 4thdoctor scarf
Blog: http://www.pgdba.co.uk
Brighton PostgreSQL Meetup:
http://www.meetup.com/Brighton-PostgreSQL-Meetup/
This document is distributed under the terms of the Creative Commons
54. Life on a rollercoaster
Scaling the PostgreSQL backup and recovery
Federico Campoli
Transferwise
2 November 2016