Business intelligence and analytics are at the core of any great company, and Transferwise is no exception.
The talk will start with a brief history of the legacy analytics implemented with MySQL and how we scaled up the performance using PostgreSQL. In order to get fresh data from the core MySQL databases in real time, we used a modified version of pg_chameleon which also obfuscated the PII data.
The talk will also cover the challenges and the lessons learned by the developers and analysts when bridging MySQL with PostgreSQL.
The ninja elephant, scaling the analytics database in Transferwise - Federico Campoli
The ninja elephant, scaling the analytics database in Transferwise
1. The ninja elephant
Scaling the analytics database in Transferwise
Federico Campoli
Transferwise
3rd February 2017
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 1 / 56
2. First rule about talks, don’t talk about the speaker
Born in 1972
Passionate about IT since 1982 mostly because of the TRON movie
Joined the Oracle DBA secret society in 2004
In love with PostgreSQL since 2006
Currently runs the Brighton PostgreSQL User group
Works at Transferwise as Data Engineer
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 2 / 56
3. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 3 / 56
4. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 4 / 56
5. We have an appointment, and we are late!
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 5 / 56
6. The Gordian Knot of analytics db
I started the data engineer job in July 2016
I was involved in a task that was not customer facing
However the task was very critical to the business
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 6 / 56
7. The Gordian Knot of analytics db
I started the data engineer job in July 2016
I was involved in a task that was not customer facing
However the task was very critical to the business
I had to fix the performance issues on the MySQL analytics database
Which performed badly, despite the considerable resources assigned to the VM
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 6 / 56
8. Tactical assessment
The existing database had the following configuration
MySQL 5.6
Innodb buffer size 60 GB
70 GB RAM
20 CPU
database size 600 GB
Looker and Tableau for running the analytic queries
The main live database replicated into the analytics database
Several schemas from the service database imported on a regular basis
One schema used for obfuscating PII and denormalising the heavy queries
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 7 / 56
9. The frog effect
If you drop a frog in a pot of boiling water, it will of course frantically try to
clamber out. But if you place it gently in a pot of tepid water and turn up the
heat, it will be slowly boiled to death.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 8 / 56
10. The frog effect
If you drop a frog in a pot of boiling water, it will of course frantically try to
clamber out. But if you place it gently in a pot of tepid water and turn up the
heat, it will be slowly boiled to death.
The performance issues worsened over a two-year span
The obfuscation was done via custom views
The data size on the MySQL master increased over time
Causing the optimiser to switch to materialisation when accessing the views
The analytics tools struggled even under normal load
In busy periods the database became almost unusable
Analysts were busy tuning existing queries rather than writing new ones
A new solution was needed
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 8 / 56
11. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 9 / 56
12. The eye of the storm
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 10 / 56
13. One size doesn't fit all
It was clear that MySQL was no longer a good fit.
However, the new solution had to meet some specific requirements.
Data updated in almost real time from the live database
PII obfuscated for the analysts
PII available in clear for the power users
The system should be able to scale out for several years
Modern SQL for better analytics queries
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 11 / 56
14. May the best database win
The analyst team shortlisted a few solutions.
Each solution partially covered the requirements.
Google BigQuery
Amazon RedShift
Snowflake
PostgreSQL
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 12 / 56
15. May the best database win
The analyst team shortlisted a few solutions.
Each solution partially covered the requirements.
Google BigQuery
Amazon RedShift
Snowflake
PostgreSQL
Google BigQuery and Amazon RedShift did not satisfy the analytics requirements
and were removed from the list.
Both PostgreSQL and Snowflake offered very good performance and modern SQL.
Neither of them offered a replication system from MySQL.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 12 / 56
16. Straight into the cloud
Snowflake is a cloud-based data warehouse service. It's built on Amazon S3 and
comes in different sizes.
Their pricing model is very appealing and the preliminary tests showed Snowflake
outperforming PostgreSQL [1].
[1] PostgreSQL on a single machine vs cloud-based parallel processing
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 13 / 56
17. Streaming copy
Using FiveTran, an impressive multi-technology data pipeline, the data would flow
in real time from our production server to Snowflake.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 14 / 56
18. Streaming copy
Using FiveTran, an impressive multi-technology data pipeline, the data would flow
in real time from our production server to Snowflake.
Unfortunately there was just one little catch.
There was no support for obfuscation.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 14 / 56
19. Customer comes first
At Transferwise we really care about our customers' data security.
Our policy for PII data is that any personal information moving outside our
perimeter shall be obfuscated.
In order to be compliant, the database accessible by FiveTran would contain only
obfuscated data.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 15 / 56
20. Proactive development
My DBA sense tingled. I foresaw the requirement and in my spare time I built
a proof of concept based on the replica tool pg_chameleon.
The tool, using a Python library, can replicate a MySQL database into
PostgreSQL.
The initial tests on a reduced dataset were successful.
It was simple to add the obfuscation in real time with minimal changes.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 16 / 56
21. And the winner is...
The initial idea was to use PostgreSQL to obfuscate the data used by FiveTran.
However, because the performance on PostgreSQL was quite good, and the
system had a good margin for scaling up, the decision was to keep the
analytics data behind our perimeter.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 17 / 56
22. And the winner is...
The initial idea was to use PostgreSQL to obfuscate the data used by FiveTran.
However, because the performance on PostgreSQL was quite good, and the
system had a good margin for scaling up, the decision was to keep the
analytics data behind our perimeter.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 17 / 56
23. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 18 / 56
24. MySQL Replica in a nutshell
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 19 / 56
25. A quick look to the replication system
Let's have a quick overview of how the MySQL replica works and how the
replicator interacts with it.
The following slides explain how pg_chameleon works, because the custom
obfuscator tool shares most concepts and code with pg_chameleon.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 20 / 56
26. MySQL Replica
The MySQL replica protocol is logical
When MySQL is configured properly the RDBMS saves the data changes into
binary log files
The slave connects to the master and gets the replication data
The replication data is saved into the slave's local relay logs
The local relay logs are replayed on the slave
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 21 / 56
28. A chameleon in the middle
pg_chameleon mimics a MySQL slave's behaviour
It connects to the master and reads the data changes
It stores the row images into a PostgreSQL table using the jsonb format
A PL/pgSQL function decodes the rows and replays the changes
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 23 / 56
29. A chameleon in the middle
pg_chameleon mimics a MySQL slave's behaviour
It connects to the master and reads the data changes
It stores the row images into a PostgreSQL table using the jsonb format
A PL/pgSQL function decodes the rows and replays the changes
PostgreSQL acts as relay log and replication slave
With an extra cool feature.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 23 / 56
30. A chameleon in the middle
pg_chameleon mimics a MySQL slave's behaviour
It connects to the master and reads the data changes
It stores the row images into a PostgreSQL table using the jsonb format
A PL/pgSQL function decodes the rows and replays the changes
PostgreSQL acts as relay log and replication slave
With an extra cool feature.
Initialises the PostgreSQL replica schema in just one command
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 23 / 56
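As an illustration of the relay-and-replay mechanism described above, a minimal SQL sketch (the table layout and names are illustrative, not necessarily pg_chameleon's actual schema):

-- relay table holding the captured MySQL row images as jsonb
CREATE TABLE sch_replica.t_log_replica (
    i_id_event     bigserial PRIMARY KEY,
    v_table_name   text NOT NULL,
    v_binlog_event text NOT NULL,   -- 'insert', 'update' or 'delete'
    jsb_event_data jsonb NOT NULL   -- the row image read from the binlog
);

-- replaying the pending insert events for one table
INSERT INTO analytics.user_details
SELECT (jsonb_populate_record(NULL::analytics.user_details, jsb_event_data)).*
FROM sch_replica.t_log_replica
WHERE v_table_name = 'user_details'
  AND v_binlog_event = 'insert';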
31. MySQL replica + pg_chameleon
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 24 / 56
32. Log formats
MySQL supports different formats for the binary logs.
The STATEMENT format logs the statements, which are replayed on the slave.
It seems the best solution for performance.
However, replaying queries with non-deterministic elements generates
inconsistent slaves (e.g. inserts with uuid).
The ROW format is deterministic. It logs the row image and the DDL queries.
This is the format required for pg_chameleon to work.
MIXED takes the best of both worlds. The master logs the statements unless
a non-deterministic element is used; in that case it logs the row image.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 25 / 56
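For reference, checking and switching the binary log format on the MySQL side looks roughly like this (a hedged sketch; on a production master the format is normally set in the server configuration rather than at runtime):

-- check the current binary log format
SHOW VARIABLES LIKE 'binlog_format';
-- requires the SUPER privilege; equivalent to binlog_format = ROW in the config
SET GLOBAL binlog_format = 'ROW';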
33. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 26 / 56
34. How we did it
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 27 / 56
35. Replica and obfuscation
I built a minimum viable product for pg_chameleon.
The project was forked into a Transferwise-owned repository for the customisation.
The obfuscation capabilities and other specific procedures, like the daily data
aggregation, were added.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 28 / 56
36. Mighty morphing power elephant
The replica initialisation locks the MySQL tables in read-only mode.
To avoid locking the main database for several hours, a secondary MySQL
replica is set up with the local query logging enabled.
The cascading replica also allows using the ROW binlog format, as the master
uses MIXED for performance reasons.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 29 / 56
37. This is what awesome looks like!
A MySQL master is replicated into a MySQL slave
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 30 / 56
38. This is what awesome looks like!
A MySQL master is replicated into a MySQL slave
The slave logs the row changes locally in ROW format
PostgreSQL reads the slave’s replica and obfuscates the data in realtime!
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 30 / 56
39. This is what awesome looks like!
A MySQL master is replicated into a MySQL slave
The slave logs the row changes locally in ROW format
PostgreSQL reads the slave’s replica and obfuscates the data in realtime!
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 30 / 56
40. Replica initialisation
The replica initialisation follows the same rules as any MySQL replica setup
Flush the tables with read lock
Get the master’s coordinates
Copy the data
Release the locks
The procedure pulls the data out of MySQL in CSV format for a fast load into
PostgreSQL with the COPY command.
This approach requires a tricky SQL statement.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 31 / 56
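A sketch of the initialisation sequence under the rules above (the statements are illustrative; the real procedure is driven by the Python tool):

-- on MySQL: freeze the tables and record the master's coordinates
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;   -- binlog file name and position to start the replica from

-- the tables are exported as CSV slices while the lock is held,
-- then the lock is released
UNLOCK TABLES;

-- on PostgreSQL: each CSV slice is bulk loaded with COPY
COPY analytics.user_details FROM STDIN WITH (FORMAT csv);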
41. First generate the select list
SELECT
    CASE
        WHEN data_type="enum"
        THEN
            SUBSTRING(COLUMN_TYPE,5)
    END AS enum_list,
    CASE
        WHEN
            data_type IN ('"""+"','".join(self.hexify)+"""')
        THEN
            concat('hex(',column_name,')')
        WHEN
            data_type IN ('bit')
        THEN
            concat('cast(`',column_name,'` AS unsigned)')
        ELSE
            concat('`',column_name,'`')
    END
    AS column_csv
FROM
    information_schema.COLUMNS
WHERE
    table_schema=%s
    AND table_name=%s
ORDER BY
    ordinal_position
;
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 32 / 56
42. Then use it in the MySQL query
csv_data=""
sql_out="SELECT "+columns_csv+" as data FROM "+table_name+";"
self.mysql_con.connect_db_ubf()
try:
    self.logger.debug("Executing query for table %s" % (table_name, ))
    self.mysql_con.my_cursor_ubf.execute(sql_out)
except:
    self.logger.debug("an error occurred when pulling out the data from the table %s - sql executed: %s" % (table_name, sql_out))
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 33 / 56
43. Fallback on failure
The CSV data is pulled out in slices in order to avoid memory overload.
The file is then pushed into PostgreSQL using the COPY command.
However...
COPY is fast but runs as a single transaction
One failure and the entire batch is rolled back
If this happens the procedure loads the same data using INSERT statements
Which can be very slow
But at least it discards only the problematic rows
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 34 / 56
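A hedged Python sketch of that fallback (the function and names are illustrative, not the tool's actual code): try the fast COPY first and, when it fails, replay the same slice row by row so that only the bad rows are discarded.

import csv
import io

import psycopg2

def load_slice(pg_conn, table, rows):
    # rows is a non-empty list of tuples pulled out of MySQL
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    cur = pg_conn.cursor()
    try:
        # fast path: a single COPY, a single transaction
        cur.copy_expert("COPY %s FROM STDIN WITH CSV" % table, buf)
        pg_conn.commit()
    except psycopg2.Error:
        pg_conn.rollback()
        # slow path: one INSERT per row, discarding only the problematic ones
        placeholders = ",".join(["%s"] * len(rows[0]))
        for row in rows:
            try:
                cur.execute("INSERT INTO %s VALUES (%s)" % (table, placeholders), row)
                pg_conn.commit()
            except psycopg2.Error:
                pg_conn.rollback()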
44. Obfuscation setup
A simple YAML file lists the table, the column and the obfuscation strategy
user_details:
  last_name:
    mode: normal
    nonhash_start: 0
    nonhash_length: 0
  phone_number:
    mode: normal
    nonhash_start: 1
    nonhash_length: 2
  date_of_birth:
    mode: date
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 35 / 56
45. Obfuscation when initialising
The obfuscation process is quite simple and uses the pgcrypto extension for
hashing in sha256.
When the replica is initialised the data is copied into the schema in clear
The table locks are released
The tables with PII are copied and obfuscated in a separate schema
The process builds the indices on the schemas with data in clear and
obfuscated
The tables without PII data are exposed to the normal users using simple
views
All the varchar fields in the obfuscated schema are converted into text fields
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 36 / 56
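As an illustration of the hashing step (the table and columns are made up, and this is not the exact SQL the tool generates), pgcrypto's digest() can be combined with substring() so that the configured clear-text prefix is preserved:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- phone_number: nonhash_start 1, nonhash_length 2 keeps the first two
-- characters in clear and hashes the remainder with sha256
SELECT
    substring(phone_number from 1 for 2)
        || encode(digest(substring(phone_number from 3), 'sha256'), 'hex')
        AS phone_number,
    encode(digest(last_name, 'sha256'), 'hex') AS last_name
FROM user_details_clear;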
46. Obfuscation on the fly
The obfuscation is also applied when the data is replicated.
The approach is very simple.
When a row image is captured the process checks if the table contains PII
data
In that case the process generates a second jsonb element with the PII data
obfuscated
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 37 / 56
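A minimal Python sketch of that second element (the row image is shown as a plain dict; the names and hashing details are assumptions, not the tool's exact code):

import hashlib

def obfuscate_row(row_image, pii_columns):
    # return a copy of the row image with the PII columns hashed in sha256
    obfuscated = dict(row_image)
    for column in pii_columns:
        value = obfuscated.get(column)
        if value is not None:
            obfuscated[column] = hashlib.sha256(str(value).encode()).hexdigest()
    return obfuscated

# the clear image and the obfuscated copy become the two jsonb elements
# stored with the captured event
clear_image = {"last_name": "Smith", "phone_number": "+447700900123"}
obfuscated_image = obfuscate_row(clear_image, ["last_name", "phone_number"])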
48. The DDL. A real pain in the back
The DDL replica is possible with a little trick.
MySQL even in ROW format emits the DDL as statements
A regular expression traps the DDL like CREATE/DROP TABLE or ALTER
TABLE.
The mysql library gets the table’s metadata from the information schema
The metadata is used to build the DDL in the PostgreSQL dialect
This approach may not be elegant but is quite robust.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 39 / 56
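A sketch of the trapping step (the pattern and names are illustrative; the real tool handles more cases):

import re

# trap CREATE/DROP/ALTER TABLE statements coming through the binlog
DDL_PATTERN = re.compile(
    r"^\s*(CREATE\s+TABLE|DROP\s+TABLE|ALTER\s+TABLE)\s+",
    re.IGNORECASE)

def is_ddl(statement):
    # True when the statement is a DDL that must be rebuilt for PostgreSQL
    return DDL_PATTERN.match(statement) is not None

# when a DDL is trapped, the table metadata is read from
# information_schema and translated into the PostgreSQL dialect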
49. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 40 / 56
52. Resource comparison
Resource MySQL PostgreSQL
Storage Size 940 GB 664 GB
Server CPUs 18 8
Server Memory 68 GB 48 GB
Shared Memory 50 GB 5 GB
Max connections 500 100
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 43 / 56
53. Advantages using PostgreSQL
Stronger security model
Better resource optimisation (See previous slide)
No invalid views
No performance issues with views
Complex analytics functions
partitioning (thanks pg_pathman!)
BRIN indices
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 44 / 56
54. Advantages using PostgreSQL
Stronger security model
Better resource optimisation (See previous slide)
No invalid views
No performance issues with views
Complex analytics functions
partitioning (thanks pg_pathman!)
BRIN indices
some code was optimised inside, but actually very little - maybe 10-20% was
improved. We’ll do more of that in the future, but not yet. The good thing is that
the performance gains we have can mostly be attributed just to PG vs MySQL. So
there’s a lot of scope to improve further.
Jeff McClelland - Growth Analyst, data guru
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 44 / 56
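For instance, a BRIN index on a large append-only, timestamp-ordered table is a fraction of the size of a b-tree (the table and column below are illustrative, not the actual analytics schema):

-- block range index, well suited to large append-only tables
CREATE INDEX idx_transfers_created_brin
    ON analytics.transfers USING brin (created_at);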
55. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 45 / 56
57. init replica tune
The replica initialisation required several improvements.
The first init replica implementation didn’t complete.
The OOM killer killed the process when the memory usage was too high.
In order to speed up the replica, some large tables not required in the
analytics db were excluded from the init replica.
Some tables required a custom slice size because the row length triggered
the OOM killer again.
Estimating the total rows for the user's feedback is faster but the output can be
odd.
Using unbuffered cursors improves the speed and the memory usage.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 47 / 56
58. init replica tune
The replica initialisation required several improvements.
The first init replica implementation didn’t complete.
The OOM killer killed the process when the memory usage was too high.
In order to speed up the replica, some large tables not required in the
analytics db were excluded from the init replica.
Some tables required a custom slice size because the row length triggered
the OOM killer again.
Estimating the total rows for the user's feedback is faster but the output can be
odd.
Using unbuffered cursors improves the speed and the memory usage.
However, even after fixing the memory issues the initial copy took 6 days.
Tuning the copy speed with the unbuffered cursors and the row number estimates
improved the initial copy, which now completes in 30 hours,
including the time required for the index build.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 47 / 56
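The unbuffered fetch can be sketched like this (assuming a PyMySQL server-side cursor; the connection details and slice size are illustrative):

import pymysql
import pymysql.cursors

conn = pymysql.connect(host="mysql-replica", user="repl", password="secret",
                       database="service_db",
                       cursorclass=pymysql.cursors.SSCursor)  # unbuffered cursor
with conn.cursor() as cursor:
    cursor.execute("SELECT * FROM big_table")
    while True:
        rows = cursor.fetchmany(10000)   # pull one slice at a time
        if not rows:
            break
        # each slice is written out as CSV and loaded into PostgreSQL with COPY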
59. Strictness is an illusion. MySQL doubly so
MySQL’s lack of strictness is not a mystery.
The replica broke down several times because of the funny way NOT NULL is
managed by MySQL.
To prevent any further replica breakdowns, the fields added as NOT NULL with
ALTER TABLE are always created as NULLable in PostgreSQL.
MySQL automatically truncates character strings at the varchar size. This
is a problem if the field is obfuscated on PostgreSQL, because the hashed string
might not fit into the corresponding varchar field. Therefore all the character
varying fields in the obfuscated schema are converted to text.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 48 / 56
60. I feel your lack of constraint disturbing
Rubbish data in MySQL can be stored without errors raised by the DBMS.
When this happens the replicator traps the error when the change is replayed on
PostgreSQL and discards the problematic row.
The value is logged on the replica’s log, available for further actions.
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 49 / 56
61. Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 50 / 56
65. Contacts and license
Twitter: 4thdoctor_scarf
Transferwise: https://transferwise.com/
Blog: http://www.pgdba.co.uk
Meetup: http://www.meetup.com/Brighton-PostgreSQL-Meetup/
This document is distributed under the terms of the Creative Commons
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 54 / 56
66. Boring legal stuff
The 4th doctor meme - source memecrunch.com
The eye, phantom playground, light end tunnel - Copyright Federico Campoli
The dolphin picture - Copyright artnoose
It could work. Young Frankenstein - source quickmeme
Deadpool Clap - source memegenerator
Deadpool Maximum Effort - source Deadpool Zoeiro
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 55 / 56
67. The ninja elephant
Scaling the analytics database in Transferwise
Federico Campoli
Transferwise
3rd February 2017
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 56 / 56