In SQL, joins are a fundamental concept, yet many database engines run into serious trouble when asked to join a very large number of tables.
PostgreSQL is a pretty cool database; the question is just: how many joins can it take?
It is known that Oracle does not accept insanely long queries, and MySQL is reported to core dump at around 2,000 tables.
This talk shows how to join 1 million tables with PostgreSQL.
2. Joining 1 million tables
Hans-Jürgen Schönig
www.postgresql-support.de
3. The goal
Creating and joining 1 million tables
Pushing PostgreSQL to its limits (and definitely beyond)
How far can we push this without patching the core?
Being the first one to try this ;)
4. Creating 1 million tables is easy, no?
Here is a script:
#!/usr/bin/perl
print "BEGIN;\n";
for (my $i = 0; $i < 1000000; $i++)
{
    print "CREATE TABLE t_tab_$i (id int4);\n";
}
print "COMMIT;\n";
5. Creating 1 million tables
Who expects this to work?
6. Well, it does not ;)
CREATE TABLE
CREATE TABLE
WARNING: out of shared memory
ERROR: out of shared memory
HINT: You might need to increase
max_locks_per_transaction.
ERROR: current transaction is aborted, commands ignored
until end of transaction block
7. Why does it fail?
There is a configuration variable called max_locks_per_transaction
max_locks_per_transaction * max_connections = maximum number of locks
100 connections x 64 locks is by far not enough for 1,000,000 tables
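As an editorial aside (not from the original deck), the pressure on the lock table can be watched from a second session while the big transaction runs; pg_locks is a stock PostgreSQL view:

-- run in a second session while the CREATE TABLE transaction is open;
-- every table created in the open transaction holds its lock until COMMIT
SELECT count(*) FROM pg_locks;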
8. A quick workaround
Use individual transactions instead of one big one
Avoid nasty disk flushes on the way
SET synchronous_commit TO off;
time ./create_tables.pl | psql test > /dev/null
real 14m45.320s
user 0m45.778s
sys 0m18.877s
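A minimal sketch of the resulting SQL stream (my illustration, not a slide from the deck): without BEGIN/COMMIT every statement autocommits, so each CREATE TABLE releases its lock right away:

SET synchronous_commit TO off;   -- skip the disk flush at each commit
CREATE TABLE t_tab_0 (id int4);  -- autocommits, lock released immediately
CREATE TABLE t_tab_1 (id int4);
-- ... and so on, one autocommitted statement per table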
9. Reporting success . . .
test=# SELECT count(tablename)
FROM pg_tables
WHERE schemaname = 'public';
count
---------
1000000
(1 row)
One million tables have been created
10. Creating an SQL statement
Writing this by hand is not feasible
Desired output:
SELECT 1
FROM t_tab_1, t_tab_2, t_tab_3, t_tab_4, t_tab_5
WHERE t_tab_1.id = t_tab_2.id AND
t_tab_2.id = t_tab_3.id AND
t_tab_3.id = t_tab_4.id AND
t_tab_4.id = t_tab_5.id AND
1 = 1;
11. A first attempt to write a script
#!/usr/bin/perl
my $joins = 5;
my $sel = "SELECT 1 ";
my $from = "";
my $where = "";
for (my $i = 1; $i < $joins; $i++)
{
    $from .= "t_tab_$i,\n";
    $where .= " t_tab_" . $i . ".id = t_tab_" . ($i + 1) . ".id AND\n";
}
$from .= " t_tab_$joins ";
$where .= " 1 = 1; ";
print "$sel\n FROM $from\n WHERE $where\n";
12. 1 million tables: What it means
If you do this for 1 million tables . . .
perl create.pl | wc
2000001 5000003 54666686
54 MB of SQL is quite a bit of code
First observation: The parser works ;)
13. Giving it a try
runtimes with PostgreSQL default settings:
n = 10: 158 ms
n = 100: 948 ms
n = 200: 3970 ms
n = 400: 18356 ms
n = 800: 87095 ms
n = 1000000: prediction = 1575 days :)
ouch, this does not look linear
1575 days was too close to the conference
14. GEQO:
At some point GEQO will kick in
Maybe it is the problem?
SET geqo TO off;
n = 10: 158 ms
n = 100: ERROR: canceling statement due to user request
Time: 680847.127 ms
It clearly is not . . .
15. It seems GEQO is needed
Let us see how far we can get by adjusting GEQO settings
SET geqo_effort TO 1;
- n = 10: 158 ms
- n = 100: 916 ms
- n = 200: 1464 ms
- n = 400: 4694 ms
- n = 800: 18896 ms
- n = 1000000: maybe 1 year? :)
=> the conference deadline is coming closer
=> still 999.000 tables to go ...
16. Trying with 10.000 tables
./make_join.pl 10000 | psql test
ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter
max_stack_depth (currently 2048kB), after ensuring
the platform’s stack depth limit is adequate.
this is easy to fix
17. Removing limitations
Problem can be fixed easily
OS: ulimit -s unlimited
postgresql.conf: set max_stack_depth to a very high value
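Concretely, a sketch of the two settings (the values are assumptions; the slide only says "very high"):

-- shell, before starting the server: ulimit -s unlimited
SET max_stack_depth TO '1GB';    -- capped by the OS stack limit, hence the ulimit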
18. Next stop: Exponential planning time
Obviously planning time goes up exponentially
There is no way to do this without patching
As long as we are way slower than linear, it is impossible to
join 1 million tables.
19. This is how PostgreSQL joins tables:
Tables a, b, c, d form the following joins of 2:
[a, b], [a, c], [a, d], [b, c], [b, d], [c, d]
By adding a single table to each group, the following joins of 3 are
constructed:
[[a, b], c], [[a, b], d], [[a, c], b], [[a, c], d], [[a, d], b], [[a, d], c], [[b, c], a], [[b, c], d], [[b, d], a], [[b, d], c], [[c, d], a], [[c, d], b]
20. More joining
Likewise, add a single table to each to get the final joins:
[[[a, b], c], d], [[[a, b], d], c], [[[a, c], b], d], [[[a, c], d], b], [[[a, d], b], c], [[[a, d], c], b], [[[b, c], a], d], [[[b, c], d], a], [[[b, d], a], c], [[[b, d], c], a], [[[c, d], a], b], [[[c, d], b], a]
Got it? ;)
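To put numbers on this explosion (an editorial back-of-the-envelope calculation, not from the deck): the count of left-deep join orders alone grows factorially, and PostgreSQL itself can compute it:

WITH RECURSIVE f(n, orders) AS (
    SELECT 1, 1::numeric
    UNION ALL
    SELECT n + 1, orders * (n + 1) FROM f WHERE n < 20
)
SELECT n, orders FROM f;
-- already at n = 20 there are 20! = 2432902008176640000 left-deep orders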
21. A little patching
A little patch ensures that the list of tables is grouped into
“sub-lists”, where the length of each sub-list is equal to
from_collapse_limit. If from_collapse_limit is 2, the list of 4
becomes . . .
[[a, b], [c, d]]
[a, b] and [c, d] are joined separately, so we only get
one “path”:
[[a, b], [c, d]]
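For context (editorial note): from_collapse_limit is an ordinary runtime setting even without the patch; it bounds how many FROM-list items the planner merges for exhaustive join search:

SHOW from_collapse_limit;        -- 8 by default
SET from_collapse_limit TO 2;    -- the value used for the patched grouping above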
22. The downside:
This ensures minimal planning time but is also no real
optimization.
There is no potential for GEQO to improve things
Speed is not an issue in our case anyway. It just has to work
“somehow”.
23. Are we on a path to success?
Of course not ;)
24. The next limitation
the join fails again
ERROR: too many range table entries
25. This time life is easy
Solution: in include/nodes/primnodes.h
#define INNER_VAR 65000
/* reference to inner subplan */
#define OUTER_VAR 65001
/* reference to outer subplan */
#define INDEX_VAR 65002
/* reference to index column */
to
#define INNER_VAR 2000000
#define OUTER_VAR 2000001
#define INDEX_VAR 2000002
26. Next stop: RAM starts to be an issue
| Tables (k) | Planning time (s) | Memory  |
|------------|-------------------|---------|
|          2 |                 1 | 38.5 MB |
|          4 |                 3 | 75.6 MB |
|          8 |                31 | 192 MB  |
|         16 |               186 | 550 MB  |
|         32 |              1067 | 1.7 GB  |
(this is for planning only)
Expected amount of memory: not available ;)
27. Is there a way out?
Clearly: The current path leads to nowhere
Some other approach is needed
Divide and conquer?
28. Changing the query
Currently there is . . .
SELECT 1 ...
FROM tab1, ..., tabn
WHERE ...
The problem must be split
29. CTEs might come to the rescue
How about creating WITH-clauses joining smaller sets of
tables?
At the end those WITH statements could be joined together
easily.
1.000 CTEs with 1.000 tables each gives 1.000.000 tables
Scary stuff: Runtime + memory
Unforeseen stuff
30. A skeleton query
SET enable_indexscan TO off;
SET enable_hashjoin TO off;
SET enable_mergejoin TO off;
SET from_collapse_limit TO 2;
SET max_stack_depth TO ’1GB’;
WITH cte_0(id_1, id_2) AS (
SELECT t_tab_0.id, t_tab_16383.id
FROM t_tab_0, ...
cte_1 ...
SELECT ... FROM cte_0, cte_1 WHERE ...
31. Finally: A path to enlightenment
With all the RAM we had . . .
With all the patience we had . . .
With a fair amount of luck . . .
And with a lot of confidence . . .
32. Finally: It worked . . .
Amount of RAM needed: around 40 GB
Runtime: 10h 36min 53 sec
The result:
?column?
----------
(0 rows)