The Pivotal Greenplum-Spark Connector provides high speed, parallel data transfer between Greenplum Database and Apache Spark clusters to support:
- Interactive data analysis
- In-memory analytics processing
- Batch ETL
- Continuous ETL pipeline (streaming)
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
Foundations of streaming SQL: stream & table theoryDataWorks Summit
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how can all of this work in a programmatic framework like Apache Beam? The presentation answers these questions and more as it walks you through key concepts underpinning data processing in general.
Presentation explores the relationship between the Beam model (as described in paper “The Dataflow Mode”and the “Streaming 101”and “Streaming 102” blog posts) and stream and table theory (as popularized by Martin Kleppmann and Jay Kreps, among others).
It turns out that stream and table theory does an illuminating job of describing the low-level concepts that underlie the Beam model.
The presentation explains what is required to provide robust stream processing support in SQL and discusses the concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, as well as new ideas yet to come. You’ll leave with a much better understanding of the key concepts underpinning data processing—regardless of whether that data processing is batch or streaming or SQL or programmatic—as well as a concrete notion of what robust stream processing in SQL looks like.
Speaker
Anton Kedin, Google, Software Engineer
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
In this deck from the 2018 Swiss HPC Conference, Axel Koehler from NVIDIA presents: The Convergence of HPC and Deep Learning.
"The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before. The technology originally developed for HPC has enabled deep learning, and deep learning is enabling many usages in science. Deep learning is also helping deliver real-time results with models that used to take days or months to simulate. The presentation will give an overview about the latest hard- and software developments for HPC and Deep Learning from NVIDIA and will show some examples that Deep Learning can be combined with traditional large scale simulations."
Watch the video: https://wp.me/p3RLHQ-ijM
Learn more: http://nvidia.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
Effectively leveraging fast networking and storage hardware (e.g., RDMA, NVMe, etc.) in Apache Spark remains challenging. Current ways to integrate the hardware at the operating system level fall short, as the hardware performance advantages are shadowed by higher layer software overheads. This session will show how to integrate RDMA and NVMe hardware in Spark in a way that allows applications to bypass both the operating system and the Java virtual machine during I/O operations. With such an approach, the hardware performance advantages become visible at the application level, and eventually translate into workload runtime improvements. Stuedi will demonstrate how to run various Spark workloads (e.g, SQL, Graph, etc.) effectively on 100Gbit/s networks and NVMe flash.
The Pivotal Greenplum-Spark Connector provides high speed, parallel data transfer between Greenplum Database and Apache Spark clusters to support:
- Interactive data analysis
- In-memory analytics processing
- Batch ETL
- Continuous ETL pipeline (streaming)
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
Foundations of streaming SQL: stream & table theoryDataWorks Summit
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how can all of this work in a programmatic framework like Apache Beam? The presentation answers these questions and more as it walks you through key concepts underpinning data processing in general.
Presentation explores the relationship between the Beam model (as described in paper “The Dataflow Mode”and the “Streaming 101”and “Streaming 102” blog posts) and stream and table theory (as popularized by Martin Kleppmann and Jay Kreps, among others).
It turns out that stream and table theory does an illuminating job of describing the low-level concepts that underlie the Beam model.
The presentation explains what is required to provide robust stream processing support in SQL and discusses the concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, as well as new ideas yet to come. You’ll leave with a much better understanding of the key concepts underpinning data processing—regardless of whether that data processing is batch or streaming or SQL or programmatic—as well as a concrete notion of what robust stream processing in SQL looks like.
Speaker
Anton Kedin, Google, Software Engineer
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
In this deck from the 2018 Swiss HPC Conference, Axel Koehler from NVIDIA presents: The Convergence of HPC and Deep Learning.
"The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before. The technology originally developed for HPC has enabled deep learning, and deep learning is enabling many usages in science. Deep learning is also helping deliver real-time results with models that used to take days or months to simulate. The presentation will give an overview about the latest hard- and software developments for HPC and Deep Learning from NVIDIA and will show some examples that Deep Learning can be combined with traditional large scale simulations."
Watch the video: https://wp.me/p3RLHQ-ijM
Learn more: http://nvidia.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
Effectively leveraging fast networking and storage hardware (e.g., RDMA, NVMe, etc.) in Apache Spark remains challenging. Current ways to integrate the hardware at the operating system level fall short, as the hardware performance advantages are shadowed by higher layer software overheads. This session will show how to integrate RDMA and NVMe hardware in Spark in a way that allows applications to bypass both the operating system and the Java virtual machine during I/O operations. With such an approach, the hardware performance advantages become visible at the application level, and eventually translate into workload runtime improvements. Stuedi will demonstrate how to run various Spark workloads (e.g, SQL, Graph, etc.) effectively on 100Gbit/s networks and NVMe flash.
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningDataWorks Summit
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. User can indicate Spark whether it should be using the MapReduce mode or barrier mode at each stage of the pipeline, thus it’s easy to embed distributed deep learning training as a Spark stage to simplify the training workflow. In this talk, I will demonstrate how to build a real case pipeline which combines data processing with Spark and deep learning training with TensorFlow step by step. I will also share the best practices and hands-on experiences to show the power of this new features, and bring more discussion on this topic.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon
HTrace is a new Apache incubator project which makes it much easier to diagnose and detect performance problems in HBase. It provides a unified view of the performance of requests, following them from their origin in the HBase client, through the HBase region servers, and finally into HDFS. System administrators can use a central web interface to query and view aggregate performance information for the whole cluster. This talk will cover the motivations for creating HTrace, its design, and some examples of how HTrace can help diagnose real-world HBase problems.
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouDatabricks
Tuning Apache Spark can be complex and difficult, since there are many different configuration parameters and metrics. As the Spark applications running on LinkedIn’s clusters become more diverse and numerous, it is no longer feasible for a small team of Spark experts to help individual users debug and tune their Spark applications. Users need to be able to get advice quickly and iterate on their development, and any problems need to be caught promptly to keep the cluster healthy.
In order to achieve this, we automated the process of identifying performance issues and providing custom tuning advice to users, and made improvements for scaling to handle thousands of Spark applications per day. We leverage Spark History Server (SHS) to gather application metrics, but as the number of Spark applications and size of individual applications have increased, the SHS has not been able to keep up. It can fall hours behind during peak usage. We will discuss changes to the SHS to improve efficiency, performance and stability, enabling SHS to analyze large amount of logs. Another challenge we encountered was a lack of proper metrics related to Spark application performance. We will present new metrics added to Spark which can precisely report resource usage during runtime, and discuss how these are used in heuristics to identify problems.
Based on this analysis, custom recommendations are provided to help users tune their applications. We will also show the impact provided by these tuning recommendations, including improvements in application performance itself and the overall cluster utilization.
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses; including increased developer agility, and the ability to move applications between on-premises servers, cloud instances, and across data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front end UIs and simple, content-centric experiences) are often great candidates as stateless applications since HTTP is stateless by nature. There is no dependency on the local container storage for the stateless workload.
Stateful applications, on the other hand, are services that require backing storage and keeping state is critical to running the service. Hadoop, Spark and to lesser extent, noSQL platforms such as Cassandra, MongoDB, Postgres, and mySQL are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
In this presentation from the GPU Technology Conference, Wyatt Gorman from Google and Abhishek Gupta from Schlumberger present: Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger.
"Demand for GPUs in High Performance Computing is only growing, and it is costly and difficult to keep pace in an entirely on-premise environment. We will hear from Schlumberger on why and how they are utilizing cloud-based GPU-enabled computing resources from Google Cloud to supply their users with the computing power they need, from exploration and modeling to visualization."
Watch the video: https://wp.me/p3RLHQ-kcl
Learn more: https://www.blog.google/products/google-cloud/schlumberger-chooses-gcp-to-deliver-new-oil-and-gas-technology-platform/
and
https://www.nvidia.com/en-us/gtc/
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
Have you ever wondered how to implement your own operator pattern for you service X in Kubernetes? You can learn this in this session and see an example of open-source project that does spawn Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into operator pattern not only in Go lang but also in Java. The pod with spark operator and optionally the spark clusters expose the metrics for Prometheus so it makes it eas
Using BigBench to compare Hive and Spark (short version)Nicolas Poggi
BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and how problems can it solve. Then, presents both Hive and Spark benchmark results with with their respective 1 and 2 versions under different configurations. Results are further classified by use cases, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings, the scalability and limits of each framework.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBYugabyteDB
Slides for Amey Banarse's, Principal Data Architect at Yugabyte, "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" webinar recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
Best practices for optimizing Red Hat platforms for large scale datacenter de...Jeremy Eder
This presentation is from NVIDIA GTC DC on Oct 23, 2018:
https://youtu.be/z5gEUL6dJRI
Corresponding Press Release: https://www.redhat.com/en/about/press-releases/red-hat-nvidia-align-open-source-solutions-fuel-emerging-workloads
Blog: https://www.redhat.com/en/blog/red-hat-and-nvidia-positioning-red-hat-enterprise-linux-and-openshift-primary-platforms-artificial-intelligence-and-other-gpu-accelerated-workloads
Demo Video:
https://www.youtube.com/watch?v=9iVYjA_WJgU
Apache Spark 2.4 Bridges the Gap Between Big Data and Deep LearningDataWorks Summit
Big data and AI are joined at the hip: AI applications require massive amounts of training data to build state-of-the-art models. The problem is, big data frameworks like Apache Spark and distributed deep learning frameworks like TensorFlow don’t play well together due to the disparity between how big data jobs are executed and how deep learning jobs are executed.
Apache Spark 2.4 introduced a new scheduling primitive: barrier scheduling. User can indicate Spark whether it should be using the MapReduce mode or barrier mode at each stage of the pipeline, thus it’s easy to embed distributed deep learning training as a Spark stage to simplify the training workflow. In this talk, I will demonstrate how to build a real case pipeline which combines data processing with Spark and deep learning training with TensorFlow step by step. I will also share the best practices and hands-on experiences to show the power of this new features, and bring more discussion on this topic.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon
HTrace is a new Apache incubator project which makes it much easier to diagnose and detect performance problems in HBase. It provides a unified view of the performance of requests, following them from their origin in the HBase client, through the HBase region servers, and finally into HDFS. System administrators can use a central web interface to query and view aggregate performance information for the whole cluster. This talk will cover the motivations for creating HTrace, its design, and some examples of how HTrace can help diagnose real-world HBase problems.
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouDatabricks
Tuning Apache Spark can be complex and difficult, since there are many different configuration parameters and metrics. As the Spark applications running on LinkedIn’s clusters become more diverse and numerous, it is no longer feasible for a small team of Spark experts to help individual users debug and tune their Spark applications. Users need to be able to get advice quickly and iterate on their development, and any problems need to be caught promptly to keep the cluster healthy.
In order to achieve this, we automated the process of identifying performance issues and providing custom tuning advice to users, and made improvements for scaling to handle thousands of Spark applications per day. We leverage Spark History Server (SHS) to gather application metrics, but as the number of Spark applications and size of individual applications have increased, the SHS has not been able to keep up. It can fall hours behind during peak usage. We will discuss changes to the SHS to improve efficiency, performance and stability, enabling SHS to analyze large amount of logs. Another challenge we encountered was a lack of proper metrics related to Spark application performance. We will present new metrics added to Spark which can precisely report resource usage during runtime, and discuss how these are used in heuristics to identify problems.
Based on this analysis, custom recommendations are provided to help users tune their applications. We will also show the impact provided by these tuning recommendations, including improvements in application performance itself and the overall cluster utilization.
There is increased interest in using Kubernetes, the open-source container orchestration system for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle cloud native stateless and stateful Big Data applications. However, stateful, multi-service Big Data cluster orchestration brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses; including increased developer agility, and the ability to move applications between on-premises servers, cloud instances, and across data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services (such as front end UIs and simple, content-centric experiences) are often great candidates as stateless applications since HTTP is stateless by nature. There is no dependency on the local container storage for the stateless workload.
Stateful applications, on the other hand, are services that require backing storage and keeping state is critical to running the service. Hadoop, Spark and to lesser extent, noSQL platforms such as Cassandra, MongoDB, Postgres, and mySQL are great examples. They require some form of persistent storage that will survive service restarts...
Speakers
Anant Chintamaneni, VP Products, BlueData
Nanda Vijaydev, Director Solutions, BlueData
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
In this presentation from the GPU Technology Conference, Wyatt Gorman from Google and Abhishek Gupta from Schlumberger present: Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger.
"Demand for GPUs in High Performance Computing is only growing, and it is costly and difficult to keep pace in an entirely on-premise environment. We will hear from Schlumberger on why and how they are utilizing cloud-based GPU-enabled computing resources from Google Cloud to supply their users with the computing power they need, from exploration and modeling to visualization."
Watch the video: https://wp.me/p3RLHQ-kcl
Learn more: https://www.blog.google/products/google-cloud/schlumberger-chooses-gcp-to-deliver-new-oil-and-gas-technology-platform/
and
https://www.nvidia.com/en-us/gtc/
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
Have you ever wondered how to implement your own operator pattern for you service X in Kubernetes? You can learn this in this session and see an example of open-source project that does spawn Apache Spark clusters on Kubernetes and OpenShift following the pattern. You will leave this talk with a better understanding of how spark-on-k8s native scheduling mechanism can be leveraged and how you can wrap your own service into operator pattern not only in Go lang but also in Java. The pod with spark operator and optionally the spark clusters expose the metrics for Prometheus so it makes it eas
Using BigBench to compare Hive and Spark (short version)Nicolas Poggi
BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and how problems can it solve. Then, presents both Hive and Spark benchmark results with with their respective 1 and 2 versions under different configurations. Results are further classified by use cases, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings, the scalability and limits of each framework.
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBYugabyteDB
Slides for Amey Banarse's, Principal Data Architect at Yugabyte, "Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB" webinar recorded on Oct 30, 2019 at 11 AM Pacific.
Playback here: https://vimeo.com/369929255
Best practices for optimizing Red Hat platforms for large scale datacenter de...Jeremy Eder
This presentation is from NVIDIA GTC DC on Oct 23, 2018:
https://youtu.be/z5gEUL6dJRI
Corresponding Press Release: https://www.redhat.com/en/about/press-releases/red-hat-nvidia-align-open-source-solutions-fuel-emerging-workloads
Blog: https://www.redhat.com/en/blog/red-hat-and-nvidia-positioning-red-hat-enterprise-linux-and-openshift-primary-platforms-artificial-intelligence-and-other-gpu-accelerated-workloads
Demo Video:
https://www.youtube.com/watch?v=9iVYjA_WJgU
June 24, 2014. At Velocity 2014, Fastly engineer Vladimir Vuksan gave an intro to Ganglia concepts (grid, clusters, hosts) as well as an installation of a sample monitoring grid. He also goes through the following commonly used visualization tools and how they may aid in detecting issues, identifying causes, and taking corrective action:
- Cluster/Grid Views
- Aggregate graphs
- Compare Hosts
- Custom graph functionality
- Views
- Interactive graphs
- Trending
- Nagios/Alerting system integration
- How to add metrics to Ganglia
- Different export formats such as JSON, CSV, and XML
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022StreamNative
Starting with version 2.10, the Apache ZooKeeper dependency has been eliminated and replaced with a pluggable framework that enables you to reduce the infrastructure footprint of Apache Pulsar by leveraging alternative metadata and coordination systems based on your deployment environment. In this talk, walk through the steps required to utilize the existing etcd service running inside Kubernetes to act as Pulsar's metadata store, thereby eliminating the need to run ZooKeeper entirely, leaving you with a Zookeeper-less Pulsar.
Presentation slides from DevConf.cz 2017
Challenges, take-aways and recommendations on scaling up OpenShift's logging and metrics stack.
Authors:
Ricardo Lourenço:
https://www.linkedin.com/in/ricardopereira4it/
Elvir Kuric
https://www.linkedin.com/in/elvirkuric/
Distributed Computing on PostgreSQL | PGConf EU 2017 | Marco SlotCitus Data
One of the unique things about postgres is that extensions can add new functionality that falls outside the scope of a SQL database. Several postgres extensions add the ability to perform commands or query data on other servers. These extensions can be combined in interesting ways to form advanced distributed systems on top of postgres.
In this talk we will explore how extensions such as dblink, postgres_fdw, pglogical, pg_cron, and citus together with PL/pgSQL can be used as building blocks for distributed systems. We will give several demonstrations of using PostgreSQL as a distributed computing platform, including a MapReduce implementation that can transform very large tables, and a Kafka-like distributed queue.
Rudder recently got new features allowing to integrate data from various sources into the configuration policies. This talk will cover the data management workflow in Rudder, including the improvements in 4.0 and 4.1, focusing on real practical usecases.
In particular, we will go through the possible data flows: the data sources, that can be local to the server, the node or fetched from a remote API or another node, the data manipulation tools, in the server or in the policies, and finally the ways to use this data in the policies (as directive parameters, templating data, etc.)
This is a presentation I presented at NVIDIA AI Conference in Korea. It's about building the largest GPU - DGX-2, the most powerful supercomputer in one node.
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...Ontico
HighLoad++ 2017
Зал «Кейптаун», 7 ноября, 10:00
Тезисы:
http://www.highload.ru/2017/abstracts/2941.html
Шардинг в PostgreSQL - животрепещущая тема. Задача непростая и объемная, поэтому в сообществе пока нет единого плана, как ее решать. Мы расскажем о нашем экспериментальном подходе к шардингу, основанному на нескольких активно развивающихся технологиях - механизме FDW, расширении pg_pathman и логической репликации, вошедшей в ядро 10-ой версии. Подход претворяется в жизнь расширением pg_shardman, которое обитает здесь: https://github.com/postgrespro/pg_shardman
...
Similar to Greenplum Overview for Postgres Hackers - Greenplum Summit 2018 (20)
The Tanzu Developer Connect is a hands-on workshop that dives deep into TAP. Attendees receive a hands on experience. This is a great program to leverage accounts with current TAP opportunities.
The Tanzu Developer Connect is a hands-on workshop that dives deep into TAP. Attendees receive a hands on experience. This is a great program to leverage accounts with current TAP opportunities.
Utilocate offers a comprehensive solution for locate ticket management by automating and streamlining the entire process. By integrating with Geospatial Information Systems (GIS), it provides accurate mapping and visualization of utility locations, enhancing decision-making and reducing the risk of errors. The system's advanced data analytics tools help identify trends, predict potential issues, and optimize resource allocation, making the locate ticket management process smarter and more efficient. Additionally, automated ticket management ensures consistency and reduces human error, while real-time notifications keep all relevant personnel informed and ready to respond promptly.
The system's ability to streamline workflows and automate ticket routing significantly reduces the time taken to process each ticket, making the process faster and more efficient. Mobile access allows field technicians to update ticket information on the go, ensuring that the latest information is always available and accelerating the locate process. Overall, Utilocate not only enhances the efficiency and accuracy of locate ticket management but also improves safety by minimizing the risk of utility damage through precise and timely locates.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Looking for a reliable mobile app development company in Noida? Look no further than Drona Infotech. We specialize in creating customized apps for your business needs.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
AI Genie Review: World’s First Open AI WordPress Website CreatorGoogle
AI Genie Review: World’s First Open AI WordPress Website Creator
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-genie-review
AI Genie Review: Key Features
✅Creates Limitless Real-Time Unique Content, auto-publishing Posts, Pages & Images directly from Chat GPT & Open AI on WordPress in any Niche
✅First & Only Google Bard Approved Software That Publishes 100% Original, SEO Friendly Content using Open AI
✅Publish Automated Posts and Pages using AI Genie directly on Your website
✅50 DFY Websites Included Without Adding Any Images, Content Or Doing Anything Yourself
✅Integrated Chat GPT Bot gives Instant Answers on Your Website to Visitors
✅Just Enter the title, and your Content for Pages and Posts will be ready on your website
✅Automatically insert visually appealing images into posts based on keywords and titles.
✅Choose the temperature of the content and control its randomness.
✅Control the length of the content to be generated.
✅Never Worry About Paying Huge Money Monthly To Top Content Creation Platforms
✅100% Easy-to-Use, Newbie-Friendly Technology
✅30-Days Money-Back Guarantee
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
#AIGenieApp #AIGenieBonus #AIGenieBonuses #AIGenieDemo #AIGenieDownload #AIGenieLegit #AIGenieLiveDemo #AIGenieOTO #AIGeniePreview #AIGenieReview #AIGenieReviewandBonus #AIGenieScamorLegit #AIGenieSoftware #AIGenieUpgrades #AIGenieUpsells #HowDoesAlGenie #HowtoBuyAIGenie #HowtoMakeMoneywithAIGenie #MakeMoneyOnline #MakeMoneywithAIGenie
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
GraphSummit Paris - The art of the possible with Graph TechnologyNeo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
2. Greenplum
Greenplum is PostgreSQL
● with Massively Parallel Processing bolted on
● with lots of other extra features and addons
Greenplum is an Open Source project
7. Terminology QD/QE
Query Dispatcher (aka master)
● Accept client connections
● Manage distributed transactions
● Create query plans
Query Executor (aka segment)
● Store the data
● Exeute slices of the query
8. Deployment
● One QD, as many QE nodes as needed
● Multiple QEs on one physical server
○ To utilize Multiple CPU cores
● Gigabit ethernet
● Firewall
9. QD / QE roles
● Catalogs are the same in all nodes
○ OIDs of objects must match
● gpcheckcat to verify
● Utility mode
PGOPTIONS='-c gp_session_role=utility' psql postgres
10. Distributed plans
postgres=# explain select * from foo;
QUERY PLAN
-------------------------------------
Gather Motion 3:1 (slice1; segments: 3)
-> Seq Scan on foo
● Query plan is dispatched from QD to QE nodes
● Query results are returned from QEs via a Motion node
11. More complicated query
postgres=# explain select * from foo, bar where bar.c = foo.a;
QUERY PLAN
--------------------------------------------------------------
Gather Motion 3:1 (slice2; segments: 3)
-> Hash Join (cost=2.44..5.65 rows=2 width=12)
Hash Cond: bar.c = foo.a
-> Seq Scan on bar
-> Hash
-> Broadcast Motion 3:3 (slice1; segments: 3)
-> Seq Scan on foo
12. Slices
postgres=# explain select * from foo, bar where bar.c = foo.a;
QUERY PLAN
--------------------------------------------------------------
Gather Motion 3:1 (slice2; segments: 3)
-> Hash Join (cost=2.44..5.65 rows=2 width=12)
Hash Cond: bar.c = foo.a
-> Seq Scan on bar
-> Hash
-> Broadcast Motion 3:3 (slice1; segments: 3)
-> Seq Scan on foo
13. QD
QE segment 1
QE segment 3
QE segment 2
Process
QE backend process
(slice 2)
QE backend process
(slice 2)
QE backend process
(slice 1)
QE backend process
(slice 2)
QE backend process
(slice 1)
QE backend process
(slice 1)
QD backend
process
14. Interconnect
● Data flows between Motion Sender and
Receiver via the Interconnect
● UDP based protocol
17. Distributed planning
● Prepare a non-distributed plan
● Keep track of distribution keys at each step
● Add Motion nodes to make it distributed
● Different strategies to distribute aggregates
18. Data flows in one direction
● Nested loops with executor parameters cannot be dispatched
across query slices
○ Need to be materialized
○ The planner heavily discourages Nested loops
○ Hash joins are preferred
● Subplans with parameters need to use seqscans and must be fully
materialized
○ “Skip-level” subplans not supported
19. ORCA
● Alternative optimizer
● Written in C++
● Shared with Apache HAWQ
● https://github.com/greenplum-db/gporca
● Raw query is converted to XML, sent to ORCA, and converted back
● Slow startup time, better plans for some queries
20. Distributed transactions
● Every transaction uses two-phase commit
● QD acts as the transaction manager
● Distributed snapshots to ensure consistency
across nodes
22. Mirroring
● If one QE node goes down, no queries can be run:
postgres=# select * from foo;
ERROR: failed to acquire resources on one or more
segments
DETAIL: could not connect to server: Connection refused
Is the server running on host "127.0.0.1" and accepting
TCP/IP connections on port 40002?
(seg2 127.0.0.1:40002)
23. Mirroring
● Each node has a mirror node as backup
● GPDB 5:
○ Custom “file replication” between master and mirror QE node
● Revamped in GPDB 6:
○ Uses PostgreSQL WAL replication
25. Append-optimized Tables
● Alternative to normal heap tables
● CREATE TABLE … WITH (appendonly=true)
● Optimized for appending in large batches and sequential scans
● Compression
● Column- or row-oriented
26. Bitmap indexes
● More dense than B-tree for duplicates
● Good for columns with few distinct values
29. $ ls -l
drwxr-xr-x 5 heikki heikki 4096 Jun 30 11:37 concourse
drwxr-xr-x 2 heikki heikki 4096 Jun 30 11:37 config
-rwxr-xr-x 1 heikki heikki 528544 Jun 30 11:37 configure
-rw-r--r-- 1 heikki heikki 78636 Jun 30 11:37 configure.in
drwxr-xr-x 54 heikki heikki 4096 Jun 30 11:37 contrib
-rw-r--r-- 1 heikki heikki 1547 Jun 30 11:37 COPYRIGHT
drwxr-xr-x 3 heikki heikki 4096 Jun 30 11:37 doc
-rw-r--r-- 1 heikki heikki 8253 Jun 30 11:37 GNUmakefile.in
drwxr-xr-x 8 heikki heikki 4096 Jun 30 11:37 gpAux
drwxr-xr-x 4 heikki heikki 4096 Jun 30 11:37 gpdb-doc
drwxr-xr-x 7 heikki heikki 4096 Jun 30 11:37 gpMgmt
-rw-r--r-- 1 heikki heikki 11358 Jun 30 11:37 LICENSE
-rw-r--r-- 1 heikki heikki 1556 Jun 30 11:37 Makefile
-rw-r--r-- 1 heikki heikki 113093 Jun 30 11:37 NOTICE
-rw-r--r-- 1 heikki heikki 2051 Jun 30 11:37 README.amazon_linux
-rw-r--r-- 1 heikki heikki 1514 Jun 30 11:37 README.debian
-rw-r--r-- 1 heikki heikki 19812 Jun 30 11:37 README.md
-rw-r--r-- 1 heikki heikki 1284 Jun 30 11:37 README.PostgreSQL
drwxr-xr-x 14 heikki heikki 4096 Jun 30 11:37 src
30. Code layout
Same as PostgreSQL:
● src/, contrib/, configure
Greenplum-specific:
● gpMgmt/
● gpAux/
● concourse/
31. Catalog tables
● Extra catalog tables for GPDB functionality
○ pg_partition, pg_partitition_rule, gp_distribution_policy, …
● Extra columns in pg_proc, pg_class
○ a Perl script provides defaults for them at build time
32. Documentation
● doc/ contains the user manual in PostgreSQL
○ All but reference pages removed in Greenplum
● gpdb-doc/ contains the Greenplum manuals
34. Initialize a cluster (wrong)
~/gpdb.master$ bin/initdb -D data
The files belonging to this database system will be owned by user "heikki".
This user must also own the server process.
…
Success. You can now start the database server using:
bin/postgres -D data
or
bin/pg_ctl -D data -l logfile start
~/gpdb.master$ bin/postgres -D data
2017-06-30 08:50:05.440267
GMT,,,p16890,th-751686272,,,,0,,,seg-1,,,,,"FATAL","22023","dbid (from -b
option) is not specified or is invalid. This value must be >= 0, or >= -1 in
utility mode. The dbid value to pass can be determined from this server's
entry in the segment configuration; it may be -1 if running in utility
mode.",,,,,,,,"PostmasterMain","postmaster.c",1174,
35. Creating a cluster
● gpinitsystem – runs initdb on each node
● gpstart / gpstop – runs pg_ctl start/stop on each node
● For quick testing and hacking on your laptop:
make cluster
36. Regression tests
● Regression test suite in src/test/regress
● Core is the same as in PostgreSQL
● Greatly extended to cover Greenplum-specific features
● Needs a running cluster
make -C src/test/regress installcheck
37. gpdiff.pl
● In GPDB, rows can arrive from QE nodes in any order
● Post-processing step in installcheck masks out the row-order
differences
40. Yet more tests
● Pivotal runs a Concourse CI instance to run “make
installcheck-world”
● And some additional tests that need special setup
● Scripts in concourse/ directory in the repository