Mixing Analytic Workloads with Greenplum and Apache SparkVMware Tanzu
Apache Spark is a popular in-memory data analytics engine because of its speed, scalability, and ease of use. It also fits well with DevOps practices and cloud-native software platforms. It’s good for data exploration, interactive analytics, and streaming use cases.
However, Spark, like other data-processing platforms, is not one size fits all. Different versions of Spark support different feature sets, and Spark’s machine-learning libraries can also vary in important ways between versions, or may lack the right algorithm.
In this webinar, you’ll learn:
- How to integrate data warehouse workloads with Spark
- Which workloads are better for Greenplum and for Spark
- How to use the Greenplum-Spark connector
Presenter: Kong Yew Chan, Product Manager, Pivotal
Mixing Analytic Workloads with Greenplum and Apache SparkVMware Tanzu
Apache Spark is a popular in-memory data analytics engine because of its speed, scalability, and ease of use. It also fits well with DevOps practices and cloud-native software platforms. It’s good for data exploration, interactive analytics, and streaming use cases.
However, Spark, like other data-processing platforms, is not one size fits all. Different versions of Spark support different feature sets, and Spark’s machine-learning libraries can also vary in important ways between versions, or may lack the right algorithm.
In this webinar, you’ll learn:
- How to integrate data warehouse workloads with Spark
- Which workloads are better for Greenplum and for Spark
- How to use the Greenplum-Spark connector
Presenter: Kong Yew Chan, Product Manager, Pivotal
The Pivotal Greenplum-Spark Connector provides high speed, parallel data transfer between Greenplum Database and Apache Spark clusters to support:
- Interactive data analysis
- In-memory analytics processing
- Batch ETL
- Continuous ETL pipeline (streaming)
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities in YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's BigData system called Cosmos. This requires us to push the limits of Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress, lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft Engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
Change Data Streaming Patterns for Microservices With Debezium confluent
(Gunnar Morling, RedHat) Kafka Summit SF 2018
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/): secret sauce for change data capture (CDC) streaming changes from your datastore that enables you to solve multiple challenges: synchronizing data between microservices, gradually extracting microservices from existing monoliths, maintaining different read models in CQRS-style architectures, updating caches and full-text indexes and feeding operational data to your analytics tools
Join this session to learn what CDC is about, how it can be implemented using Debezium, an open source CDC solution based on Apache Kafka and how it can be utilized for your microservices. Find out how Debezium captures all the changes from datastores such as MySQL, PostgreSQL and MongoDB, how to react to the change events in near real time and how Debezium is designed to not compromise on data correctness and completeness also if things go wrong. In a live demo we’ll show how to set up a change data stream out of your application’s database without any code changes needed. You’ll see how to sink the change events into other databases and how to push data changes to your clients using WebSockets.
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email.
Sherlock has four major components: timeseries generation, EGADS anomaly detection, Redis backend and Spark Java UI. Timeseries generation involves building, validating, querying, parsing the Druid query. Parsed Druid response is then fed to EGADS anomaly detection component which detects and generates the anomaly reports for each input time-series data. Sherlock uses Redis backend to store jobs metadata, generated anomaly reports and persistent job queue for scheduling jobs, etc. Users can choose to have a clustered Redis or standalone Redis. Sherlock provides user interface built with Spark Java. The UI enables users to submit instant anomaly analysis, create, and launch detection jobs, view anomalies on a heatmap and on a graph. Jigarkumar Patel, Software Development Engineer I, Oath Inc. and, David Servose, Software Systems Engineer, Oath
PostgreSQL is versatile and used for a wide range of applications and use cases in the enterprise. It is more than just database technology, it is an accelerator for innovation. Much innovation today is happening in new application development, application modernization, and re-platforming to the cloud across the information architecture landscape. In this webinar, you will learn how EDB supercharges PostgreSQL to re-platform to cloud and containers more efficiently and develop new applications that are more scalable and secure.
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
In this talk we'll look at simple building-block techniques for predicting metrics over time based on past data, taking into account trend, seasonality and noise, using Python with Tensorflow.
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Apache AGE and the synergy effect in the combination of Postgres and NoSQLEDB
In this session, we will introduce the concept of Apache AGE and the synergy effect in the combination of Postgres and NoSQL (Graph Database). We shall discuss the story and background of Apache AGE as an open-source project and introduce challenges that AGE can solve for its users. Furthermore, we will talk about a graph database as an extension to PostgreSQL and how it can support all the functionalities and features of PostgreSQL and offers a graph model in addition. We will also discuss how users with a relational background and data model who are in need of having a graph model on top of their existing relational model, can use this extension with minimal effort because they can use existing data without migration to enable a graph database.
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Dataconomy Media
"With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of the data also makes it hard to incrementally test and retrain models in near real-time.
Learn how Apache Ignite and GridGain help to address limitations like ETL costs, scaling issues and Time-To-Market for the new models and help achieve near-real-time, continuous learning.
Yuriy Babak, the head of ML/DL framework development at GridGain and Apache Ignite committer, will explain how ML/DL work with Apache Ignite, and how to get started.
Topics include:
— Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
— Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
— Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
— How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference"
The Pivotal Greenplum-Spark Connector provides high speed, parallel data transfer between Greenplum Database and Apache Spark clusters to support:
- Interactive data analysis
- In-memory analytics processing
- Batch ETL
- Continuous ETL pipeline (streaming)
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities in YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's BigData system called Cosmos. This requires us to push the limits of Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress, lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft Engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
Change Data Streaming Patterns for Microservices With Debezium confluent
(Gunnar Morling, RedHat) Kafka Summit SF 2018
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/): secret sauce for change data capture (CDC) streaming changes from your datastore that enables you to solve multiple challenges: synchronizing data between microservices, gradually extracting microservices from existing monoliths, maintaining different read models in CQRS-style architectures, updating caches and full-text indexes and feeding operational data to your analytics tools
Join this session to learn what CDC is about, how it can be implemented using Debezium, an open source CDC solution based on Apache Kafka and how it can be utilized for your microservices. Find out how Debezium captures all the changes from datastores such as MySQL, PostgreSQL and MongoDB, how to react to the change events in near real time and how Debezium is designed to not compromise on data correctness and completeness also if things go wrong. In a live demo we’ll show how to set up a change data stream out of your application’s database without any code changes needed. You’ll see how to sink the change events into other databases and how to push data changes to your clients using WebSockets.
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
Sherlock is an anomaly detection service built on top of Druid. It leverages EGADS (Extensible Generic Anomaly Detection System; github.com/yahoo/egads) to detect anomalies in time-series data. Users can schedule jobs on an hourly, daily, weekly, or monthly basis, view anomaly reports from Sherlock's interface, or receive them via email.
Sherlock has four major components: timeseries generation, EGADS anomaly detection, Redis backend and Spark Java UI. Timeseries generation involves building, validating, querying, parsing the Druid query. Parsed Druid response is then fed to EGADS anomaly detection component which detects and generates the anomaly reports for each input time-series data. Sherlock uses Redis backend to store jobs metadata, generated anomaly reports and persistent job queue for scheduling jobs, etc. Users can choose to have a clustered Redis or standalone Redis. Sherlock provides user interface built with Spark Java. The UI enables users to submit instant anomaly analysis, create, and launch detection jobs, view anomalies on a heatmap and on a graph. Jigarkumar Patel, Software Development Engineer I, Oath Inc. and, David Servose, Software Systems Engineer, Oath
PostgreSQL is versatile and used for a wide range of applications and use cases in the enterprise. It is more than just database technology, it is an accelerator for innovation. Much innovation today is happening in new application development, application modernization, and re-platforming to the cloud across the information architecture landscape. In this webinar, you will learn how EDB supercharges PostgreSQL to re-platform to cloud and containers more efficiently and develop new applications that are more scalable and secure.
The columnar roadmap: Apache Parquet and Apache ArrowDataWorks Summit
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters to the storage layer, reducing time spent in IO reading from disk, as well as CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can becomes as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions as well as several future improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and will be usable in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate push downs, which will greatly simplify data access optimizations across the board.
Speaker
Julien Le Dem, Principal Engineer, WeWork
In this talk we'll look at simple building-block techniques for predicting metrics over time based on past data, taking into account trend, seasonality and noise, using Python with Tensorflow.
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Apache AGE and the synergy effect in the combination of Postgres and NoSQLEDB
In this session, we will introduce the concept of Apache AGE and the synergy effect in the combination of Postgres and NoSQL (Graph Database). We shall discuss the story and background of Apache AGE as an open-source project and introduce challenges that AGE can solve for its users. Furthermore, we will talk about a graph database as an extension to PostgreSQL and how it can support all the functionalities and features of PostgreSQL and offers a graph model in addition. We will also discuss how users with a relational background and data model who are in need of having a graph model on top of their existing relational model, can use this extension with minimal effort because they can use existing data without migration to enable a graph database.
Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with ...Dataconomy Media
"With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of the data also makes it hard to incrementally test and retrain models in near real-time.
Learn how Apache Ignite and GridGain help to address limitations like ETL costs, scaling issues and Time-To-Market for the new models and help achieve near-real-time, continuous learning.
Yuriy Babak, the head of ML/DL framework development at GridGain and Apache Ignite committer, will explain how ML/DL work with Apache Ignite, and how to get started.
Topics include:
— Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
— Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
— Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
— How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference"
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Deploying software and controlling infrastructure quickly and safely is a hard task.
In this talk, Brice Fernandes, Customer Success Engineer at Weaveworks, discusses GitOps, an operational model for Kubernetes and beyond to speed up development, while retaining extremely strong security guarantees. Brice describes and shows several open source tools developed at Weaveworks to support this approach. You will have a good idea of how to use the GitOps principles to create software pipelines that are fast, safe, and reproducible, while creating clear and high quality audit trails.
Check out the full presentation on YouTube: https://youtu.be/QdCwUUtcj4I
#GeodeSummit - Large Scale Fraud Detection using GemFire Integrated with Gree...PivotalOpenSourceHub
In this session we explore a case study of a large-scale government fraud detection program that prevents billions of dollars in fraudulent payments each year leveraging the beta release of the GemFire+Greenplum Connector, which is planned for release in GemFire 9. Topics will include an overview of the system architecture and a review of the new GemFire+Greenplum Connector features that simplify use cases requiring a blend of massively parallel database capabilities and accelerated in-memory data processing.
Big Data London v 11.0 I 'Distributed Machine & Deep Learning at Scale with A...Dataconomy Media
With most machine learning (ML) and deep learning (DL) frameworks, it can take hours to move data for ETL, and hours to train models. It's also hard to scale, with data sets increasingly being larger than the capacity of any single server. The amount of the data also makes it hard to incrementally test and retrain models in near real-time.
Learn how Apache Ignite and GridGain help to address limitations like ETL costs, scaling issues and Time-To-Market for the new models and help achieve near-real-time, continuous learning.
Yuriy Babak, the head of ML/DL framework development at GridGain and Apache Ignite committer, will explain how ML/DL work with Apache Ignite, and how to get started.
Topics include:
— Overview of distributed ML/DL including architecture, implementation, usage patterns, pros and cons
— Overview of Apache Ignite ML/DL, including built-in ML/DL algorithms, and how to implement your own
— Model inference with Apache Ignite, including how to train models with other libraries, like Apache Spark, and deploy them in Ignite
— How Apache Ignite and TensorFlow can be used together to build distributed DL model training and inference
Migrate your EOL MySQL servers to HA Complaint GR Cluster / InnoDB Cluster Wi...Mydbops
This talk focuses on the challenges and strategies to be followed when you're planning for a MySQL upgrade from the End of life versions of MySQL 5.5, 5.6, Even 5.7.
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
With advances in computer hardware such as 10 gigabit network cards, infiniband, and solid state drives all becoming commodity offerings, the new bottleneck in big data technologies is very commonly the processing power of the CPU. In order to meet the computational demand desired by users, enterprises have had to resort to extreme scale out approaches just to get the processing power they need. One of the most well known technologies in this space, Apache Spark, has numerous enterprises publicly talking about the challenges in running multiple 1000+ node clusters to give their users the processing power they need. This talk is based on work completed by NVIDIA’s Applied Solutions Engineering team. Attendees will learn how they were able to GPU-accelerate UDFs in PySpark using open source technologies such as Numba and PyGDF, the lessons they learned in the process, and how they were able to accelerate workloads in a fraction of the hardware footprint.
In this session, Helena and Christopher from Bergzeit describe two possible solutions to rebuild custom session level channel groups with GA4 raw data, using the dbt framework. The goal of the contained code is to bring the GA4 raw data as close as possible to the session level custom channel groups displayed in the GA4 UI.
If you see shortcomings in the custom channel groups in the GA4 UI, or if you are in need of session level channel groups in your GA4 raw data for reporting or other purposes this repository may be for you. If you use dbt you can use the documented code blocks with some minor adjustments. If you do not use dbt the code may still give you some inspiration on the process. The accompanying code repository can be found here: https://github.com/hellste/dbt_ga4_custom_channelgroups
Real-time Analytics with Trino and Apache PinotXiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Deploying MariaDB for HA on Google Cloud PlatformMariaDB plc
Google Cloud Platform (GCP) is a rising star in the world of cloud infrastructure. Of course there is Google CloudSQL, but sometimes you need more control over your databases. In this session, Matthias Crauwels of Pythian guides you through the process of deploying and managing MariaDB in the cloud on GCP.
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
Customer Data Platforms, commonly called CDPs, form an integral part of the marketing stack powering Zeotap's Adtech and Martech use-cases. The company offers a privacy-compliant CDP platform, and ScyllaDB is an integral part. Zeotap's CDP demands a mix of OLTP, OLAP, and real-time data ingestion, requiring a highly-performant store.
In this presentation, Shubham Patil, Lead Software Engineer, and Safal Pandita, Senior Software Engineer at Zeotap will share how ScyllaDB is powering their solution and why it's a great fit. They begin by describing their business use case and the challenges they were facing before moving to ScyllaDB. Then they cover their technical use-cases and requirements for real-time and batch data ingestions. They delve into our data access patterns and describe their data model supporting all use cases simultaneously for ingress/egress. They explain how they are using Scylla Migrator for our migration needs, then describe their multiregional, multi-tenant production setup for onboarding more than 130+ partners. Finally, they finish by sharing some of their learnings, performance benchmarks, and future plans.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Graphene is currently the most popular framework for building a GraphQL in Python and it’s also an obvious choice for adding a GraphQL layer to Django applications. Over the course of a year, we successfully built an API with about 50 queries and over 100 mutations on top of existing Django project (Saleor), but we also learned some hard lessons and had to overcome several shortcomings of the framework along the way.
In this talk, I’d like to share some practical tips to overcome the most common problems that a Django developer might face when building an optimized and maintainable API with Graphene, such as: - using useful abstractions to build queries and mutations faster - optimizing database queries in a graph - structuring a large Graphene project - unified error handling
I’d also like to bring up a few limitations of the framework that we discovered as we were working on the project and then end the talk with the most important benefits that adoption of GraphQL brings to modern web applications development - both for the backend and frontend.
Building Pinterest Real-Time Ads Platform Using Kafka Streams confluent
Building Pinterest Real-Time Ads Platform Using Kafka Streams (Liquan Pei + Boyang Chen, Pinterest) Kafka Summit SF 2018
In this talk, we are sharing the experience of building Pinterest’s real-time Ads Platform utilizing Kafka Streams. The real-time budgeting system is the most mission-critical component of the Ads Platform as it controls how each ad is delivered to maximize user, advertiser and Pinterest value. The system needs to handle over 50,000 queries per section (QPS) impressions, requires less than five seconds of end-to-end latency and recovers within five minutes during outages. It also needs to be scalable to handle the fast growth of Pinterest’s ads business.
The real-time budgeting system is composed of real-time stream-stream joiner, real-time spend aggregator and a spend predictor. At Pinterest’s scale, we need to overcome quite a few challenges to make each component work. For example, the stream-stream joiner needs to maintain terabyte size state while supporting fast recovery, and the real-time spend aggregator needs to publish to thousands of ads servers while supporting over one million read QPS. We choose Kafka Streams as it provides milliseconds latency guarantee, scalable event-based processing and easy-to-use APIs. In the process of building the system, we performed tons of tuning to RocksDB, Kafka Producer and Consumer, and pushed several open source contributions to Apache Kafka. We are also working on adding a remote checkpoint for Kafka Streams state to reduce the time of code start when adding more machines to the application. We believe that our experience can be beneficial to people who want to build real-time streaming solutions at large scale and deeply understand Kafka Streams.
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiKim Kao
Ian Tsai shared his past 3years developer journey at Linkedin. it was about migrate monolith into microservices 3 years ago, he faced so diffcult challenges and need to have effective tools to support the change.
Learn how to build advanced GraphQL queries, how to work with filters and patches and how to embed GraphQL in languages like Python and Java. These slides are the second set in our webinar series on GraphQL.
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is a central part to expanding our understand, and is a critical step to being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms only addressed a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports Property Graphs, Knowledge Graphs, Hyper-Graphs, Bipartite graphs, and the basic directed and undirected graph.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph. Discuss and show examples of how to manipulate and analyze bipartite and property graph, plus show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics. Then conclude with a preview of upcoming features, like graph query language support, and the general RAPIDS roadmap.
Gerrit Analytics applied to Android source codeLuca Milanesio
GerritForge trialled the Gerrit Analytics plugin and ETL with the Android Open-Source Project code-base. The results of the trial have been presented at the Gerrit User Summit 2019 at Gothenburg and Sunnyvale CA. Find inside an overview of the problems involved, the solutions implemented and also the use of the pull-replication plugin to fetch the code from the official Android repository.
The Answer to the NGS Data Analysis Challenges in Agricultural Biotechnology ...GENALICE
What are the key challenges in plant genomics? What efficient bioinformatics software technologies can be used to tackle them? The presentation introduces GENALICE MAP product characteristics, its quality of speed, an introduction to the revolutionary Population Calling Module and deployment options. For more information, please visit http://www.genalice.com/product/genalice-map/
Similar to Present & Future of Greenplum Database A massively parallel Postgres Database - Greenplum Summit 2019 (20)
The Tanzu Developer Connect is a hands-on workshop that dives deep into TAP. Attendees receive a hands on experience. This is a great program to leverage accounts with current TAP opportunities.
The Tanzu Developer Connect is a hands-on workshop that dives deep into TAP. Attendees receive a hands on experience. This is a great program to leverage accounts with current TAP opportunities.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Your Digital Assistant.
Making complex approach simple. Straightforward process saves time. No more waiting to connect with people that matter to you. Safety first is not a cliché - Securely protect information in cloud storage to prevent any third party from accessing data.
Would you rather make your visitors feel burdened by making them wait? Or choose VizMan for a stress-free experience? VizMan is an automated visitor management system that works for any industries not limited to factories, societies, government institutes, and warehouses. A new age contactless way of logging information of visitors, employees, packages, and vehicles. VizMan is a digital logbook so it deters unnecessary use of paper or space since there is no requirement of bundles of registers that is left to collect dust in a corner of a room. Visitor’s essential details, helps in scheduling meetings for visitors and employees, and assists in supervising the attendance of the employees. With VizMan, visitors don’t need to wait for hours in long queues. VizMan handles visitors with the value they deserve because we know time is important to you.
Feasible Features
One Subscription, Four Modules – Admin, Employee, Receptionist, and Gatekeeper ensures confidentiality and prevents data from being manipulated
User Friendly – can be easily used on Android, iOS, and Web Interface
Multiple Accessibility – Log in through any device from any place at any time
One app for all industries – a Visitor Management System that works for any organisation.
Stress-free Sign-up
Visitor is registered and checked-in by the Receptionist
Host gets a notification, where they opt to Approve the meeting
Host notifies the Receptionist of the end of the meeting
Visitor is checked-out by the Receptionist
Host enters notes and remarks of the meeting
Customizable Components
Scheduling Meetings – Host can invite visitors for meetings and also approve, reject and reschedule meetings
Single/Bulk invites – Invitations can be sent individually to a visitor or collectively to many visitors
VIP Visitors – Additional security of data for VIP visitors to avoid misuse of information
Courier Management – Keeps a check on deliveries like commodities being delivered in and out of establishments
Alerts & Notifications – Get notified on SMS, email, and application
Parking Management – Manage availability of parking space
Individual log-in – Every user has their own log-in id
Visitor/Meeting Analytics – Evaluate notes and remarks of the meeting stored in the system
Visitor Management System is a secure and user friendly database manager that records, filters, tracks the visitors to your organization.
"Secure Your Premises with VizMan (VMS) – Get It Now"
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
4. GPDB v5: Mission Critical Analytical Database
Platform
Well rounded and proven feature set:
● Proven in Mission Critical Use Cases
● ORCA Optimizer
● Resource Groups & PGBouncer for Concurrency
● In-Database Analytics
● External Data Federation Ecosystem
● Pivotal Greenplum Command Center 4.x
● Updated Backup and Migration Tooling
“Pivotal Greenplum is
often used in mission-
critical use cases,
where downtime is
not well-tolerated.”
-- Gartner MQ 2019
6. GPDB v6: Massive Postgres Power
What if Greenplum was a Superset and not a
subset of Postgres
● Postgres 9.4 merged
● WAL Replication
● Row Level Locking for Updates/Deletes
● Foreign Data Wrapper API
● PG Extensions: e.g. pgaudit
● Recursive CTE
● JSON, JSONB, FTS, GIN Index
“Customers
frequently called out
the open-source
alignment with
PostgreSQL as a
strong and cost-
effective positive”
-- Gartner MQ 2019
7. GPDB v6: OLTP Performance with Greenplum
Up to 50x Performance gain on pgbench in
early testing
● Greenplum has always been ACID with
Transaction semantics
● Many Analytical Systems Require a Mix of
Analytical and OLTP Queries
● Remove Table Lock on Updates & Deletes
● Distributed Deadlock Detector introduced
● Concurrent OLTP Operations allowed
“Customers
frequently called out
the open-source
alignment with
PostgreSQL as a
strong and cost-
effective positive”
-- Gartner MQ 2019
8. V6: Big Data Features
#ScaleMatters
● Online Expansion w/ Jump Consistent Hash
● Star-Schema DW with Replicated Tables
● Join Aggregrate Query Perf with Eager
Aggregation Optimizations
● zStandard compression
“Reference customers
for Pivotal praised the
overall performance
and scalability of
Pivotal Greenplum”
-- Gartner MQ 2019
9. GP v5 Expand Example
Distributed by Call ID
Detailed Call Records
Example
Call id 1
Call id 4
Call id 7
Call id 10
Call id 2
Call id 5
Call id 8
Call id 11
Call id 3
Call id 6
Call id 9
Call id 12
Call id 1
Call id 5
Call id 9
Call id 2
Call id 6
Call id 10
Call id 3
Call id 7
Call id 11
Call id 4
Call id 8
Call id 12
RESHUFFLE
ALL GPEXPAND
10. GP v6 Online Expand w/ Jump Consistent Hash
Distributed by Call ID
Detailed Call Records
Example
Call id 1
Call id 4
Call id 7
Call id 10
Call id 2
Call id 5
Call id 8
Call id 11
Call id 3
Call id 6
Call id 9
Call id 12
Call id 1
Call id 4
Call id 7
Call id 2
Call id 5
Call id 8
Call id 3
Call id 6
Call id 9
Call id 10
Call id 11
Call id 12
MINIMAL DATA
MOVEMENT
GPEXPAND
12. Eager-Agg Optimization in GPDB v6
create table foo (j1 int, g1 int, s1 int);
insert into foo select i%10000, i %1000, i from generate_series(1,100000000) i;
● 10,000 unique grouping columns
● 1000 unique join columns
create table bar (j2 int, g2 int, s2 int);
insert into bar select i%100, i %1000, i from generate_series(1,100000) i;
● 1000 unique grouping columns
● 100 unique join columns
Query:
select sum(s1)
from foo, bar
where j1 = j2 and s1%2 = 0
group by g1, g2;
Greenplum v5 63.8 seconds
Greenplum v6 7.4 seconds
~
9X
Im
provem
ent
13. Aggregate Queries over Join GPDB v5
Find the loss per line item for
all returned items
Join the line items to the
orders
Group them by store and
compute the aggregate loss
Straightforward translation of
the query into the query plan
If each order has a large
number of line items, the join
results can be quite large and
expensiveLINEITEM ORDERS
! L_LOSS:
L_EXTENDEDPRICE * (I-
L_DISCOUNT)
⨝ (L_ORDERKEY =
O_ORDERKEY)
#O_STORE (SUM(L_LOSS))
σ (L_RETURNFLAG = “R”)
14. Eager Agg Optimization GPDB v6
● Find the loss of revenue
for each order
● Join the aggregated
view with table ORDERS
● Compute the total loss
for each store
● Benefit: Inner group-by
reduces the number of
row to the join
[Yan95] W. P. Yan and P.
Larson, "Eager Aggregation
and Lazy Aggregation",
VLDB 1995
LINEITEM ORDERS
! L_LOSS:
L_EXTENDEDPRICE * (I-
L_DISCOUNT)
⨝ (L_ORDERKEY =
O_ORDERKEY)
# O_STORE
SUM(L_ORDERLOSS)
σ (L_RETURNFLAG = “R”)
# L_ORDERKEY
L_ORDERLOSS:
SUM(L_LOSS)
15. GPDB v6 zStd Compression
Same or more for less
● Open Source
● Lower CPU Cycles with same or better compression
● Originated at Facebook
CREATE TABLE call_data_records(callid int4, calldetails json)
WITH (appendonly=true, compresstype=zstd, orientation=column)
DISTRIBUTED BY (callid);
17. Containerized Greenplum w/ GPDB v6● GP embedded in
containers for
portability and
dependency
management
● Containers
managed by
Kubernetes for
higher availability
and elasticity
● Kubernetes
operator used for
automation
Container
Operator AUTOMATION
AUTOMATION
AUTOMATION
pod pod
19. GPDB v7: Beyond the Cluster
We have all this Postgres infrastructure in
GPDB v6 now lets use it
● Postgres 9.6 target
● DB Snapshots / Backup
● Streaming Replication
● Log Shipping and Reconciliation
● Greenplum as a source for Kafka
● Greenplum as a source for CDC Tools
● Greenplum to Greenplum Inter Cluster Queries
“You do this and you
can beat Oracle”
-- US Federal
Customer, 2018
20. GPDB v7: Thought Leadership in Database AI
Define Artificial Intelligence. Does it make
sense to integrate intelligence into an
analytical platform?
● 2019 Apache Madlib is focused on Deep Learning
and GPU processing
● 2019 Pivotal’s GPText Solution will add more
cognitive intelligence of human language
● Combine with existing functions: PostGIS
Geospatial; Apache Madlib Machine Learning &
Graph; Python, R libraries, SQL at scale
● This is a platform for modern AI!
“With the Apache MADlib
analytics libraries, Pivotal
Greenplum has capable
in-database analytics that
allow for predictive
modeling and ML to be
applied to relational
data.” -- Gartner MQ 2019