Presto User Group Singapore Meetup - March 2019.
This presentation covers Grab's deployment of Presto and walks through Grab's journey with Presto in the cloud.
Enabling Presto to handle massive scale at lightning speed (Shubham Tagra)
Presto User Group Singapore Meetup - March 2019.
These slides cover the current state of Presto, features that help Presto work better in the cloud, and a glimpse of the roadmap.
Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019 (Shubham Tagra)
Strata SF 2019 presentation about Presto's limitations in leveraging spot nodes, Qubole's features for using spot nodes reliably in Presto, and a case study on the efficacy of the solution.
Hoodie: How (And Why) We built an analytical datastore on Spark (Vinoth Chandar)
Exploring a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Then, discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds (PGConf APAC)
Speaker: Oskari Saarenmaa
Aiven PostgreSQL is available in five different public cloud providers' infrastructure in more than 60 regions around the world, including 18 in APAC. This has given us a unique opportunity to benchmark and compare performance of similar configurations in different environments.
We'll share our benchmark methods and results, comparing various PostgreSQL configurations and workloads across different clouds.
Sizing a database cluster makes or breaks your application. Too small and you may be unable to sustain spikes in usage or recover from a node loss or an operational slowdown. Too big and your cluster will cost more and waste valuable human resources. Since different workloads have different requirements, successful sizing of your application should be optimized for both throughput and latency performance. However, in many cases, these requirements contradict each other. In this talk, we will explain how to reconcile the contradicting forces and build a sustainable cluster to meet both performance and resiliency requirements.
Presto @ Treasure Data - Presto Meetup Boston 2015 (Taro L. Saito)
Treasure Data simplifies event analytics for the complex digital world. Our customers send us 1,000,000 events per second and issue 30,000+ Presto queries every day to understand their customers better. One of the challenges is designing a cloud database with zero downtime to support a global customer base. We have achieved this goal by developing several open-source technologies: Fluentd and Embulk enable seamless log collection from stream/batch sources, and with MessagePack we can provide an extensible columnar store that accommodates future schema changes. Finally, Presto allows us to serve the wide variety of data processing our customers perform on our service. In this talk, I will present an overview of our system, and how our customers keep using Presto while collecting and extending their data set.
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
Using ScyllaDB with JanusGraph for Cyber Security (ScyllaDB)
Come hear how QOMPLX, a leader in Cyber Security Risk Management solutions uses ScyllaDB and JanusGraph to detect, manage and assess risks for large corporate and government clients. By leveraging two highly horizontally scalable and fault tolerant technologies, QOMPLX can flex with their clients' needs.
Presented at Stream Processing Meetup (7/19/2018)(https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/).
At Uber, we operate 20+ Kafka clusters to collect system and application logs as well as event data from rider and driver apps. We need a Kafka replication solution to replicate data between Kafka clusters across multiple data centers for different purposes. This talk will introduce the history behind uReplicator and its high-level architecture. As the original uReplicator ran into scalability challenges and operational overhead as the scale of Kafka clusters increased, we built the Federated uReplicator, which addresses these issues and provides an extensible architecture for further scaling.
What is stream processing? The evolution of streaming SQL, its advantages and challenges, and how we can overcome them. Presented at WSO2 Con 2018 USA.
___________________________________________
Meetup#7 | Session 2 | 21/03/2018 | Taboola
_____________________________________________
In this talk, we will present our multi-DC Kafka architecture, and discuss how we tackle sending and handling 10B+ messages per day, with maximum availability and no tolerance for data loss.
Our architecture includes technologies such as Cassandra, Spark, HDFS, and Vertica - with Kafka as the backbone that feeds them all.
Scaling up Uber's real time data analytics (Xiang Fu)
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
How we have used Ansible for real-time industry use cases and integration with enterprise tools: infrastructure provisioning, configuration management, and automation of routine tasks.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month (Nicolas Brousse)
TubeMogul grew from a few servers to over two thousand servers, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with the fast growth, the SRE team had to implement an efficient Continuous Delivery infrastructure that enabled over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Stitch Fix aspires to help you find the style that you will love. Data, the backbone of the business, is used to help with styling recommendations, demand modeling, user acquisition, and merchandise planning and also to influence business decisions throughout the organization. These decisions are backed by algorithms and data collected and interpreted based on client preferences. This talk offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way.
Apache Spark plays an important role in Stitch Fix’s data platform, and the company’s data scientists use Spark for their ETL and Presto for their ad hoc queries. The goal for the team running the compute infrastructure is to understand and make the data scientists’ lives easier, particularly in terms of usability of Spark, by building tools that make it easier to get started with Spark and transition themselves to a daily workflow. The compute infrastructure is a part of the data platform that is responsible for all the needs of data scientists at Stitch Fix.
In this talk, we look at Stitch Fix’s journey, exploring its Spark setup, in-house tools and how they work in synergy with open source frameworks in a cloud environment. There are additional improvements to the infrastructure that help persist information for future use and optimization and we look at how the implementation of Amazon’s EMR FS has helped make it easier for us to read from the S3 source.
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013) (Gabriele Bartolini)
Migrating an Oracle database to Postgres is never an automated operation, and it rarely (never?) involves just the database. Experience led us to develop an agile methodology for the migration process, covering schema migration, data import, and migration of procedures and queries, up to the generation of unit tests for QA.
Pitfalls, technologies and main migration opportunities will be outlined, focusing on the reduction of total costs of ownership and management of a database solution in the middle-long term (without reducing quality and business continuity requirements).
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Most database products have their own auditing functionalities or plugins but they always involve overhead which means they end up having them turned off or with the bare minimum enabled.
In this workshop we will show how to get reliable logging for MySQL and MongoDB servers in a scalable and non-intrusive way, its drawbacks, and how we can build our own open source tools to achieve results similar to most commercial products.
Tools to sniff, process and act upon queries will be shared, and we will show how simple it is to set up and monitor a database environment so it can be replicated and grow horizontally. All the code needed will be published.
Function Mesh for Apache Pulsar, the Way for Simple Streaming Solutions (StreamNative)
Pulsar Function is a succinct computing abstraction that Apache Pulsar provides to express simple ETL and streaming tasks. The simplicity is twofold: a simple interface and simple deployment. As it has been adopted, we realized that the ability to run natively in the cloud and to integrate multiple functions into one whole is key to user success. We developed this new feature, Function Mesh, to support these new requirements.
This talk aims to provide a thorough walkthrough of this new Function Mesh Feature, including its design, implementation, use cases, and examples, to help people seeking simple streaming solutions understand this newly created powerful tool in Apache Pulsar.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is an Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platforms and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs.
Similar to Journey and evolution of Presto@Grab (20)
RubiX: A caching framework for big data engines in the cloud. Helps provide data caching capabilities to engines like Presto, Spark, Hadoop, etc transparently without user intervention.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL) (MdTanvirMahtab2)
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the use of Advanced Process Control at the wastewater treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two types of water scarcity: physical and economic.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx (R&R Consult)
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Student information management system project report ii.pdf (Kamal Acharya)
Our project covers student management, in particular the various actions related to student details. It makes adding, editing and deleting student details easier, and provides a less time-consuming process for viewing, adding, editing and deleting students' marks.
12. Datagateway
● Single entry point for all queries
● Authentication and Authorisation service
● Grant access to schemas, tables and clusters
● Integrated with API endpoints of Presto
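The access checks listed on this slide can be pictured with a small sketch. It is only an illustration: the slides do not describe Datagateway's data model, so the `Grants` class, the `is_allowed` function, and the cluster and table names below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Grants:
    """Hypothetical per-user grants; not the actual Datagateway model."""
    clusters: set = field(default_factory=set)
    schemas: set = field(default_factory=set)
    tables: set = field(default_factory=set)  # fully qualified "schema.table"


def is_allowed(grants, cluster, tables):
    """Return True if the user may run a query touching `tables` on `cluster`."""
    if cluster not in grants.clusters:
        return False
    for name in tables:
        schema = name.split(".", 1)[0]
        # A schema-level grant covers every table in that schema.
        if schema not in grants.schemas and name not in grants.tables:
            return False
    return True


# Made-up user with one cluster, one schema grant and one table grant.
analyst = Grants(
    clusters={"adhoc"},
    schemas={"public"},
    tables={"finance.daily_summary"},
)
print(is_allowed(analyst, "adhoc", ["public.trips", "finance.daily_summary"]))
print(is_allowed(analyst, "adhoc", ["finance.payments"]))
print(is_allowed(analyst, "etl", ["public.trips"]))
```

A gateway placed in front of Presto would run a check of this kind before forwarding the query to the chosen cluster's API endpoint.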
13. Issues on EMR
● Configuration Changes
● Cluster Administration
16. Benchmarking
Two types of tests:
● Functional Tests
● Performance Tests
Within Performance tests, we simulated the cluster workloads to check the performance improvements.
18. Resource Groups
● Initially used the queue configs to apply a concurrency limit based on the query source
● Currently enforce concurrency with resource groups, along with separate resource allocation for exploratory queries
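As a rough illustration of the resource-group setup described above, the sketch below builds a file-based resource groups definition and checks that every selector points at a defined group. The group names, limits and source pattern are made up, and the field names (`hardConcurrencyLimit`, `softMemoryLimit`, `maxQueued`, `subGroups`, `selectors`) follow the Presto resource groups documentation as I understand it; older releases used `maxRunning` for the concurrency limit, so check the version in use.

```python
import json

# Hypothetical layout: one root group with a small slice reserved for
# exploratory queries and a larger one for everything else.
resource_groups = {
    "rootGroups": [
        {
            "name": "global",
            "softMemoryLimit": "80%",
            "hardConcurrencyLimit": 100,
            "maxQueued": 1000,
            "subGroups": [
                {
                    "name": "exploratory",
                    "softMemoryLimit": "20%",
                    "hardConcurrencyLimit": 10,
                    "maxQueued": 100,
                },
                {
                    "name": "scheduled",
                    "softMemoryLimit": "60%",
                    "hardConcurrencyLimit": 40,
                    "maxQueued": 500,
                },
            ],
        }
    ],
    "selectors": [
        # Route by query source (regex), as the slide describes.
        {"source": "(?i).*notebook.*", "group": "global.exploratory"},
        # Catch-all selector for every other source.
        {"group": "global.scheduled"},
    ],
}


def group_paths(groups, prefix=""):
    """Collect dotted paths of all groups, e.g. 'global.exploratory'."""
    paths = set()
    for group in groups:
        path = prefix + group["name"]
        paths.add(path)
        paths |= group_paths(group.get("subGroups", []), path + ".")
    return paths


defined = group_paths(resource_groups["rootGroups"])
dangling = [s["group"] for s in resource_groups["selectors"] if s["group"] not in defined]
print(json.dumps(resource_groups, indent=2))
print("dangling selectors:", dangling)
```

Selectors are evaluated in order, so the catch-all entry goes last.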
19. Task Writers
● Number of concurrent writer threads per query per worker
● Default: 1
● Current: 8
20. Client Timeout
● Duration after which a query times out if the client has not polled it
● The query is then regarded as Abandoned
● Default: 2 mins
● Current: 10 mins (30 min)
21. Query History
● Query history retained by the Presto UI for lookback
● Default: 100
● Current: 900 (90k)
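The three tuning slides above (Task Writers, Client Timeout, Query History) could translate into a `config.properties` fragment along these lines. The slides do not name the properties; `task.writer-count`, `query.client.timeout` and `query.max-history` are my assumed mapping to Presto's configuration properties and should be verified against the deployed version. Only the first "Current" value from each slide is used; the values in parentheses are left out because the slides do not explain them.

```python
# Assumed mapping from slide titles to Presto property names.
settings = {
    "task.writer-count": "8",        # Task Writers: default 1, current 8
    "query.client.timeout": "10m",   # Client Timeout: current 10 mins
    "query.max-history": "900",      # Query History: default 100, current 900
}


def render_properties(props):
    """Render a dict as sorted key=value lines in .properties style."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


fragment = render_properties(settings)
print(fragment)
```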
22. Event Listener
● Supports custom plugins which are invoked on
○ Query Creation
○ Query Completion
○ Split completion
● Enables us to log every query submitted to the cluster along with important metrics, which are used to fine-tune the config further
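Presto's event listener interface is a Java plugin SPI, so the following Python is only a language-neutral sketch of the callback pattern on this slide, not the SPI itself. The class, method names and recorded fields are hypothetical.

```python
from collections import defaultdict


class QueryLogListener:
    """Hypothetical listener with one callback per event named on the slide."""

    def __init__(self):
        # query_id -> accumulated record for that query
        self.records = defaultdict(dict)

    def query_created(self, query_id, user, sql):
        record = self.records[query_id]
        record.update(user=user, sql=sql)
        record.setdefault("splits", 0)

    def split_completed(self, query_id):
        record = self.records[query_id]
        record["splits"] = record.get("splits", 0) + 1

    def query_completed(self, query_id, wall_ms, peak_memory_bytes):
        self.records[query_id].update(
            wall_ms=wall_ms, peak_memory_bytes=peak_memory_bytes
        )


# Simulate the engine invoking the callbacks for one query.
listener = QueryLogListener()
listener.query_created("q1", "alice", "SELECT 1")
listener.split_completed("q1")
listener.split_completed("q1")
listener.query_completed("q1", wall_ms=120, peak_memory_bytes=1024)
print(dict(listener.records["q1"]))
```

In a real deployment the completed record would be shipped to a log store rather than kept in memory, which is what makes later tuning and analysis possible.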
24. Presto Test Suite
● Internal platform used to analyse query performance
● Enables us to quantify the impact of various config changes
● Simulate the cluster workload
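One way to quantify the impact of a config change, as described above, is to replay the same queries before and after the change and compare median latencies. The sketch below assumes that approach; the slides do not say how the internal test suite computes its numbers, and the query names and timings are invented for illustration.

```python
from statistics import median


def compare_runs(baseline_ms, candidate_ms):
    """Per-query percentage change in median latency (negative means faster)."""
    report = {}
    for query, base_samples in baseline_ms.items():
        if query not in candidate_ms:
            continue  # query was not replayed under the new config
        base = median(base_samples)
        cand = median(candidate_ms[query])
        report[query] = round((cand - base) / base * 100, 1)
    return report


# Illustrative latencies in milliseconds, three runs per query.
baseline = {"q1": [1000, 1200, 1100], "q2": [400, 420, 410]}
candidate = {"q1": [800, 900, 880], "q2": [410, 430, 451]}

report = compare_runs(baseline, candidate)
print(report)
```

The median is used rather than the mean so that a single slow run does not dominate the comparison.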
25. Query Analysis
● Helps us identify table usage pattern
● Track table lineages, join patterns
● Recreate the query plan to understand the query execution
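Table usage and join patterns of the kind listed above can be approximated from logged query text. This is a deliberately naive sketch using a regular expression rather than a SQL parser, so it will miss or misread some queries (for example `EXTRACT(day FROM ts)`); the slides do not describe the actual analysis method, and the table names are made up.

```python
import re
from collections import Counter

# Naive pattern: an identifier following FROM or JOIN is treated as a table.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Distinct table names referenced in one query, lower-cased and sorted."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql)})


def table_usage(queries):
    """Count how many queries reference each table."""
    usage = Counter()
    for sql in queries:
        usage.update(tables_in(sql))
    return usage


def join_pairs(queries):
    """Count how often two tables appear together in the same query."""
    pairs = Counter()
    for sql in queries:
        tables = tables_in(sql)
        for i, left in enumerate(tables):
            for right in tables[i + 1:]:
                pairs[(left, right)] += 1
    return pairs


queries = [
    "SELECT * FROM rides.bookings b JOIN rides.drivers d ON b.driver_id = d.id",
    "select count(*) from rides.bookings",
]
usage = table_usage(queries)
pairs = join_pairs(queries)
print(usage.most_common())
print(pairs.most_common())
```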