Presto: Distributed sql query engine

•Download as PPTX, PDF•

14 likes•9,568 views

Presto is an open source distributed SQL query engine that allows querying large datasets ranging from gigabytes to petabytes faster and more interactively. It employs a custom query execution engine with pipelined operators designed for SQL semantics, avoiding unnecessary I/O and latency overhead. The Presto coordinator parses, analyzes, and plans queries, assigning work to nodes closest to data and monitoring progress, while clients pull results from output stages. Presto developers claim it is 10x better than Hive/MapReduce for most queries in terms of efficiency and latency.

Technology

Problem to solve
 Huge production of data.
 As data is growing enormously to the point of peta bytes
, querying the database has become a big issue.
 So we should be able to run more interactive queries and get
results faster .

Introduction
 Presto is a open source distributed sql query engine.
 For running queries against of all sizes ranging from
gigabytes to petabytes .
 It supports ANSI SQL ,including complex
queries,aggresgations,joins and window functions .
 It is implemented in java.

Architecture Explanation
 Client sends sql to presto coordinator.
 Coordinator parses ,analyzes and plans the query execution.
 The scheduler wires together the execution pipeline ,assigns
work to nodes closest to data and monitors the progress.
 The client pulls the data from output stage which in turn pulls
data from underlying stages.

Hive/Mapreduce Execution model
 Hive translates queries into multiple stage of mapreduce
tasks and execute them one after the other.
 Each task reads input from disk and writes intermediate
output back to disk.

Presto Execution
 Presto engine does not use Mapreduce.
 It employs a custom query and execution engine with
operators designed to support sql semantics.
 Processing is in memory and pipelined across the network
between stages which avoids unnecessary I/O and
associated latency overhead.
 Pipelined execution model runs multiple stages at once and
streams data from one stage to next as it becomes available
which reduces end-to-end latency

Note
 Presto dynamically compiles certain portions of query plan to
byte code which lets JVM optimize and generate native
machine code.

Extensibility
 Presto was designed with a simple storage abstraction that
makes its easy to provide sql query capability against
disparate data sources.
 Connectors only need to provide interfaces for fetching meta
data, getting data locations and accessing data itself.

Limitations
 Size limitation on the join tables and cardinality of unique
groups.
 Lacks the ability to write output back to tables. Currently
query results are streamed to client.

Presto developers claim:
 Presto is 10x better than hive/Mapreduce in terms of cpu
efficiency and latency for most queries.
 Supports ANSI sql, including joins, left/right outer
joins,subqueries,most of the common aggregate and scalar
functions, including approximate distinct counts,
approximate percentiles

What's hot

Apache Spark Fundamentals

Zahra Eskandari

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016

Databricks

Change Data Capture CDC is a typical use case in Real-Time Data Warehousing. It tracks the data change log -binlog- of a relational database [OLTP], and replay these change log timely to an external storage to do Real-Time OLAP, such as delta/kudu. To implement a robust CDC streaming pipeline, lots of factors should be concerned, such as how to ensure data accuracy , how to process OLTP source schema changed, whether it is easy to build for variety databases with less code.

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Databricks

Apache Iceberg - A Table Format for Hige Analytic Datasets

Alluxio, Inc.

Spark shuffle introduction

colorant

Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB

YugabyteDB

Distributed SQL Databases Deconstructed

Yugabyte

Introducing DataFrames in Spark for Large Scale Data Science

Databricks

Redis vs Infinispan | DevNation Tech Talk

Red Hat Developers

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

Deep Dive: Memory Management in Apache Spark

Databricks

The columnar roadmap: Apache Parquet and Apache Arrow

Julien Le Dem

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

Change Data Feed is a new feature of Delta Lake on Databricks that is available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads such as incremental table/view maintenance and change auditing that were not possible before. In short, users will now be able to query row level changes across different versions of a Delta table. In this talk we will dive into how Change Data Feed works under the hood and how to use it with existing ETL jobs to make them more efficient and also go over some new workloads it can enable.

Change Data Feed in Delta

Databricks

The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general. This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.

The Parquet Format and Performance Optimization Opportunities

Databricks

Airflow Best Practises & Roadmap to Airflow 2.0

Kaxil Naik

In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution. The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.

Physical Plans in Spark SQL

Databricks

Real-time Analytics with Trino and Apache Pinot

Xiang Fu

Spark rdd vs data frame vs dataset

Ankit Beohar

Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Databricks

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including: * All reads use snapshot isolation without locking. * No directory listings are required for query planning. * Files can be added, removed, or replaced atomically. * Full schema evolution supports changes in the table over time. * Partitioning evolution enables changes to the physical layout without breaking existing queries. * Data files are stored as Avro, ORC, or Parquet. * Support for Spark, Pig, and Presto.

Iceberg: A modern table format for big data (Strata NY 2018)

Ryan Blue

What's hot (20)

Apache Spark Fundamentals

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake

Apache Iceberg - A Table Format for Hige Analytic Datasets

Spark shuffle introduction

Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB

Distributed SQL Databases Deconstructed

Introducing DataFrames in Spark for Large Scale Data Science

Redis vs Infinispan | DevNation Tech Talk

Deep Dive: Memory Management in Apache Spark

The columnar roadmap: Apache Parquet and Apache Arrow

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Change Data Feed in Delta

The Parquet Format and Performance Optimization Opportunities

Airflow Best Practises & Roadmap to Airflow 2.0

Physical Plans in Spark SQL

Real-time Analytics with Trino and Apache Pinot

Spark rdd vs data frame vs dataset

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Iceberg: A modern table format for big data (Strata NY 2018)

Viewers also liked

Facebook Presto presentation

Cyanny LIANG

Teradata joined the Presto community in 2015 and is now a leading contributor to this open source SQL engine, originally created by Facebook. The project has a rapidly growing community of users, including Airbnb, FINRA, Netflix, Twitter, and Uber. Kamil Bajda-Pawlikowski explores the key architectural components that allow querying variety of data sources and make Presto uniquely position to be applied in both Hadoop and Cloud use cases. Along the way, Kamil covers Teradata’s recent enhancements in query performance, security integrations, and ANSI SQL coverage and shares the roadmap for 2017 and beyond.

Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA

kbajda

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. One key feature in Presto is the ability to query data where it lives via a uniform ANSI SQL interface. Presto’s connector architecture creates an abstraction layer for anything that can be expressed in a row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams and even proprietary data stores. Furthermore, a single Presto query can combine data from multiple sources, allowing for analytics across your entire organization. This talk will be co-presented by Facebook and Teradata, the two largest contributors to Presto. The talk will focus on Presto’s ability to query virtually any data source via it’s connector interface. Facebook and Teradata will present some of their use cases of Presto querying various data sources, discuss the existing connectors in Presto, and describe the anatomy of a connector.

Presto: SQL-on-anything

DataWorks Summit

Presto, an open source distributed SQL engine originally built at Facebook, has a rapidly growing community of developers and users. In this talk, speakers from both Facebook and Teradata, will discuss technical details of some of the recent developments such as integration with Hadoop ecosystem (YARN/Slider and Ambari), security features (Kerberos), enabling BI tools via JDBC/ODBC drivers, new connectors (Redis, MongoDB) and storage engines (Raptor) as well as improvements in performance and ANSI SQL coverage. In addition, we will present a few use cases and major new users that leverage interactive SQL capabilities Presto offers. Finally, we will present our roadmap for the next year. See the video at https://youtu.be/wMy3LXuTb0U

Presto at Hadoop Summit 2016

kbajda

Presto @ Facebook: Past, Present and Future

DataWorks Summit

One of the key differences between Presto and Hive, also a crucial functional requirement Facebook made when launching this new SQL engine project, was to have the opportunity to query different kinds of data sources via a uniform ANSI SQL interface. Presto, an open source distributed analytical SQL engine, implements this with it’s connector architecture, creating an abstraction layer for anything that can be expressed as in a row-like format, ranging from MySQL tables, HDFS, Amazon S3 to NoSQL stores, Kafka streams and proprietary data sources. Presto connector SPI allows anyone to implement a Presto connector and benefit from the capabilities of the Presto SQL engine, enabling them to join data from various sources within a single SQL query.

Presto - SQL on anything

Grzegorz Kokosiński

How to ensure Presto scalability  in multi use case

Kai Sasaki

Optimizing Presto Connector on Cloud Storage

Kai Sasaki

Hive, Presto, and Spark on TPC-DS benchmark

Dongwon Kim

Viewers also liked (9)

Facebook Presto presentation

Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA

Presto: SQL-on-anything

Presto at Hadoop Summit 2016

Presto @ Facebook: Past, Present and Future

Presto - SQL on anything

How to ensure Presto scalability  in multi use case

Optimizing Presto Connector on Cloud Storage

Hive, Presto, and Spark on TPC-DS benchmark

Similar to Presto: Distributed sql query engine

Real time analytics at uber @ strata data 2019

Zhenxiao Luo

Jack Gudenkauf sparkug_20151207_7

Jack Gudenkauf

ETL, ELT and Lambda architectures have evolved into a [non]Streaming general purpose data ingestion pipeline, that is scalable through distributed processing, for Big Data Analytics over hybrid Data Warehouses in Hadoop and MPP Columnar stores like HPE-Vertica. Bio: Jack Gudenkauf (https://www.linkedin.com/in/jackglinkedin) has over twenty-nine years of experience designing and implementing Internet scale distributed systems. Jack is currently the CEO & Founder of the startup BigDataInfra. He was previously; VP of Big Data at Playtika, a hands-on manager of the Twitter Analytics Data Warehouse team, spent 15 years at Microsoft shipping 15 products, and prior to Microsoft he managed his own consulting company after he began his career as an MIS Director of several startup companies.

A noETL Parallel Streaming Transformation Loader using Spark, Kafka & Vertica

Data Con LA

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure

Luan Moreno Medeiros Maciel

Real time analytics on deep learning @ strata data 2019

Zhenxiao Luo

How Java 19 Influences the Future of Your High-Scale Applications .pdf

Ana-Maria Mihalceanu

ChakraCore - JSConf Last Call

Gaurav Seth

Webinar september 2013

Marc Gille

One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications. In this talk, Nan and Arijit will summarize their experiences learned from serving the real-time Spark-based data analytic solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure EventHubs which is a hyper-scale telemetry ingestion service enabling users to ingress massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics. They’ll will cover three topics: bridging the gap of data communication model in Spark and data source, accommodating Spark to rate control and message addressing of data source, and the co-design of fault tolerance Mechanisms. This talk will share the insights on how to build continuous data applications with Spark and boost more availabilities of connectors for Spark and different real-time data sources.

Building Continuous Application with Structured Streaming and Real-Time Data ...

Databricks

Presentation-QRUA

Ashwini Sarode

In this deck from the Stanford HPC Conference, Ryan Quick from Providentia Worldwide describes how DNNs can be used to improve EDA simulation runs. "Systems Intelligence relies on a variety of methods for providing insight into the core mechanisms for driving automated behavioral changes in self-healing command and control platforms. This talk reports on initial efforts with leveraging Semiconductor Electronic Design Automation (EDA) telemetry data from cross-domain sources including power, network, storage, nodes, and applications in neural networks as a driving method for insight into SI automation systems." Watch the video: https://youtu.be/2WbR8tq-XbM Learn more: http://www.providentiaworldwide.com/ and http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/ Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter

HPC Impact: EDA Telemetry Neural Networks

inside-BigData.com

Presto: Query Anything - Data Engineer’s perspective

Alluxio, Inc.

Building an open source high performance data analytics platform

supun06

Asko Oja Moskva Architecture Highload

Ontico

Understanding the Single Thread Event Loop

TorontoNodeJS

Zeppelin at Twitter

Prasad Wagle

Web application

RajivKumarSingh27

Edge computing and the Internet of Things bring great promise, but often just getting data from the edge requires moving mountains. Let's learn how to make edge data ingestion and analytics easier using StreamSets Data Collector edge, an ultralight, platform independent and small-footprint Open Source solution written in Go for streaming data from resource-constrained sensors and personal devices (like medical equipment or smartphones) to Apache Kafka, Amazon Kinesis and many others. This talk includes an overview of the SDC Edge main features, supported protocols and available processors for data transformation, insights on how it solves some challenges of traditional approaches to data ingestion, pipeline design basics, a walk-through some practical applications (Android devices and Raspberry Pi) and its integration with other technologies such as Streamsets Data Collector, Apache Kafka, Apache Hadoop, InfluxDB and Grafana. The goal here is to make attendees ready to quickly become IoT data intake and SDC Edge Ninjas. Speaker Guglielmo Iozzia, Big Data Delivery Manager, Optum (United Health)

Ultralight Data Movement for IoT with SDC Edge

DataWorks Summit

Rajeev_Resume

Rajeev Bhatnagar

Automatically partitioning packet processing applications for pipelined archi...

Ashley Carter

Similar to Presto: Distributed sql query engine (20)

Real time analytics at uber @ strata data 2019

Jack Gudenkauf sparkug_20151207_7

A noETL Parallel Streaming Transformation Loader using Spark, Kafka & Vertica

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure

Real time analytics on deep learning @ strata data 2019

How Java 19 Influences the Future of Your High-Scale Applications .pdf

ChakraCore - JSConf Last Call

Webinar september 2013

Building Continuous Application with Structured Streaming and Real-Time Data ...

Presentation-QRUA

HPC Impact: EDA Telemetry Neural Networks

Presto: Query Anything - Data Engineer’s perspective

Building an open source high performance data analytics platform

Asko Oja Moskva Architecture Highload

Understanding the Single Thread Event Loop

Zeppelin at Twitter

Web application

Ultralight Data Movement for IoT with SDC Edge

Rajeev_Resume

Automatically partitioning packet processing applications for pipelined archi...

Recently uploaded

As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other? Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

Tobias Schneck

Intrigued by why some of the world's largest companies (Netflix, Google, Cisco, Twitter, Uber etc) are using gRPC? In this demo based talk we delve into the world of gRPC in .Net, what it does and why we should use it. We compare the interface with both Rest and graphQL. We will show you how to implement grpc server-side in .net and in the web. Finally, I will show you how the tooling helps you deliver powerful interfaces and interact with them quickly and simply.

Demystifying gRPC in .Net by John Staveley

John Staveley

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

Product School

How world-class product teams are winning in the AI era by CEO and Founder, P...

Product School

IoT Analytics Company Presentation May 2024

IoTAnalytics

I'm excited to share my latest predictions on how AI, robotics, and other technological advancements will reshape industries in the coming years. The slides explore the exponential growth of computational power, the future of AI and robotics, and their profound impact on various sectors. Why this matters: The success of new products and investments hinges on precise timing and foresight into emerging categories. This deck equips founders, VCs, and industry leaders with insights to align future products with upcoming tech developments. These insights enhance the ability to forecast industry trends, improve market timing, and predict competitor actions. Highlights: ▪ Exponential Growth in Compute: How $1000 will soon buy the computational power of a human brain ▪ Scaling of AI Models: The journey towards beyond human-scale models and intelligent edge computing ▪ Transformative Technologies: From advanced robotics and brain interfaces to automated healthcare and beyond ▪ Future of Work: How automation will redefine jobs and economic structures by 2040 With so many predictions presented here, some will inevitably be wrong or mistimed, especially with potential external disruptions. For instance, a conflict in Taiwan could severely impact global semiconductor production, affecting compute costs and related advancements. Nonetheless, these slides are intended to guide intuition on future technological trends.

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl

Peter Udo Diehl

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Thierry Lestable

Unlock the mysteries of successful Salesforce interviews in this insightful session hosted by Hugo Rosario (Salesforce Customer), a seasoned hiring manager that leads the Salesforce Department of multinational company with over 100 interviews under their belt. Step into the manager's chair and gain exclusive behind-the-scenes insights into what makes a Salesforce consultant stand out during the interview process. From deciphering the unspoken cues to mastering key strategies, we'll explore the intricacies of the interview process and provide practical tips for consultants looking to not only pass interviews but also thrive in their roles. Whether you're a seasoned professional or just starting your Salesforce journey, this session is your backstage pass to the secrets that hiring managers wish you knew.

Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...

CzechDreamin

Welcome to UiPath Test Automation using UiPath Test Suite series part 2. In this session, we will cover API test automation along with a web automation demo. Topics covered: Test Automation introduction API Example of API automation Web automation demonstration Speaker Pathrudu Chintakayala, Associate Technical Architect @Yash and UiPath MVP Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

UiPath Test Automation using UiPath Test Suite series, part 2

DianaGray10

The standard Salesforce Approval process can be limiting in many ways, especially in complex scenarios. What if there was a way to implement very flexible approvals where one can use Apex code to make data updates in unrelated records, dynamically generate next steps details, and compute assignees on the fly? And still use UI-based configurations to implement concrete approval processes. In this session, we will share ideas behind such a solution and show a few lines of code to get you started.

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder

CzechDreamin

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...

Product School

You’ve heard good data matters in Machine Learning, but does it matter for Generative AI applications? Corporate data often differs significantly from the general Internet data used to train most foundation models. Join me for a demo on building an open source RAG (Retrieval Augmented Generation) stack using Milvus vector database for Retrieval, LangChain, Llama 3 with Ollama, Ragas RAG Eval, and optional Zilliz cloud, OpenAI.

Introduction to Open Source RAG and RAG Evaluation

Zilliz

Screen flow is a powerful automation tool that is commonly designed for internal and external users. However, what about the guest users? We will dive into various methods of launching screen flows and understand how to make them publicly accessible, extending their usability to a broader audience. The presentation will also cover the implementation of security layers and highlight best practices for a smooth and protected user experience. Discover the potential of screen flows beyond conventional use and learn how to leverage them effectively.

Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade

CzechDreamin

Bits & Pixels using AI for Good.........

Alison B. Lowndes

How to differentiate Sales Cloud and CPQ on first glance might be tricky if you do not know where to look and what to look at. You will know :-) Managing the sales process within Salesforce is a common use case that can be managed with standart Sales Cloud. If you want to do entire quoting process you will find out Salesforce CPQ solution exists. What is then the difference if both can handle selling products? You will see comparison of 10 different features, which Sales Cloud and Salesforce CPQ handle differently. Simple question you will always remember if you should consider using Salesforce CPQ will be a cherry on top.

10 Differences between Sales Cloud and CPQ, Blanka Doktorová

CzechDreamin

Generative AI architecture, at its core, revolves around the concept of machines being able to generate content autonomously, mimicking human-like creativity and decision-making processes. Unlike traditional AI systems that rely on predefined rules and data inputs, generative AI leverages deep learning techniques to produce new, original outputs based on patterns and examples it has learned from vast datasets. This capability opens up a multitude of possibilities across various domains within an enterprise.

The architecture of Generative AI for enterprises.pdf

alexjohnson7307

The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.

Search and Society: Reimagining Information Access for Radical Futures

Bhaskar Mitra

New customer? New industry? New cloud? New team? A lot to handle! How to ensure the success of the project? Start it well! I've created the 3 areas of focus at the beginning of the project that helped me in multiple roles (BA, PO, and Consultant). Learn from real-world experiences and discover how these insights can empower you to deliver unparalleled value to your customers right from the project's start.

Powerful Start- the Key to Project Success, Barbara Laskowska

CzechDreamin

In this session, we will showcase how to revolutionize automated testing for your software, automation, and QA teams with UiPath Test Suite. In part 1 of UiPath test automation using UiPath Test Suite – developer series, we will cover, Software testing overview What is software testing Why software testing is required Typical test types and levels Continuous testing and challenges Introduction to UiPath Test Suite UiPath Test Suite family of products Speaker: Atul Trikha, Chief Technologist & Solutions Architect, Peraton and UiPath MVP Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP

UiPath Test Automation using UiPath Test Suite series, part 1

DianaGray10

Agentic RAG What it is its types applications and implementation.pdf

ChristopherTHyatt

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

Demystifying gRPC in .Net by John Staveley

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

How world-class product teams are winning in the AI era by CEO and Founder, P...

IoT Analytics Company Presentation May 2024

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...

UiPath Test Automation using UiPath Test Suite series, part 2

Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...

Introduction to Open Source RAG and RAG Evaluation

Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade

Bits & Pixels using AI for Good.........

10 Differences between Sales Cloud and CPQ, Blanka Doktorová

The architecture of Generative AI for enterprises.pdf

Search and Society: Reimagining Information Access for Radical Futures

Powerful Start- the Key to Project Success, Barbara Laskowska

UiPath Test Automation using UiPath Test Suite series, part 1

Agentic RAG What it is its types applications and implementation.pdf

Presto: Distributed sql query engine

1. PRESTO Kiran Palaka

2. Problem to solve  Huge production of data.  As data is growing enormously to the point of peta bytes , querying the database has become a big issue.  So we should be able to run more interactive queries and get results faster .

3. Introduction  Presto is a open source distributed sql query engine.  For running queries against of all sizes ranging from gigabytes to petabytes .  It supports ANSI SQL ,including complex queries,aggresgations,joins and window functions .  It is implemented in java.

4. Presto: I can query

5. Architecture

6. Architecture Explanation  Client sends sql to presto coordinator.  Coordinator parses ,analyzes and plans the query execution.  The scheduler wires together the execution pipeline ,assigns work to nodes closest to data and monitors the progress.  The client pulls the data from output stage which in turn pulls data from underlying stages.

7. Hive/Mapreduce Execution model  Hive translates queries into multiple stage of mapreduce tasks and execute them one after the other.  Each task reads input from disk and writes intermediate output back to disk.

8. Presto Execution  Presto engine does not use Mapreduce.  It employs a custom query and execution engine with operators designed to support sql semantics.  Processing is in memory and pipelined across the network between stages which avoids unnecessary I/O and associated latency overhead.  Pipelined execution model runs multiple stages at once and streams data from one stage to next as it becomes available which reduces end-to-end latency

9. Note  Presto dynamically compiles certain portions of query plan to byte code which lets JVM optimize and generate native machine code.

10. Extensibility  Presto was designed with a simple storage abstraction that makes its easy to provide sql query capability against disparate data sources.  Connectors only need to provide interfaces for fetching meta data, getting data locations and accessing data itself.

11. Limitations  Size limitation on the join tables and cardinality of unique groups.  Lacks the ability to write output back to tables. Currently query results are streamed to client.

12. Presto developers claim:  Presto is 10x better than hive/Mapreduce in terms of cpu efficiency and latency for most queries.  Supports ANSI sql, including joins, left/right outer joins,subqueries,most of the common aggregate and scalar functions, including approximate distinct counts, approximate percentiles

Presto: Distributed sql query engine

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (9)

Similar to Presto: Distributed sql query engine

Similar to Presto: Distributed sql query engine (20)

Recently uploaded

Recently uploaded (20)

Presto: Distributed sql query engine