Presto overview


Shixiong Zhu

Overview

Register
Ask active nodes

Discovery
Server

Coordinator
SQL
SQL
QueryInfo

SQLQueryManager

QueryResults
NextUri
CLI

SQLQueryExecution

StatementResource

QueryStarter

…

HttpRemoteTask
Fetch Data

QueryResults

Coordinator

Partial Data

OutputReceiver

Worker

SubPlan
ExchangeNode

AggregationNode(FINAL)

Plan
TableScanNode

OutputNode

FilterNode
SubPlan

AggregationNode

TableScanNode
FilterNode

OutputNode
AggregationNode(PARTIAL)
SinkNode
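
The split of one AggregationNode into a PARTIAL step (next to the table scan and filter) and a FINAL step (behind the exchange) can be illustrated with a small sketch. This is not Presto code: the function names, the (count, sum) state, and the sample rows are assumptions chosen to show an average being computed in two phases.

```python
from collections import defaultdict


def partial_aggregate(rows, predicate):
    """Worker side: scan one split, filter it, and pre-aggregate (count, sum) per key."""
    state = defaultdict(lambda: [0, 0.0])
    for key, value in rows:
        if predicate(key, value):          # FilterNode
            state[key][0] += 1             # AggregationNode(PARTIAL)
            state[key][1] += value
    return dict(state)                     # handed to the sink for the next stage


def final_aggregate(partials):
    """Downstream side: merge the partial states that arrive through the exchange."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for key, (count, total) in partial.items():
            merged[key][0] += count
            merged[key][1] += total
    # AggregationNode(FINAL): turn the merged state into the result (here: average)
    return {key: total / count for key, (count, total) in merged.items()}


# Two hypothetical splits of the same table, processed independently.
splits = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0), ("b", -2.0)],
]
partials = [partial_aggregate(split, lambda k, v: v > 0) for split in splits]
averages = final_aggregate(partials)
```

Only the small per-key states cross the network, which is the usual reason for splitting an aggregation this way.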

SubPlan

T: TableScanNode
A: AggregationNode
E: ExchangeNode

E

E

A(FINAL)

A(FINAL)

Plan
T

JoinNode
T
OutputNode

A

A
SubPlan

SubPlan

T

JoinNode

T

A(PARTIAL)

A(PARTIAL)

SinkNode

SinkNode

OutputNode

Stage

Task

Worker

Results

Stage
Stage

Worker

Coordinator

Worker

Worker

Worker

LocalExecutionPlan

SubPlan

Node1

Op1

Node2

Op2

Node3

Op3

…

…

Node3

Opn

LocalExecutionPlan

SubPlan
Node1

Node2

LocalExecutionPlan

Op1

Op2

SourceHash
JoinNode

HashJoinOperator

Node3

Op3

HashBuilderOperator

Page (max page size: 1 MB, max rows: 16 * 1024)

Row

Block

Slice
A byte array

Block

Block

Block

Block
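
The Page / Block / Slice layering can be sketched as follows, assuming a Page holds one Block per column, each Block keeps its values in a Slice (a byte array), and a row is the same position read across all Blocks. `LongBlock`, `Page.row`, and the fixed 8-byte encoding are illustrative names and choices, not Presto's classes.

```python
import struct


class LongBlock:
    """One column of 64-bit integers stored in a single byte array (the 'Slice')."""

    def __init__(self, values):
        self.slice = bytearray(struct.pack(f"<{len(values)}q", *values))
        self.position_count = len(values)

    def get(self, position):
        # Fixed-width values: position i lives at byte offset i * 8.
        return struct.unpack_from("<q", self.slice, position * 8)[0]


class Page:
    """A batch of rows stored column by column, one Block per column."""

    MAX_ROWS = 16 * 1024  # row limit taken from the slide

    def __init__(self, *blocks):
        counts = {block.position_count for block in blocks}
        if len(counts) != 1:
            raise ValueError("a page needs at least one block, all of the same length")
        self.position_count = counts.pop()
        if self.position_count > self.MAX_ROWS:
            raise ValueError("too many rows for one page")
        self.blocks = blocks

    def row(self, position):
        # A row is a view across the blocks at one position.
        return tuple(block.get(position) for block in self.blocks)


page = Page(LongBlock([1, 2, 3]), LongBlock([10, 20, 30]))
```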

Split

Split

Split

Split

Split

Split

Split
Is the data ready?

Register a callback

N
When the data of this Split is ready, put the Split back.

Y
Fetch one Page

Execute Operator
Y

Has next Operator?

N

N

TaskExecutor

Is the Split done?

Thread number = core number * 4

Y

Y

N
Time's up?
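
The flowchart above can be read as a time-sliced loop over a queue of Splits. The sketch below is a single-threaded simulation of that reading, not the TaskExecutor itself: `Split`, `run_quantum`, and the `max_pages` budget (standing in for "Time's up?") are hypothetical, and the real executor runs this loop on a pool of threads.

```python
from collections import deque


class Split:
    """Hypothetical split: a queue of pages plus a 'data ready' flag."""

    def __init__(self, name, pages, ready=True):
        self.name = name
        self.pages = deque(pages)
        self.ready = ready
        self._callbacks = []

    def is_done(self):
        return not self.pages

    def on_ready(self, callback):
        self._callbacks.append(callback)

    def mark_ready(self):
        self.ready = True
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback(self)


def run_quantum(split, operators, queue, log, max_pages=2):
    """Run one split until it is done or its budget (a stand-in for time) is used up."""
    if not split.ready:
        # Data not ready: register a callback that puts the split back later.
        split.on_ready(queue.append)
        return
    processed = 0
    while not split.is_done():
        page = split.pages.popleft()       # fetch one page
        for operator in operators:         # execute each operator in turn
            page = operator(page)
        log.append((split.name, page))
        processed += 1
        if processed >= max_pages and not split.is_done():
            queue.append(split)            # time's up: give other splits a turn
            return


late = Split("s3", [100], ready=False)
queue = deque([Split("s1", [1, 2, 3]), Split("s2", [10]), late])
operators = [lambda page: page + 1, lambda page: page * 2]
log = []

while queue:
    run_quantum(queue.popleft(), operators, queue, log)

late.mark_ready()                          # the data arrives; the callback re-queues s3
while queue:
    run_quantum(queue.popleft(), operators, queue, log)
```

Here `s1` is interrupted after two pages so that `s2` can run, and `s3` only runs after its data becomes available.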

Execution Operators
Op1

page = op1.getOutput
op2.addInput(page)

Op2
page = op2.getOutput
op3.addInput(page)
Op3

…

Opn
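
The hand-off `page = op1.getOutput; op2.addInput(page)` can be sketched as a small driver loop. The `Operator` class and `drive` function below are assumptions for illustration; only the `getOutput` / `addInput` pairing comes from the slide.

```python
from collections import deque


class Operator:
    """Hypothetical operator: transforms each input page and buffers the result."""

    def __init__(self, transform):
        self._transform = transform
        self._out = deque()

    def add_input(self, page):
        self._out.append(self._transform(page))

    def get_output(self):
        return self._out.popleft() if self._out else None


def drive(operators, source_pages):
    """Push every source page through the operator chain, one hand-off at a time."""
    results = []
    for page in source_pages:
        operators[0].add_input(page)
        for current, following in zip(operators, operators[1:]):
            page = current.get_output()        # page = opN.getOutput
            if page is not None:
                following.add_input(page)      # opN+1.addInput(page)
        out = operators[-1].get_output()
        if out is not None:
            results.append(out)
    return results


keep_even = Operator(lambda page: [x for x in page if x % 2 == 0])
times_ten = Operator(lambda page: [x * 10 for x in page])
output = drive([keep_even, times_ten], [[1, 2, 3, 4], [5, 6]])
```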

Input
TableScanOperator

HiveSplit

DataStreamManager
RecordSetDataStreamProvider
RecordProjectOperator

ConnectorDataStreamProvider
HiveRecordSet

HiveClient

ConnectorDataStreamProvider

HiveSplit

InputFormat

RecordReader
HiveRecordSet

Lines

TableScanOperator
RecordProjectOperator
Page
Next Operator

Load Balance
NodeMap

Split

Map: Rack -> Nodes

NodeSelector

NodeScheduler
Map: Host -> Nodes

Map: Host:Port -> Nodes

Node

NodeSelector.selectNode
• Select acceptable nodes (at least 10 nodes by default)
– Nodes that have the same address
– If not enough, add nodes in the same rack
– If not enough, randomly select nodes in other racks

• Select the node with the smallest number of assignments (pending tasks)
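
The two selection steps can be sketched as below. This is an illustration under stated assumptions, not the NodeSelector source: nodes are modelled as `(node_id, host)` pairs, `rack_of` maps a host to its rack, and `assignments` maps a node id to its number of pending tasks. All of these names are hypothetical.

```python
import random


def select_node(split_hosts, nodes, assignments, rack_of, min_candidates=10, rng=random):
    """Pick a node for a split: collect candidates by locality, then take the least loaded."""
    # 1. Nodes at the same address as the split's data.
    candidates = [node for node in nodes if node[1] in split_hosts]

    # 2. If not enough, add nodes in the same rack.
    if len(candidates) < min_candidates:
        racks = {rack_of[host] for host in split_hosts if host in rack_of}
        same_rack = [
            node for node in nodes
            if node not in candidates and rack_of.get(node[1]) in racks
        ]
        candidates += same_rack[: min_candidates - len(candidates)]

    # 3. If still not enough, add randomly chosen nodes from the remaining ones.
    if len(candidates) < min_candidates:
        others = [node for node in nodes if node not in candidates]
        rng.shuffle(others)
        candidates += others[: min_candidates - len(candidates)]

    # Finally, choose the candidate with the fewest pending assignments.
    return min(candidates, key=lambda node: assignments.get(node[0], 0))


nodes = [("n1", "h1"), ("n2", "h2"), ("n3", "h3")]
rack_of = {"h1": "r1", "h2": "r1", "h3": "r2"}
assignments = {"n1": 5, "n2": 1, "n3": 0}
chosen = select_node({"h1"}, nodes, assignments, rack_of, min_candidates=2)
```

With `min_candidates=2` the busy local node `n1` loses to its rack neighbour `n2`. A larger candidate set trades data locality for a more even load. The sketch raises `ValueError` when `nodes` is empty.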

Output
• Only SELECT statements are supported
– Currently query results are streamed to the client

Communication
• Protocol: HTTP
• Data Format: JSON
• Every instance has one server and one client
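
How a client might page through QueryResults by following NextUri over HTTP and JSON can be sketched without a real server. In the sketch below, the JSON field names `data` and `nextUri`, the `/q/...` URIs, and the injected `http_get` function are assumptions for illustration; a real client would issue HTTP GET requests instead of reading from a dictionary.

```python
import json


def fetch_all(first_response, http_get):
    """Collect rows by following nextUri links until a response carries none."""
    rows = []
    response = first_response
    while True:
        rows.extend(response.get("data", []))
        next_uri = response.get("nextUri")
        if next_uri is None:
            return rows
        # Each response body is a JSON document.
        response = json.loads(http_get(next_uri))


# A fake server: URI -> JSON body (hypothetical URIs and payloads).
fake_server = {
    "/q/1": json.dumps({"data": [[1]], "nextUri": "/q/2"}),
    "/q/2": json.dumps({"data": [[2], [3]]}),
}
first = {"nextUri": "/q/1"}  # the response to submitting the SQL, with no data yet
rows = fetch_all(first, fake_server.__getitem__)
```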





Flink meetup

Frank McSherry

Similar to Presto overview (20)

Do flink on web with flow - Dongwon Kim & Haemee park, SK Telecom)

Do Flink on Web with FLOW

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...

Exactly Once Semantics Revisited (Jason Gustafson, Confluent) Kafka Summit NY...

Apache Flink internals

Second Level Cache in JPA Explained

Flink internals web

The Stream Processor as a Database Apache Flink

The Stream Processor as the Database - Apache Flink @ Berlin buzzwords

Unified stateful big data processing in Apache Beam (incubating)

Aljoscha Krettek - Portable stateful big data processing in Apache Beam

Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans

Apache Flink Training: System Overview

Delta Lake Streaming: Under the Hood

Tech Talk: ONOS- A Distributed SDN Network Operating System

Vertica And Spark: Connecting Computation And Data

BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL-...

Flink SQL: The Challenges to Build a Streaming SQL Engine

Flink meetup

Presto overview

1. Presto Overview Shixiong Zhu

2. Overview Register Ask active nodes Discovery Server

3. Coordinator SQL SQL QueryInfo SQLQueryManager QueryResults NextUri CLI SQLQueryExecution StatementResource QueryStarter … HttpRemoteTask Fetch Data QueryResults Coordinator Partial Data OutputReceiver Worker

4. SubPlan ExchangeNode AggregationNode(FINAL) Plan TableScanNode OutputNode FilterNode SubPlan AggregationNode TableScanNode FilterNode OutputNode AggregationNode(PARTIAL) SinkNode

5. SubPlan T: TableScanNode A: AggregationNode E: ExchangeNode E E A(FINAL) A(FINAL) Plan T JoinNode T OutputNode A A SubPlan SubPlan T JoinNode T A(PARTIAL) A(PARTIAL) SinkNode SinkNode OutputNode

6. Stage Task Worker Results Stage Stage Worker Coordinator Worker Worker Worker

7. Worker

8. LocalExecutionPlan SubPlan Node1 Op1 Node2 Op2 Node3 Op3 … … Node3 Opn

9. LocalExecutionPlan SubPlan Node1 Node2 LocalExecutionPlan Op1 Op2 SourceHash JoinNode HashJoinOperator Node3 Op3 HashBuilderOperator

10. Page(max page size: 1MB, max rows: 16 * 1024 ) Row Block Slice A byte array Block Block Block Block
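The size relationship on this slide (a Slice is a byte array, a Block wraps one Slice, and a Page's size is the sum of its Block sizes) can be sketched as follows. This is only an illustration: the `Block` and `Page` classes and the constant names are hypothetical stand-ins, not Presto's actual classes.

```python
from dataclasses import dataclass
from typing import List

# Limits quoted on the slide; the constant names are assumptions.
MAX_PAGE_SIZE_BYTES = 1024 * 1024  # 1MB
MAX_PAGE_ROWS = 16 * 1024


@dataclass
class Block:
    """Hypothetical Block backed by a single Slice (modelled as bytes)."""
    slice_: bytes

    @property
    def size(self) -> int:
        # The Block size is the size of its backing Slice.
        return len(self.slice_)


@dataclass
class Page:
    """Hypothetical Page holding one Block per column."""
    blocks: List[Block]

    @property
    def size(self) -> int:
        # The Page size is the sum of all its Block sizes.
        return sum(block.size for block in self.blocks)


page = Page([Block(bytes(4)), Block(bytes(8))])
page_fits = page.size <= MAX_PAGE_SIZE_BYTES
```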

11. Split Split Split Split Split Split Split Is the data ready? Register a callback N When the data of this Split is ready, put the Split back. Y Fetch one Page Execute Operator Y Has next Operator? N N TaskExecutor Is the Split done? Thread number = core number * 4 Y Y N Time's up?
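The "Time's up?" loop in this flowchart behaves like round-robin scheduling: a Split runs for a bounded slice and, if it is not done, goes back to the queue. The sketch below models that with abstract work units instead of wall-clock time so it stays deterministic. The function `run_splits` and its parameters are hypothetical, and the data-ready callback branch is left out.

```python
from collections import deque
from typing import Dict, List


def run_splits(splits: Dict[str, int], quantum: int) -> List[str]:
    """Simulate time-sliced execution of splits.

    splits maps a split name to its remaining work units; quantum is the
    most work a split may do before it is put back in the queue.
    Returns the order in which splits were given a turn.
    """
    queue = deque(splits.items())
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)
        remaining -= min(quantum, remaining)  # run until done or time is up
        if remaining > 0:
            queue.append((name, remaining))  # not done: put the split back
    return order


order = run_splits({"a": 3, "b": 1}, quantum=2)
```

Here split `a` needs two turns, so `b` gets to run in between instead of waiting for `a` to finish.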

12. Execution Operators Op1 page = op1.getOutput op2.addInput(page) Op2 page = op2.getOutput op3.addInput(page) Op3 … Opn
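The hand-off on this slide (`page = op1.getOutput`, then `op2.addInput(page)`) can be sketched as a small pipeline. This is a simplified model, not Presto's operator interface: the `Operator` class, `run_pipeline`, and the use of a list of integers as a page are all assumptions made for illustration, and the first operator is fed a page directly rather than reading from a source.

```python
from typing import Callable, List, Optional

Page = List[int]  # stand-in for a page of rows


class Operator:
    """Hypothetical operator that applies a function to every row."""

    def __init__(self, fn: Callable[[int], int]) -> None:
        self.fn = fn
        self.pending: Optional[Page] = None

    def add_input(self, page: Page) -> None:
        self.pending = [self.fn(row) for row in page]

    def get_output(self) -> Optional[Page]:
        # Hand over the buffered page and clear the buffer.
        page, self.pending = self.pending, None
        return page


def run_pipeline(operators: List[Operator], page: Page) -> Optional[Page]:
    """Push one page through the chain: output of op[i] feeds op[i + 1]."""
    operators[0].add_input(page)
    for upstream, downstream in zip(operators, operators[1:]):
        out = upstream.get_output()
        if out is None:
            return None  # nothing to pass on yet
        downstream.add_input(out)
    return operators[-1].get_output()


ops = [Operator(lambda x: x + 1), Operator(lambda x: x * 2)]
result = run_pipeline(ops, [1, 2, 3])
```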

13. Input TableScanOperator HiveSplit DataStreamManager RecordSetDataStreamProvider RecordProjectOperator ConnectorDataStreamProvider HiveRecordSet HiveClient ConnectorDataStreamProvider

14. HiveSplit InputFormat RecordReader HiveRecordSet Lines TableScanOperator RecordProjectOperator Page Next Operator

15. Load Balance NodeMap Split Map: Rack -> Nodes NodeSelector NodeScheduler Map: Host -> Nodes Map: Host:Port -> Nodes Node

16. NodeSelector.selectNode • Select acceptable nodes (at least 10 nodes by default) – Nodes with the same address as the Split – If not enough, add nodes in the same rack – If not enough, randomly select nodes in other racks • Select the node with the smallest number of assignments (pending tasks)
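The two steps above (gather candidates by locality, then pick the least loaded one) can be sketched as below. This is a minimal model under stated assumptions, not the real `NodeSelector`: the `Node` class, the `select_node` signature, and the `assignments` map are hypothetical, and the function expects at least one node.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    host: str
    rack: str


def select_node(split_host: str, split_rack: str, nodes: List[Node],
                assignments: Dict[Node, int], min_candidates: int = 10,
                rng: Optional[random.Random] = None) -> Node:
    """Pick a node for a split: prefer locality, then the lowest load."""
    rng = rng or random.Random()
    # 1. Nodes on the same host as the split.
    candidates = [n for n in nodes if n.host == split_host]
    # 2. If not enough, add nodes in the same rack.
    if len(candidates) < min_candidates:
        candidates += [n for n in nodes
                       if n.rack == split_rack and n not in candidates]
    # 3. If still not enough, add randomly chosen nodes from other racks.
    if len(candidates) < min_candidates:
        others = [n for n in nodes if n not in candidates]
        rng.shuffle(others)
        candidates += others[:min_candidates - len(candidates)]
    # 4. Choose the candidate with the fewest pending assignments.
    return min(candidates, key=lambda n: assignments.get(n, 0))


h1, h2, h3 = Node("h1", "r1"), Node("h2", "r1"), Node("h3", "r2")
load = {h1: 5, h2: 1, h3: 0}
local_only = select_node("h1", "r1", [h1, h2, h3], load, min_candidates=1)
same_rack = select_node("h1", "r1", [h1, h2, h3], load, min_candidates=2)
any_rack = select_node("h1", "r1", [h1, h2, h3], load, min_candidates=3)
```

With a small candidate minimum the busy local node wins; raising the minimum widens the pool and lets a less loaded node in the same rack, or in another rack, take the split.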

17. Output • Only the SELECT statement is supported – Currently query results are streamed to the client

18. Communication • Protocol: HTTP • Data Format: JSON • Every instance has one server and one client

19. Q&A

Editor's Notes

A SubPlan is converted by LocalExecutionPlanner into a LocalExecutionPlan, which holds a sequence of operators.
HashJoinOperator and HashBuilderOperator are connected by SourceHash, which contains the output of HashBuilderOperator.
You can think of a Slice as a byte array. The Slice size is the array size. The Block size is the Slice size. The Page size is the sum of all the Block sizes.
Every Split is only allowed to execute for 1 s at a time by default. When the time is up, the Split is put back into the queue.
RecordSetDataStreamProvider is a subclass of ConnectorDataStreamProvider.
When DiscoveryNodeManager receives any node information query, it checks whether its cache has expired (after 5 seconds). If so, it asks the ServiceSelector to fetch the active nodes and drops the failed nodes. ServiceSelector fetches the new node list from the Discovery Server every 10 s by default. A thread in HeartbeatFailureDetector sends a heartbeat to every active node every 500 ms by default.
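The expire-on-read behaviour described for DiscoveryNodeManager can be sketched as a small cache that refreshes only when a query arrives after the expiry interval. This is an assumption-laden illustration, not Presto's code: `NodeCache`, `fetch`, and the injectable `clock` are hypothetical names, and failure detection is left out.

```python
import time
from typing import Callable, List, Optional


class NodeCache:
    """Hypothetical node list cache that is refreshed lazily on read."""

    def __init__(self, fetch: Callable[[], List[str]],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch      # stands in for asking the ServiceSelector
        self.ttl = ttl_seconds  # 5 seconds, as in the note above
        self.clock = clock      # injectable so the example is testable
        self._nodes: Optional[List[str]] = None
        self._loaded_at = 0.0

    def get(self) -> List[str]:
        now = self.clock()
        # Refresh on the first read or once the cache has expired.
        if self._nodes is None or now - self._loaded_at >= self.ttl:
            self._nodes = self.fetch()
            self._loaded_at = now
        return self._nodes


fake_time = [0.0]
fetch_count = [0]


def fake_fetch() -> List[str]:
    fetch_count[0] += 1
    return ["worker-1", "worker-2"]


cache = NodeCache(fake_fetch, ttl_seconds=5.0, clock=lambda: fake_time[0])
cache.get()          # first read: fetches
fake_time[0] = 4.9
cache.get()          # still fresh: served from the cache
fake_time[0] = 5.0
nodes = cache.get()  # expired: fetches again
```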
