This talk breaks down merge in Delta Lake, explaining what is actually happening under the hood, and then shows how you can optimize a merge. Some code snippets and sample configs will also be shared.
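As a rough illustration of the kind of merge being discussed (not taken from the talk itself), here is a minimal PySpark sketch of a Delta Lake MERGE whose join condition includes the partition column so only the affected partitions are rewritten; the table path and column names are hypothetical.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
updates = spark.read.parquet("s3://bucket/updates/")              # hypothetical change set
target = DeltaTable.forPath(spark, "s3://bucket/events_delta")    # hypothetical Delta table partitioned by date

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id AND t.date = u.date")  # partition column in the condition narrows the rewrite
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())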
Dynamic filtering for Presto join optimisation (Ori Reshef)
Roman Zeyde explains how to optimize Presto joins in selective use cases.
Roman is a Talpiot graduate and ex-Googler, now working as Presto architect at Varada.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk dives into the technical details of Spark SQL across the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
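As a hedged sketch of a few of the knobs this abstract mentions (not from the talk itself), the following PySpark snippet writes partitioned, compressed Parquet and reads it back with filter pushdown enabled; the paths and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://bucket/raw_events/")              # hypothetical input

(df.write
    .option("compression", "snappy")                         # page/column-chunk compression
    .partitionBy("event_date")                               # partitioning scheme for coarse-grained skipping
    .parquet("s3://bucket/events_parquet/"))                 # dictionary encoding is on by default in parquet-mr

spark.conf.set("spark.sql.parquet.filterPushdown", "true")   # min/max row-group skipping (default in recent Spark)
recent = spark.read.parquet("s3://bucket/events_parquet/").where("event_date = '2020-01-01'")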
Parquet performance tuning: the missing guide (Ryan Blue)
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
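For context (these settings are not claimed by the abstract itself), Spark's unified memory manager splits a single region between execution and storage, and that split is controlled by two standard configs; the values below are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")              # hypothetical executor heap
    .config("spark.memory.fraction", "0.6")             # share of heap for execution + storage (unified region)
    .config("spark.memory.storageFraction", "0.5")      # part of that region protected from eviction for cached data
    .getOrCreate())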
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
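A minimal sketch of the behavior described above, assuming a hypothetical star schema with a fact table partitioned by date_id; the config key is the standard Spark 3.x switch for this feature and is not taken from the talk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")   # default in Spark 3.x

result = spark.sql("""
    SELECT f.amount, d.region
    FROM fact_sales f                        -- hypothetical fact table, partitioned by date_id
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.year = 2020                      -- dimension filter prunes fact partitions at runtime
""")
result.explain()                             # the fact scan should contain a dynamic pruning subquery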
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically filter out useless data by pushing down filters against Parquet file statistics, such as min-max statistics. In addition, Spark users can enable the Parquet vectorized reader to read Parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that Parquet pushdown filters work poorly and too much unnecessary data is read, because the statistics have no discrimination across Parquet row groups (column data is out of order when ETL jobs write the Parquet files).
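As a hedged illustration of the underlying idea (not the Bytedance implementation), ordering data by a commonly filtered column before writing keeps the min/max range of each row group narrow, so pushdown can actually skip row groups; the paths and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.filterPushdown", "true")           # row-group min/max skipping
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")   # batch (vectorized) Parquet reads

df = spark.read.json("s3://bucket/raw_events/")      # hypothetical unsorted input
(df.repartition("event_date")
    .sortWithinPartitions("user_id")                 # hypothetical column that queries usually filter on
    .write
    .partitionBy("event_date")
    .parquet("s3://bucket/events_sorted/"))          # narrow min/max per row group lets pushdown skip data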
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save your storage cost on the cloud storage like S3 and to improve the usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
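As a hedged sketch of the configuration surface these use cases map to (verify support against your Spark/Parquet/ORC versions), the settings below switch shuffle, event log, and data file compression to Zstandard; they are standard Spark config keys, not taken from the talk.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.io.compression.codec", "zstd")            # shuffle/spill compression (use case 1)
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")      # event log compression (use case 2)
    .config("spark.sql.parquet.compression.codec", "zstd")   # Parquet data files (needs Parquet 1.12+)
    .config("spark.sql.orc.compression.codec", "zstd")       # ORC data files (ORC 1.6+)
    .getOrCreate())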
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has a real need to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture & use cases of the second generation of ‘Hudi’, a self-contained Apache Spark library to build large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show how to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.
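A minimal, hedged sketch of a Hudi upsert via the Spark datasource; the option keys are standard Hudi datasource options, while the table name, key fields, and paths are hypothetical (older Hudi versions use the format name "org.apache.hudi" instead of "hudi").

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://bucket/trip_updates/")                   # hypothetical change records

(df.write.format("hudi")
    .option("hoodie.table.name", "trips")                           # hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")      # record key used for upserts
    .option("hoodie.datasource.write.precombine.field", "ts")       # latest value wins on key collisions
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://bucket/hudi/trips/"))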
How Adobe Does 2 Million Records Per Second Using Apache Spark! (Databricks)
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
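As a small, hedged illustration of the levers behind those questions (not taken from the talk), the snippet below adjusts shuffle parallelism and reshapes partitions before writing; the numbers, column, and paths are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "400")        # parallelism after wide transformations

df = spark.read.parquet("s3://bucket/input/")                # hypothetical input
shaped = df.repartition(400, "customer_id")                  # spread work evenly across 400 tasks
(shaped.coalesce(50)                                         # fewer, larger output files without a full shuffle
    .write.parquet("s3://bucket/output/"))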
Large Scale Lakehouse Implementation Using Structured Streaming (Databricks)
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, AutoLoader and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion’s data lake executes 4000+ streaming jobs and hosts over 4000 tables in its production data lake on AWS.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Are you using the fastest query tool for Hadoop? We provide and discuss the latest performance results of the industry-standard TPC-H benchmarks executed across an assortment of open source query tools such as Hive (using MR, TEZ, LLAP, Spark), SparkSQL, Presto, and Drill. Additionally, the performance tests utilize a variety of data sizes, popular storage formats such as ORC, Parquet and Text, and compression codecs.
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark Partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
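As one hedged example of the salting technique mentioned above (not from the talk), the sketch below spreads a skewed join key over several salt buckets; the input paths, join_key column, and bucket count are hypothetical.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
large_df = spark.read.parquet("s3://bucket/events/")     # hypothetical skewed fact data with a join_key column
small_df = spark.read.parquet("s3://bucket/lookup/")     # hypothetical small dimension with the same join_key

SALT_BUCKETS = 16                                                        # hypothetical skew factor
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small_df.crossJoin(salts)                                 # replicate the small side once per salt value
large_salted = large_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
joined = large_salted.join(small_salted, ["join_key", "salt"])           # one hot key now spreads over many tasks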
Managing big data stored on ADLSgen2/Databricks may be challenging. Setting up security, moving or copying the data of Hive tables or their partitions may be very slow, especially when dealing with hundreds of thousands of files.
Robust and Scalable ETL over Cloud Storage with Apache Spark (Databricks)
The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from services such as Amazon S3, thereby decoupling storage and compute. However, there are limitations to object stores such as S3. Chained or concurrent ETL jobs often run into issues on S3 due to inconsistent file listings and the lack of atomic rename support. Metadata performance also becomes an issue when running jobs over many thousands to millions of files.
Speaker: Eric Liang
This talk was originally presented at Spark Summit East 2017.
Understanding Spark Tuning: Strata New York (Rachel Warren)
How to design a Spark Auto Tuner.
The first section covers how to set basic Spark settings, e.g. executor memory, driver memory, dynamic allocation, shuffle settings, number of partitions, etc. The second section covers how to collect historical data about a Spark job, and the third section discusses designing an auto-tuner application which programmatically configures Spark jobs using that historical data.
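A hedged sketch of the kind of baseline configuration the first section refers to; the values are hypothetical placeholders an auto-tuner might pick, not recommendations from the talk.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
    .set("spark.executor.memory", "8g")                    # hypothetical starting point
    .set("spark.driver.memory", "4g")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.maxExecutors", "50")
    .set("spark.shuffle.service.enabled", "true")          # needed for dynamic allocation on YARN
    .set("spark.sql.shuffle.partitions", "400"))
spark = SparkSession.builder.config(conf=conf).getOrCreate()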
Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Tuning Apache Spark is somewhat of a dark art, although thankfully, when it goes wrong, all we tend to lose is several hours of our day and our employer’s money.
Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using both historical and live job information, using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads. Much of the data required to effectively tune jobs is already collected inside of Spark. You just need to understand it. Holden, Rachel, and Anya outline sample auto-tuners and discuss the options for improving them and applying similar techniques in your own work. They also discuss what kind of tuning can be done statically (e.g., without depending on historic information) and look at Spark’s own built-in components for auto-tuning (currently dynamically scaling cluster size) and how you can improve them.
Even if the idea of building an auto-tuner sounds as appealing as using a rusty spoon to debug the JVM on a haunted supercomputer, this talk will give you a better understanding of the knobs available to you to tune your Apache Spark jobs.
Also, to be clear, Holden, Rachel, and Anya don’t promise to stop your pager going off at 2:00am, but hopefully this helps.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes and chaining of incremental processing in Hadoop.
In the data analytics space, no one can argue that Spark has become the preferred tool for the data scientist, the business analyst and the developer. At Intel, Spark is widely used across the organization to interact with Hive, to process streaming data, and to ingest data from diverse sources to be used in machine learning or data analytics. In this presentation, we want to share how reusable ingestion components using Spark SQL have accelerated our application development phase. We will discuss the challenges we faced at Intel when running Spark-on-YARN applications. Have you spent time wondering why your Spark SQL query was running very slowly, or pondering different methods for ingesting data faster from an RDBMS? We will review Spark-on-YARN deployment and configuration, describe the challenges posed by handling and processing large datasets, and finally share recommendations on how to tune Spark jobs to optimize performance by properly allocating resources.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ... (Chester Chen)
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized and inefficient. Data lakes are a common architectural pattern to organize big data and democratize access across the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and sizes the files of the resulting data lake using purely open-source file formats, also providing optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is Technical Lead at Uber Data Infrastructure Team
Introduction to Yaetos, an open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in the AWS cloud. Focus on the Spark component.
Architecting a 35 PB distributed parallel file system for science (Speck&Tech)
ABSTRACT: Perlmutter is the newest supercomputer at Berkeley Lab, California, and features a whopping 35 PB all-flash Lustre file system. Let's dive into its architecture, showing some early performance figures and unique performance considerations, using low-level Lustre tests that achieve over 90% of the theoretical bandwidth of the SSDs, to showcase how Perlmutter achieves the performance of a burst buffer and the resilience of a scratch file system. Lastly, we cover some performance considerations unique to an all-flash Lustre file system, along with tips on how better I/O patterns can make the most of such powerful architectures.
BIO: Alberto Chiusole studied Data Science and Scientific Computing in Trieste when he had the opportunity to spend some months at CERN, in Geneva, benchmarking their Ceph file system against a classic Lustre file system from eXact lab, the HPC consulting company in Trieste he was working for at the time. After Trieste, he worked as a Storage and I/O Software Engineer at Berkeley Lab, California, a national scientific laboratory, where he assisted scientists with improving their I/O and data needs. He now works for Seqera Labs as an HPC DevOps Engineer, focusing on infrastructure support.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
5. Problems when writing to S3: EC
- Eventual consistency problems
- HEAD (404) -> PUT -> GET
- PUT -> PUT -> GET
- PUT -> DELETE -> LIST-PARENT
6. Problems when writing to S3: Rename
- Operation: rename s3://bucket/x to s3://bucket/y, implemented as copy x to y, then delete x
- Copy is slow and depends on file size
- Two calls needed
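To make that cost concrete, here is a minimal, hedged boto3 sketch of how a rename has to be emulated on S3 (bucket and keys are placeholders): one server-side copy whose duration grows with object size, followed by a delete, i.e. two requests per file.

import boto3

s3 = boto3.client("s3")

def s3_rename(bucket, src_key, dst_key):
    # No rename primitive on S3: copy the object, then delete the original.
    s3.copy_object(Bucket=bucket, Key=dst_key,
                   CopySource={"Bucket": bucket, "Key": src_key})
    s3.delete_object(Bucket=bucket, Key=src_key)

s3_rename("bucket", "x", "y")   # "rename" s3://bucket/x to s3://bucket/y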
7. Problems when writing to S3: Failures
- Transient failures of S3 rest calls
- Throttling
9. Two kinds of tables
- Hive table: distributed write to the Hive staging dir, then Hive.loadTable / Hive.loadPartition is called to move the data to the warehouse
- Datasource table: distributed write directly to the final destination
12. Problem: loadPartition is slow
- Hive.replaceFiles / Hive.copyFiles primitive is used to move data from the hive staging dir to the warehouse dir
- Rename done in the hive operations is slow and serialized
- No retries to account for transient failures
13. Problem: loadPartition has EC issues
- EC issues during the copy/move
- A few files written to the hive staging directory may not appear in the listing done on the driver during Hive.replaceFiles
- A few files deleted may appear in the listing (especially in the FOC v1 case)
15. Solution: Robustness
- Listing related
- diff(oldListing, newListing)
- if new files appear, rename them in this iteration
- if existing files disappear, don't try to rename them
- Rename related
- if rename failed, try to rename them in next iteration
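A rough Python sketch of the listing-diff retry loop described on this slide; list_staging and rename_one are hypothetical callables standing in for the S3 listing and rename primitives, and this is not the actual implementation.

def robust_rename(list_staging, rename_one, max_rounds=5):
    # Each round re-lists the (eventually consistent) staging dir, renames what is
    # visible now, and leaves failures to be retried in the next round.
    done = set()
    for _ in range(max_rounds):
        visible = set(list_staging())        # may gain or lose entries between rounds
        to_move = visible - done             # newly appeared files are picked up here
        if not to_move:
            break
        for path in to_move:
            try:
                rename_one(path)
                done.add(path)
            except OSError:
                pass                         # not marked done, so it is retried next round
    # files that disappeared from the listing are simply never attempted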
16. Solution: Performance
- Rename in parallel in a threadpool of 128 threads
- For INSERT INTO, find the N to use for file_copy_N, for all files in the dest dir, in one shot
- Rename the biggest files first so that they don't become the long pole
- Rename the recently modified files last (FIFO on time) so that they get time to vanish
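A minimal, hedged sketch of the parallel, biggest-first rename from this slide; rename_one is a hypothetical callable, files is assumed to be a list of (path, size) pairs, and the real change lives inside the Spark/Hive write path rather than in user code.

from concurrent.futures import ThreadPoolExecutor

def parallel_rename(files, rename_one, num_threads=128):
    # Largest files go first so they do not become the long pole at the end.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(rename_one, path) for path, _ in ordered]
        for fut in futures:
            fut.result()                     # surface any rename failure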
17. Solution: Performance numbers
- INSERT OVERWRITE TABLE user PARTITION(date="2011") SELECT userId, firstName, email FROM people
- For example, 100 GB of data spread over 10000 files
- Before optimization: 110 mins
- After optimization: 12 mins (not sensitive to file count)
22. Solution: Write directly to the warehouse
- Use Spark's default write flow for hive tables also
- Avoid using staging_dir
- Uses whatever OutputCommitter is active
- Changes in the Spark code base
- Cases: INSERT INTO/OVERWRITE + Static/dynamic partitions
- Except INSERT OVERWRITE involving dynamic partitions
- Con: Affects warehouse directory immediately on job start
23. Solution: Write directly to the warehouse
- Very good performance gains
- Hive.loadTable / Hive.loadPartition not needed
- Error recovery needs to be done carefully
- On failure, delete all files s3://bucket/path/*/*/*<jobId>*
24. Solution: Performance
- Data: 142 GB (Records - 149994000, Partitions - 9000)
- Each partition had one file
- Direct writes disabled: 7 hr, 30 min
- Direct writes enabled: 24.5 mins
- Spark distributed write was fast in both cases; in the first case the extra move was needed
26. DirectFileOutputCommitter (DFOC)
- Directly write to output location
- Pros: No EC, high performance
- Cons: Speculation and task retries will fail
- Cons: Output is visible before job finish
27. Problem
- If you use DFOC, any task failure will cause job failure
- Empty S3 file is created even on task failure
- Retry will always fail with FileAlreadyExistsException
- 7/08/16 00:33:55 task-result-getter-1 WARN TaskSetManager: Lost task 0.1 in stage 42.0 (TID 5782, 10.23.7.190, executor 10): org.apache.hadoop.fs.FileAlreadyExistsException: s3n://bucket/path/2017/08/15/23/part-00000-017681ee-5206-4163-b4a9-a29cf8a67ab4.json.gz already exists
28. Solution: Overwrite if file already exists
- fs.create(path, false) -> fs.create(path, true)
- Spark changes - different across versions
- Hive changes - orc
- Parquet changes
30. Problem:
- alter table recover partitions is slow
- Algorithm
- Generate list of all partitions and their statistics
- Add partitions to metastore
- Example: two partition keys, 100 values each, 10k partitions in total - takes close to (10+20) mins to recover partitions (Spark 2.1.0)
31. Solution
- Use faster variant of S3 listing, prefix based
- 10 mins for gathering partitions and stats reduced to 10 secs
- Now total time is (10 secs + 20 mins), 33% improvement
- Spark-only changes
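As a hedged illustration of a prefix-based listing (not the actual implementation), the boto3 sketch below walks every key under a hypothetical table prefix in one paginated listing and derives the partition directories from the key paths, instead of issuing a separate listing per partition.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

partitions = set()
for page in paginator.paginate(Bucket="bucket", Prefix="warehouse/user/"):   # bucket/prefix are hypothetical
    for obj in page.get("Contents", []):
        # key looks like warehouse/user/date=2011/country=US/part-00000
        parts = obj["Key"].split("/")[2:-1]      # keep only the partition directories
        if parts:
            partitions.add(tuple(parts))
print(len(partitions), "partitions discovered")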