2. What is Apache Spark?
▪ A framework for distributed computing that transforms distributed, immutable,
in-memory structures using a set of parallel operations
▪ e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
▪ RDD (Resilient Distributed Dataset), DataFrame, Dataset
– SQL-based data types are supported
– Scala is the primary language for programming on Spark
Analyzing and Optimizing Java Code Generation for Apache Spark Query Plan / Kazuaki Ishizaki
[Architecture diagram] Spark Runtime (written in Java and Scala) underlies Spark
Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine
learning), all running on the Java Virtual Machine. The Driver (running the user
program, e.g. val ds = ...; val ds1 = ...) sends tasks to Executors and receives
results; each Executor reads Data from a Data Source (HDFS, DB, File, etc.).
Open source: http://spark.apache.org/. The latest version is 2.4, released in 2018/11.
3. What is Apache Spark? (cont.)
▪ Same definition and architecture diagram as the previous slide; a minimal sketch of the parallel operations follows this slide
This talk focuses on executor behavior.
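To make the parallel operations above concrete, here is a minimal, self-contained sketch; the SparkSession setup, application name, and sample data are illustrative assumptions, not taken from the deck.

// Minimal sketch, assuming a local SparkSession (setup and data are illustrative)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("example").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(1, 2, 3, 4).toDS        // a distributed, immutable Dataset[Int]
val result = ds.map(_ * 2)           // transform every element in parallel
               .filter(_ > 2)        // keep a subset
               .reduce(_ + _)        // aggregate across partitions
println(result)                      // 18
spark.stop()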
4. How Is Code on Each Executor Generated?
▪ The program, written as an embedded DSL, is translated to Java code through
analysis and optimizations in Spark (a sketch for inspecting the generated code
follows this slide)
val ds: Dataset[Array[Int]] = Seq(Array(0, 2), Array(1, 3))
  .toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)

while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  ArrayData a = row.getArray(0);
  …
}
[Diagram] Spark Program (DataFrame / Dataset) → SQL Analyzer → Rule-based
Optimizer → Code Generator → Generated Java code running on the Java virtual
machine; the cached data (0, 2) and (1, 3) is held in Column 0 as Row 0 and Row 1.
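A hedged sketch of how this translation can be inspected from spark-shell in Spark 2.x; the Dataset is the one from this slide, and debugCodegen() is provided by the execution.debug package.

import spark.implicits._
import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Dataset

val ds  = Seq(Array(0, 2), Array(1, 3)).toDS.cache
val ds1 = ds.filter(a => a(0) > 0).map(a => a)

ds1.explain()        // prints the physical plan chosen by the optimizer
ds1.debugCodegen()   // prints the Java code produced by the code generator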
5. Motivating Example
▪ A simple Spark program that performs filter and map operations
val ds = Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
[Diagram] The cached Dataset holds the arrays in Column 0: Row 0 = (0, 2), Row 1 = (1, 3).
6. Motivating Example
▪ Complicated code is generated from this simple Spark program
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0);
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1);
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
Note: the code that is actually generated is even more complicated.
7. Performance Issues in Generated Code
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj0[i] = Integer.valueOf(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0);
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];
      for (int i = 0; i < a.length; i++)
        obj1[i] = Integer.valueOf(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1);
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
Callouts on the code above: P1 at the row iterator (copy from columnar to
row-oriented data); P3 (boxing) where each Object[] is filled with Integer
objects; P2 at each GenericArrayData construction; P3 (unboxing) at each
toIntArray() call; P2 again at the GenericArrayData built for the result value.
val ds = Seq(
Array(0, 2),
Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
8. Our Contributions
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs
▪ Merged these optimizations into Spark 2.3 and later versions
These optimizations reduce the path length of handling data.
9. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
10. Basic Compilation Strategy of an Operator
▪ Use Volcano style [Graefe93]
– Operations are connected through an iterator, which makes it easy to add new operators
[Diagram] Each Operator consumes rows through a row-based iterator and feeds the
next Operator through another row-based iterator. The row schema is (Array, Int,
String), e.g. (0, 2) | 10 | “Tokyo”.
val ds = Seq((Array(0, 2), 10, “Tokyo”))
val ds1 = ds.filter(a => a.int_ > 0)
  .map(a => a)
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
11. Overview of Generated Code in Spark
▪ Put multiple operators into one loop [Neumann11] when possible
– Avoids the overhead of iterators
– Encourages compiler optimizations within the loop
(a plain-Scala sketch of the fusion follows this slide)
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// map(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
}
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int x = row.getInteger(1);
// filter(...)
...
// map(...)
...
}
val ds1 = ds.filter(a => a(0) > 0)
.map(a => a)
Volcano style (the two separate loops above) vs. whole-stage code generation (the single fused loop).
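A plain-Scala sketch (not Spark's generated code) of why fusing helps: Volcano style walks one pass per operator, while the fused version applies the filter and map bodies inside a single loop with no intermediate results.

val input = Array(-1, 2, 3)

// Volcano style: one pass (and one intermediate collection) per operator
val volcano = input.filter(_ > 0).map(_ * 2)

// Whole-stage style: one loop applies filter(...) and map(...) together
val fused = {
  val out = Array.newBuilder[Int]
  var i = 0
  while (i < input.length) {
    val x = input(i)
    if (x > 0) out += x * 2   // filter and map in the same loop body
    i += 1
  }
  out.result()
}
// volcano.sameElements(fused) == true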
12. Columnar Storage to Generated Code
▪ While the source uses columnar storage, generated code requires
data in row-based storage
[Diagram] Columnar Storage → row-based iterator → Operators (two operations in a
loop). The columnar storage has Array, Int, and String columns holding
(0, 2) | 10 | “Tokyo” and (1, 3) | 20 | “Mumbai”; the iterator hands the
generated loop one row at a time, e.g. (0, 2), 10, “Tokyo”.
val ds = Seq((Array(0, 2), 10, “Tokyo”),
             (Array(1, 3), 20, “Mumbai”)).toDS.cache
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
}
13. Problem: Data Copy From Columnar Storage
▪ A copy from a set of columns into a row occurs whenever the iterator is used
[Diagram] Same picture as the previous slide, with the data copy highlighted
between the Columnar Storage and the row-based iterator: every call to the
iterator assembles a row such as (0, 2), 10, “Tokyo” from the Array, Int, and
String columns before the Operators can use it.
val ds = Seq((Array(0, 2), 10, “Tokyo”),
             (Array(1, 3), 20, “Mumbai”)).toDS.cache
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  int x = row.getInteger(1);
  ...
}
14. Solution: Generate Optimized Code
▪ If the new analysis identifies that the source is columnar storage:
– Use a counter-based loop without a row-based iterator
▪ An index identifies the row position in the columnar storage
– Get data from the columnar storage directly (a plain-Scala sketch follows this slide)
[Diagram] Columnar Storage (Array, Int, and String columns holding
(0, 2) | 10 | “Tokyo” and (1, 3) | 20 | “Mumbai”) is read directly by the
Operators, with no row-based iterator in between.
Column column1 = df1.getColumn(1);
int sum = 0;
for (int i = 0; i < column1.numRows; i++) {
  int x = column1.getInteger(i);
  ...
}
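A plain-Scala sketch of the same idea; the Columns class is illustrative, not Spark's actual ColumnVector API. The counter-based loop reads the needed column in place instead of materializing a row per record.

final case class Columns(ints: Array[Int], strings: Array[String]) {
  def numRows: Int = ints.length
}

val cols = Columns(Array(10, 20), Array("Tokyo", "Mumbai"))

// Iterator style: assemble (copy) a row tuple for every record
val rows = (0 until cols.numRows).map(i => (cols.ints(i), cols.strings(i)))
val sumViaRows = rows.map(_._1).sum

// Counter-based loop: index the column directly, no row is built
var sum = 0
var i = 0
while (i < cols.numRows) {
  sum += cols.ints(i)
  i += 1
}
// sum == sumViaRows == 30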
15. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
16. Overview of Old Internal Array Representation
▪ Use an Object array so that each element can hold either a SQL NULL or a
primitive value (e.g. 0 or 2)
☺ Easy to represent NULL
Drawback: each primitive value (e.g. int) is held in a boxed object (e.g. an Integer object)
[Diagram] For val ds = Seq(Array(0, 2, NULL)), the old representation is an
Object array with len = 3: elements [0] and [1] point to Integer objects holding
0 and 2, and element [2] is NULL.
17. Problem: Boxing and Unboxing Occur
▪ A setter method (e.g. setInt(1, 2)) causes boxing (i.e. creates an object)
▪ A getter method (e.g. getInt(0)) causes unboxing
▪ Memory footprint increases because each element is a pointer to an object
int getInt(int i) {
  return ((Integer) array[i]).intValue();   // unboxing: from Integer object to int value
}
void setInt(int i, int v) {
  array[i] = new Integer(v);                // boxing: from int value to Integer object
}
[Diagram] The same Object array as on the previous slide: len = 3, [0] and [1]
point to Integer objects 0 and 2, [2] is NULL.
18. Solution: Use Primitive Type When Possible
▪ Keep a value in a primitive field when possible based on analysis
▪ Keep NULL in a separate bit field
☺Avoid boxing and unboxing
☺Reduce memory footprint
[Diagram] The new representation: an int array with len = 3 holding 0, 2, 0 at
[0], [1], [2], plus a separate bit field marking [0] and [1] as Non-Null and [2]
as Null.
int getInt(int i) {
  return array[i];
}
void setInt(int i, int v) {
  array[i] = v;
}
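A hedged Scala sketch of this representation; the class and field names are illustrative, not Spark's internal classes. Values stay in a primitive int array and NULLs live in a boolean array standing in for the bit field, so getInt/setInt never box.

final class IntArrayWithNulls(length: Int) {
  private val values = new Array[Int](length)      // primitive storage, no Integer objects
  private val isNull = new Array[Boolean](length)  // stand-in for the NULL bit field

  def setInt(i: Int, v: Int): Unit = { values(i) = v; isNull(i) = false }
  def setNull(i: Int): Unit        = { values(i) = 0; isNull(i) = true }
  def isNullAt(i: Int): Boolean    = isNull(i)
  def getInt(i: Int): Int          = values(i)     // no unboxing
}

val a = new IntArrayWithNulls(3)
a.setInt(0, 0); a.setInt(1, 2); a.setNull(2)
// a.getInt(1) == 2, a.isNullAt(2) == true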
19. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
20. Problem: Boxing Occurs
▪ When the data representation is converted from Spark's internal format to Java
objects, boxing occurs in the generated code
☺ Easy to handle NULL values
ArrayData a = …;
Object[] obj0 = new Object[a.length];
for (int i = 0; i < a.length; i++)
  obj0[i] = new Integer(a.getInt(i));   // boxing
ArrayData array_filter = new GenericArrayData(obj0);
int[] input_filter = array_filter.toIntArray();
21. Solution: Use Primitive Type When Possible
▪ When the analysis identifies that the array is a primitive-type array without
NULLs, generate code that uses a primitive array (an illustrative sketch follows below)
ArrayData a = …;
int[] array = a.toIntArray();
Note: data-representation optimization improves efficiency of toIntArray()
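An illustrative contrast of the two conversion paths in plain Scala, not the generated code; the helper names are hypothetical. The old path allocates one boxed Integer per element on the way in and unboxes on the way out, while the primitive path is a straight copy.

def toIntArrayViaBoxing(a: Array[Int]): Array[Int] = {
  val boxed = a.map(Int.box)             // boxing: one java.lang.Integer per element
  boxed.map(o => Int.unbox(o))           // unboxing back to primitive ints
}

def toIntArrayPrimitive(a: Array[Int]): Array[Int] =
  java.util.Arrays.copyOf(a, a.length)   // straight copy of primitive ints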
22. Generated Code without Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Iterator inputIterator = ...;
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    while (inputIterator.hasNext()) {
      Row inputRow = (Row) inputIterator.next();                // P1
      ArrayData a = inputRow.getArray(0);
      Object[] obj0 = new Object[a.length];                     // P3
      for (int i = 0; i < a.length; i++)
        obj0[i] = new Integer(a.getInt(i));
      ArrayData array_filter = new GenericArrayData(obj0);      // P2
      int[] input_filter = array_filter.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      Object[] obj1 = new Object[a.length];                     // P3
      for (int i = 0; i < a.length; i++)
        obj1[i] = new Integer(a.getInt(i));
      ArrayData array_map = new GenericArrayData(obj1);         // P2
      int[] input_map = array_map.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      ArrayData value = new GenericArrayData(mvalue);           // P2
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
23. Generated Code with Our Optimizations
▪ P1: Unnecessary data copy
▪ P2: Inefficient data representation
▪ P3: Unnecessary data conversions
val ds =
Seq(Array(0, 2), Array(1, 3))
.toDS.cache
val ds1 = ds
.filter(a => a(0) > 0)
.map(a => a)
final class GeneratedIterator {
  Column column0 = ...getColumn(0);
  Row projectRow = new Row(1);
  RowWriter rowWriter = new RowWriter(projectRow);
  protected void processNext() {
    for (int i = 0; i < column0.numRows(); i++) {
      // eliminated data copy (P1)
      ArrayData a = column0.getArray(i);
      // eliminated data conversion (P3)
      int[] input_filter = a.toIntArray();
      boolean fvalue = (Boolean)filter_func.apply(input_filter);
      if (!fvalue) continue;
      // eliminated data conversion (P3)
      int[] input_map = a.toIntArray();
      int[] mvalue = (int[])map_func.apply(input_map);
      // use efficient data representation (P2)
      ArrayData value = new IntArrayData(mvalue);
      rowWriter.write(0, value);
      appendRow(projectRow);
    }
  }
}
24. Outline
▪ Problems
▪ Eliminate unnecessary data copy (Data-copy)
▪ Improve efficiency of data representation (Data-representation)
▪ Eliminate unnecessary data conversion (Data-conversion)
▪ Experiments
25. Performance Evaluation Methodology
▪ Measured the performance improvement of two types of applications using our optimizations
– Database: TPC-H
– Machine learning: logistic regression and k-means
▪ Experimental environment
– Five machines, each with a 16-core Intel Xeon E5-2683 v4 CPU (2.1 GHz) and 128 GB of RAM
▪ One for the driver and four for executors
– Spark 2.2
– OpenJDK 1.8.0_181 with a 96 GB heap using the default collection policy
26. Performance Improvements of TPC-H Queries
▪ Achieved up to 1.41x performance improvement
– 1.10x on the geometric mean
▪ Accomplished by the data-copy optimization alone
– No arrays are used in TPC-H
[Bar chart] Performance improvement over no optimization for TPC-H queries
Q1–Q22 with the data-copy optimization; y-axis from 1.0 to 1.5, scale
factor = 10, higher is better.
27. Performance Improvements of ML Applications
▪ Achieved up to 1.42x performance improvement for logistic regression
– 1.21x on the geometric mean
▪ Accomplished mainly by the array-representation optimization
– The columnar-storage (data-copy) optimization contributed only slightly
[Bar chart] Performance improvement over no optimization for k-means (5M data
points with 200 dimensions) and logistic regression (32M data points with 200
dimensions), comparing the data-representation and data-conversion optimizations
against all optimizations; y-axis from 1.0 to 1.5, higher is better.
28. Cycle Breakdown for Logistic Regression
▪ Data conversion took 28% of the cycles without the data-representation and
data-conversion optimizations
[Stacked bar chart] Percentage of consumed cycles:
                  w/o optimizations   w/ optimizations
Data conversion        28.1%                0%
Computation            60.1%              87.6%
Others                 11.8%              12.4%
29. Conclusion
▪ Revealed performance issues in generated code from a Spark program
▪ Devised three optimizations
– to eliminate unnecessary data copy (Data-copy)
– to improve efficiency of data representation (Data-representation)
– to eliminate unnecessary data conversion (Data-conversion)
▪ Achieved up to 1.4x performance improvements
– 22 TPC-H queries
– Two machine learning programs: logistic regression and k-means
▪ Merged these optimizations into Spark 2.3 and later versions
30. Acknowledgments
▪ Thanks to the Apache Spark community for suggestions on merging our
optimizations into Apache Spark, especially
– Wenchen Fan, Herman van Hovell, Liang-Chi Hsieh, Takuya Ueshin, Sameer
Agarwal, Andrew Or, Davies Liu, Nong Li, and Reynold Xin