The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
"This session will focus on the challenges of replacing existing Relational DataBase and Data Warehouse technologies with Open Source components. Jason Han will base his presentation on his experience migrating Korea Telecom (KT’s) CDR data from Oracle to Hadoop, which required converting many Oracle SQL queries to Hive HQL queries. He will cover the differences between SQL and HQL; the implementation of Oracle’s basic/analytics functions with MapReduce; the use of Sqoop for bulk loading RDB data into Hadoop; and the use of Apache Flume for collecting fast-streamed CDR data. He’ll also discuss Lucene and ElasticSearch for near-realtime distributed indexing and searching. You’ll learn tips for migrating existing enterprise big data to open source, and gain insight into whether this strategy is suitable for your own data.
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.
Advanced Cassandra Operations via JMX (Nate McCall, The Last Pickle) | C* Sum...DataStax
Advanced Apache Cassandra operations depend on an understanding of which features are available via the JMX interface. While nodetool exposes many of these, the most useful are still waiting to be discovered. The JMX interface allows the code base to expose functions that operate directly on internal structures, making real-time changes to the way the process runs. With this skill in your toolkit there is no limit to the changes you can make.
In this talk Nate McCall, CTO at The Last Pickle, will explain how to explore, secure, and invoke the JMX interface exposed by Cassandra. He'll then move on to what you can do with it such as compacting specific SSTables, changing compaction on a single node, managing repairs, diagnosing latency, viewing cross node timeouts, and others. Whether you are a developer or operator, new or experienced, you will be given a thorough understanding of what all is available via JMX without having to consult the code on your own.
About the Speaker
Nate McCall CTO, The Last Pickle
Nate McCall has 16 years of server-side systems and software development experience. He started his involvement in the Cassandra community in the late fall of 2009 when he became one of the original developers on the Hector Java client. He has contributed a number of patches over the years to the Apache Cassandra code base and continues to be actively involved on the mail lists, issue system and IRC. He has been a DataStax MVP every year since the inception of the program.
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector and then demonstrating how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming. (A small streaming-aggregation sketch follows the list below.)
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
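To make the implicit case above concrete, here is a minimal sketch (not from the talk) of a streaming aggregation whose per-group counts are exactly the state that Structured Streaming keeps in its State Stores. The rate source, window size, and checkpoint path are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stateful-agg-sketch").getOrCreate()

# A built-in test source that emits (timestamp, value) rows.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The running counts per window are stateful: Structured Streaming keeps them
# in versioned State Stores on the executors and restores them from the
# checkpoint after a failure.
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/stateful-agg-checkpoint")  # hypothetical path
         .start())
query.awaitTermination()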
We will see the internal architecture of a Spark cluster, i.e., what the driver, workers, executors, and cluster manager are, how a Spark program runs on the cluster, and what jobs, stages, and tasks are.
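As a rough illustration of those terms (a sketch, not part of the abstract), the snippet below runs one action on a small RDD: the driver turns it into a job, splits the job into two stages at the shuffle, and schedules one task per partition on the executors.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("cluster-terms-sketch").getOrCreate()
sc = spark.sparkContext

# 4 partitions -> 4 tasks per stage.
pairs = sc.parallelize(range(1000), 4).map(lambda x: (x % 10, 1))

# reduceByKey introduces a shuffle, so collect() triggers a job with two
# stages: the map stage before the shuffle and the reduce stage after it.
result = pairs.reduceByKey(lambda a, b: a + b).collect()
print(result)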
Redis - for duplicate detection on real time streamCodemotion
Roberto "frank" Franchini presenta a Codemotion Techmeetup Torino Redis, un data structure server che può utilizzare come chiavi stringhe, hashes, lists, sets, sorted sets, bitmaps e hyperloglogs
.
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOxInfluxData
In this talk we provide an overview of query execution in InfluxDB IOx, describing how data becomes queryable once it is ingested, both via SQL and via Flux and InfluxQL (through the storage gRPC APIs).
Slide deck presented at http://devternity.com/ on MongoDB internals. We review the usage patterns of MongoDB, the different storage engines and persistence models, as well as the definition of documents and general data structures.
Parallelizing Existing R Packages with SparkRDatabricks
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code, for example with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. The most important considerations will be performance and correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki
This talk was originally presented at Spark Summit East 2017.
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.
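To give a flavour of the API the webinar covers, here is a minimal GraphFrames sketch (a sketch only; the airport data and column values are made up, and it assumes the graphframes package is installed).

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("SFO", "San Francisco"), ("JFK", "New York"), ("ORD", "Chicago")],
    ["id", "city"])
edges = spark.createDataFrame(
    [("SFO", "JFK", 25), ("JFK", "ORD", 10), ("ORD", "SFO", 40)],
    ["src", "dst", "delay"])
g = GraphFrame(vertices, edges)

# Simple DataFrame/SQL-style query: average delay per origin airport.
g.edges.groupBy("src").avg("delay").show()

# Motif finding: pairs of airports with flights in both directions.
g.find("(a)-[e1]->(b); (b)-[e2]->(a)").show()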
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
The Pregel Programming Model with Spark GraphXAndrea Iacono
GraphX is Apache Spark's API for graph distributed computing based on the Pregel programming model. In this talk we'll see a brief introduction to Pregel and then we'll focus on transforming standard graph algorithms in their distributed counterpart using GraphX to speedup performance in a distributed environment.
These slides show how to integrate with powerful tools in the big data area. Using Spark to do data preprocessing and then produce the training data set for scikit-learn can cause performance issues, so I share some tips on how to overcome them.
John Davies: "High Performance Java Binary" from JavaZone 2015C24 Technologies
"High Performance Java Binary instead of Objects" This talk on Java compaction was delivered by John Davies from C24 Technologies at JavaZone 2015 in Oslo.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Jump Start into Apache® Spark™ and DatabricksDatabricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark such as backing relational databases, or data warehouses. For that, Spark provides Data Source APIs, which are a pluggable mechanism for accessing structured data through Spark SQL. Data Source APIs are tightly integrated with the Spark Optimizer. They provide optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they only provide a subset of the functionality that can be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is star-schema join, which can be simply viewed as filters applied to the fact table. Today, Spark Optimizer recognizes star-schema joins based on heuristics and executes star-joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star-join to the external data source in order to take advantage of multi-column indexes defined on the fact tables, and other star-join optimization techniques implemented by the relational data source.
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
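For readers new to the D-Stream model described above, a minimal sketch (not from the talk) of a micro-batch word count over a socket source with a 1-second batch interval; the host and port are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Each batch interval produces an RDD of the lines received in that second.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()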
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and partly self-paced workshop, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Strata NYC 2015 - What's coming for the Spark communityDatabricks
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with at least 8 GB of RAM and the Chrome or Firefox browser installed. Introductory or basic knowledge of Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Similar to Building a modern Application with DataFrames (20)
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
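A rough sketch of the converter workflow described above, assuming the petastorm package and its Spark converter API; the cache directory and the toy DataFrame are placeholders, and exact signatures should be checked against the Petastorm documentation.

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName("petastorm-sketch").getOrCreate()

# Petastorm materializes the DataFrame to a cache directory before conversion.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

df = spark.range(0, 1000).selectExpr("id", "id * 2 AS feature")
converter = make_spark_converter(df)

# Hand the cached data to TensorFlow as a tf.data.Dataset.
with converter.make_tf_dataset(batch_size=32) as dataset:
    for batch in dataset.take(1):
        print(batch)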
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not abelian groups.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark. (A small counter sketch follows the outline below.)
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
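As one possible shape of the "Redis hashes as distributed counters" idea above (a sketch, not Adobe's code; the host, hash name, and status field are made up), each Spark partition batches its increments into a single pipelined round trip.

import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-counter-sketch").getOrCreate()
records = spark.sparkContext.parallelize([{"status": "ok"}, {"status": "error"}] * 100, 8)

def count_partition(rows):
    # One connection and one pipelined HINCRBY batch per partition keeps the
    # per-record overhead low; retries and speculative tasks may double-count,
    # which is one of the precautions the talk calls out.
    client = redis.Redis(host="localhost", port=6379)
    pipe = client.pipeline(transaction=False)
    for row in rows:
        pipe.hincrby("job:counters", row["status"], 1)
    pipe.execute()

records.foreachPartition(count_partition)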
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
17. Spark Data Model
DataFrame with 4 partitions: logLinesDF
Each partition holds rows with the schema Type (Str), Time (Int), Msg (Str):

Partition 1: (Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1)
Partition 2: (Info, ts, msg7), (Warn, ts, msg2), (Error, ts, msg9)
Partition 3: (Warn, ts, msg0), (Warn, ts, msg2), (Info, ts, msg11)
Partition 4: (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1)

df.rdd.partitions.size = 4
18. Spark Data Model
[Diagram: the logLinesDF partitions are distributed across executors (Ex), several partitions per executor]
more partitions = more parallelism
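As a small illustration of the "more partitions = more parallelism" point (a sketch, not part of the deck, using the 1.x-era sqlContext API the slides assume), you can inspect and change a DataFrame's partitioning:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[4]", "partitions-sketch")
sqlContext = SQLContext(sc)

logLinesDF = sqlContext.createDataFrame(
    [("Error", 1, "msg1"), ("Warn", 2, "msg2"), ("Info", 3, "msg7"), ("Error", 4, "msg9")],
    ["Type", "Time", "Msg"])

# Each partition is processed by one task, so the partition count bounds parallelism.
print(logLinesDF.rdd.getNumPartitions())

# Repartitioning spreads the rows over more partitions (at the cost of a shuffle).
print(logLinesDF.repartition(8).rdd.getNumPartitions())  # 8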
19. DataFrame Benefits
• Easier to program
• Significantly fewer lines of code
• Improved performance via intelligent optimizations and code generation
20. Write Less Code: Compute an Average
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1]))
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
.map(lambda x: [x[0], x[1][0] / x[1][1]])
.collect()
20
21. Write Less Code: Compute an Average
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1]))
.reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
.map(lambda x: [x[0], x[1][0] / x[1][1]])
.collect()
Using DataFrames
sqlCtx.table("people")
.groupBy("name")
.agg(avg("age"))
.collect()
Full API Docs
• Python
• Scala
• Java
• R
21
22. 22
DataFrames are evaluated lazily
[Diagram: DataFrames DF-1, DF-2, and DF-3 derived from data in distributed storage; no partitions are materialized until an action runs]
24. 24
DataFrames are evaluated lazily
[Diagram repeated from slide 22: DF-1, DF-2, and DF-3 derived from data in distributed storage]
25. Transformations, Actions, Laziness
Transformation examples: filter, select, drop, intersect, join
Action examples: count, collect, show, head, take
25
DataFrames are lazy. Transformations contribute to the query plan, but they don't execute anything.
Actions cause the execution of the query.
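As a rough sketch of that laziness, reusing the hypothetical logLinesDF from the earlier sketch: the first two lines only build up a query plan, and nothing runs until the action.
val errors = logLinesDF.filter(logLinesDF("logType") === "Error")  // transformation: no job yet
  .select("time", "msg")                                           // still no job
errors.count()                                                     // action: the query executes here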
27. Spark SQL
Graduated from Alpha in 1.3
– Part of the core distribution since Spark 1.0 (April 2014)
[Charts: # of commits per month (0-250) and # of contributors (0-200)]
27
28. 28
Which context?
SQLContext
• Basic functionality
HiveContext
• More advanced
• Superset of SQLContext
• More complete HiveQL parser
• Can read from the Hive metastore and Hive tables
• Access to Hive UDFs
Improved multi-version support in 1.4
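A minimal sketch of constructing each context (Spark 1.3/1.4-era API; sc is an existing SparkContext):
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlContext  = new SQLContext(sc)    // basic functionality
val hiveContext = new HiveContext(sc)   // superset: fuller HiveQL parser, Hive metastore/tables, Hive UDFs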
29. Construct a DataFrame
29
# Construct a DataFrame from a "users" table in Hive.
df = sqlContext.read.table("users")
# Construct a DataFrame from a JSON log file in S3.
df = sqlContext.read.json("s3n://someBucket/path/to/data.json")
val people = sqlContext.read.parquet("...")
DataFrame people = sqlContext.read().parquet("...")
30. Use DataFrames
30
# Create a new DataFrame that contains only "young" users
young = users.filter(users["age"] < 21)
# Alternatively, using a Pandas-like syntax
young = users[users.age < 21]
# Increment everybody's age by 1
young.select(young["name"], young["age"] + 1)
# Count the number of young users by gender
young.groupBy("gender").count()
# Join young users with another DataFrame, logs
young.join(logs, logs["userId"] == young["userId"], "left_outer")
31. DataFrames and Spark SQL
31
young.registerTempTable("young")
sqlContext.sql("SELECT count(*) FROM young")
38. Creating DataFrames
[Diagram: a DataFrame (DF) can be created from an existing RDD of (E, T, M) records or from external Data Sources]
39. 39
Data Sources API
• Provides a pluggable mechanism for accessing structured data
through Spark SQL
• Tight optimizer integration means filtering and column pruning
can often be pushed all the way down to data sources
• Supports mounting external sources as temp tables
• Introduced in Spark 1.2 via SPARK-3247
40. 40
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write
DataFrames
using a variety of formats.
40
[Logos: built-in and external data source formats, e.g., JSON, JDBC, and more]
Find more sources at http://spark-packages.org
42. 42
DataFrames: Reading from JDBC
1.3
• Supports any JDBC-compatible RDBMS: MySQL, Postgres, H2, etc.
• Unlike the pure RDD implementation (JdbcRDD), this supports
predicate pushdown and auto-converts the data into a DataFrame
• Since you get a DataFrame back, it’s usable in Java/Python/R/Scala.
• JDBC server allows multiple users to share one Spark cluster
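As a hedged sketch of the JDBC data source (Spark 1.4-style reader API; the URL, table name, column, and credentials below are placeholders, not from the deck):
val employees = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db.example.com:3306/hr")
  .option("dbtable", "employees")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// Simple predicates on the resulting DataFrame can be pushed down to the database.
employees.filter(employees("salary") > 50000).show()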
43. Read Less Data
The fastest way to process big data is to never read it. Spark SQL can help you read less data automatically:
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)¹
• Skipping data using statistics (e.g., min, max)²
• Pushing predicates into storage systems (e.g., JDBC)
¹ Only supported for Parquet and Hive; more support coming in Spark 1.4
² Turned off by default in Spark 1.3
43
44. Parquet
• Fall 2012: Twitter & Cloudera merge efforts to develop a columnar format
• July 2013: 1.0 release
• May 2014: Apache Incubator, 40+ contributors
• Limits I/O: scans/reads only the columns that are needed
• Saves space: columnar layout compresses better
[Diagram: logical table representation vs. row layout vs. column layout]
45. Source: parquet.apache.org
Reading:
• Readers are expected to first read the file metadata to find all the column chunks they are interested in.
• The column chunks should then be read sequentially.
Writing:
• Metadata is written after the data to allow for single-pass writing.
46. Parquet Features
1. Metadata merging
• Allows developers to easily add/remove columns in data files
• Spark will scan the metadata for all files and merge the schemas
2. Auto-discovery of data that has been partitioned into folders
• Spark then prunes which folders are scanned based on predicates
So you can greatly speed up queries simply by breaking up data into folders (see the sketch below):
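A hedged sketch of both features; the paths are illustrative, and the "mergeSchema" option name appeared around Spark 1.5, so treat it as an assumption rather than the deck's exact API.
// Folder layout produced by a partitioned save:
//   /data/events/year=2014/month=02/part-00000.parquet
//   /data/events/year=2014/month=03/part-00000.parquet
val events = sqlContext.read
  .option("mergeSchema", "true")   // merge schemas across files
  .parquet("/data/events")

// Partition pruning: only folders matching the predicate are scanned.
events.filter(events("year") === 2014 && events("month") === 2).count()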
47. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
47
48. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
read and write
functions create new
builders for doing I/O
48
49. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
Builder methods
specify:
• Format
• Partitioning
• Handling of
existing data
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
49
50. Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of
formats:
load(…), save(…)
or saveAsTable(…)
finish the I/O
specification
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
50
51. 51
How are statistics used to improve DataFrames performance?
• Statistics are logged when caching
• During reads, these statistics can be used to skip some
cached partitions
• InMemoryColumnarTableScan can now skip partitions that cannot
possibly contain any matching rows
[Diagram: a cached DataFrame with three partitions whose column "a" statistics are max(a)=9, max(a)=7, and max(a)=8; for the predicate a = 8, the partition with max(a)=7 can be skipped]
Reference:
• https://github.com/apache/spark/pull/1883
• https://github.com/apache/spark/pull/2188
Filters Supported:
• =, <, <=, >, >=
52. DataFrame # of Partitions after Shuffle
[Diagram: DF-1 is shuffled into DF-2]
The number of post-shuffle partitions is controlled by spark.sql.shuffle.partitions (defaults to 200), set via sqlContext.setConf(key, value).
Spark 1.6: Adaptive Shuffle
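A small sketch of tuning that setting; the DataFrame df and its "type" column are assumptions for illustration.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")

// Any wide operation (groupBy, join, ...) now shuffles into 64 partitions.
val counts = df.groupBy("type").count()
println(counts.rdd.partitions.size)   // expected: 64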
53. Caching a DataFrame
[Diagram: DF-1 with all of its partitions held in memory]
Spark SQL will re-encode the data into byte buffers before caching so that there is less pressure on the GC.
.cache()
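A minimal sketch of what that looks like, again using the hypothetical logLinesDF from the earlier sketch:
logLinesDF.cache()                                             // mark the DataFrame for columnar caching
logLinesDF.filter(logLinesDF("logType") === "Error").count()   // first action materializes the cache
logLinesDF.filter(logLinesDF("logType") === "Warn").count()    // later scans read the cached buffers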
55. Schema Inference
What if your data file doesn't have a schema? (e.g., you're reading a CSV file or a plain text file.)
You can create an RDD of a particular type and let Spark infer the schema from that type. We'll see how to do that in a moment.
You can also use the API to specify the schema programmatically (a sketch follows below).
(It's better to use a schema-oriented input source if you can, though.)
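For the programmatic route, here is a hedged sketch using StructType; the file path and column names simply mirror the CSV example that follows and are assumptions.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = true),
  StructField("lastName",  StringType, nullable = true),
  StructField("gender",    StringType, nullable = true),
  StructField("age",       IntegerType, nullable = true)))

val rowRDD = sc.textFile("people.csv")
  .map(_.split(","))
  .map(cols => Row(cols(0), cols(1), cols(2), cols(3).toInt))

val people = sqlContext.createDataFrame(rowRDD, schema)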
56. Schema Inference Example
Suppose you have a (text) file that looks like
this:
56
The file has no schema,
but it’s obvious there is
one:
First name:string
Last name: string
Gender: string
Age: integer
Let’s see how to get Spark to infer the schema.
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
57. Schema Inference :: Scala
57
import sqlContext.implicits._
case class Person(firstName: String,
lastName: String,
gender: String,
age: Int)
val rdd = sc.textFile("people.csv")
val peopleRDD = rdd.map { line =>
val cols = line.split(",")
Person(cols(0), cols(1), cols(2), cols(3).toInt)
}
val df = peopleRDD.toDF
// df: DataFrame = [firstName: string, lastName: string,
gender: string, age: int]
58. A brief look at spark-csv
Let’s assume our data file has a header:
58
first_name,last_name,gender,age
Erin,Shannon,F,42
Norman,Lockwood,M,81
Miguel,Ruiz,M,64
Rosalita,Ramirez,F,14
Ally,Garcia,F,39
Claire,McBride,F,23
Abigail,Cottrell,F,75
José,Rivera,M,59
Ravi,Dasgupta,M,25
…
59. A brief look at spark-csv
With spark-csv, we can simply create a DataFrame
directly from our CSV file.
59
// Scala
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
load("people.csv")
# Python
df = (sqlContext.read.format("com.databricks.spark.csv")
      .load("people.csv", header="true"))
60. 60
DataFrames: Under the hood
SQL AST / DataFrame
-> Unresolved Logical Plan (Analysis, using the Catalog)
-> Logical Plan (Logical Optimization)
-> Optimized Logical Plan (Physical Planning)
-> Physical Plans (Cost Model selects one)
-> Selected Physical Plan (Code Generation)
-> RDDs
DataFrames and SQL share the same optimization/execution pipeline
61. 61
DataFrames: Under the hood
[Same pipeline as above: DataFrame operations enter as an unresolved logical plan and flow through analysis, optimization, and physical planning to the selected physical plan]
62. Catalyst Optimizations
Logical optimizations:
• Push filter predicates down to the data source, so irrelevant data can be skipped
• Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding
• RDBMS: reduce the amount of data traffic by pushing predicates down
Create physical plan & generate JVM bytecode:
• Catalyst compiles operations into physical plans for execution and generates JVM bytecode
• Intelligently choose between broadcast joins and shuffle joins to reduce network traffic
• Lower-level optimizations: eliminate expensive object allocations and reduce virtual function calls
63. Not Just Less Code: Faster Implementations
63
[Chart: time to aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
https://gist.github.com/rxin/c1592c133e4bccf515dd
64. Catalyst Goals
64
1) Make it easy to add new optimization techniques and features
to Spark SQL
2) Enable developers to extend the optimizer
• For example, to add data source specific rules that can push filtering
or aggregation into external storage systems
• Or to support new data types
65. Catalyst: Trees
65
• Tree: the main data type in Catalyst
• A tree is made of node objects
• Each node has a type and zero or more children
• New node types are defined as subclasses of the TreeNode class
• Nodes are immutable and are manipulated via functional transformations
Imagine we have the following three node classes for a very simple expression language:
• Literal(value: Int): a constant value
• Attribute(name: String): an attribute from an input row, e.g., "x"
• Add(left: TreeNode, right: TreeNode): the sum of two expressions
Build a tree for the expression x + (1+2).
In Scala code: Add(Attribute("x"), Add(Literal(1), Literal(2)))
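A toy sketch of those three node classes and the tree; these are not Catalyst's real definitions, just enough to follow the rule examples on the next slides.
sealed trait TreeNode
case class Literal(value: Int) extends TreeNode
case class Attribute(name: String) extends TreeNode
case class Add(left: TreeNode, right: TreeNode) extends TreeNode

// The tree for x + (1 + 2):
val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))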
66. Catalyst: Rules
66
• Rules: trees are manipulated using rules
• A rule is a function from a tree to another tree
• Commonly, Catalyst will use a set of pattern-matching functions to find and replace subtrees
• Trees offer a transform method that applies a pattern-matching function recursively on all nodes of the tree, transforming the ones that match each pattern to a result
Let's implement a rule that folds Add operations between constants:
tree.transform {
  case Add(Literal(c1), Literal(c2)) =>
    Literal(c1+c2)
}
Applied to the tree x + (1+2), this yields x + 3.
• The rule may only match a subset of all possible input trees
• Catalyst tests which parts of a tree a given rule may apply to, and skips over or descends into subtrees that do not match
• Rules don't need to be modified as new types of operators are added
67. Catalyst: Rules
67
Rules can match multiple patterns in the same transform call:
tree.transform {
  case Add(Literal(c1), Literal(c2)) =>
    Literal(c1+c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}
Applied to the tree x + (1+2), this still yields x + 3.
Applied to the tree (x+0) + (3+3), it now yields x + 6.
68. Catalyst: Rules
68
• Rules may need to execute multiple times to fully transform a tree (example: constant-folding larger trees)
• Rules are grouped into batches
• Each batch is executed to a fixed point, i.e., until the tree stops changing (see the sketch below)
  Example: a first batch analyzes an expression to assign types to all attributes; a second batch uses the new types to do constant folding
• Rule conditions and their bodies contain arbitrary Scala code
• Takeaway: functional transformations on immutable trees (easy to reason about & debug)
• Coming soon: enable parallelization in the optimizer
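A toy sketch of running a rule to a fixed point, reusing the Literal/Attribute/Add classes from the earlier sketch; this is a simplified stand-in for Catalyst's rule batches, not the real implementation.
// Apply a rewrite rule bottom-up over a tree (toy version of transform).
def transform(node: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = node match {
    case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
    case other     => other
  }
  rule.applyOrElse(withNewChildren, identity[TreeNode])
}

val constantFold: PartialFunction[TreeNode, TreeNode] = {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0))         => left
  case Add(Literal(0), right)        => right
}

// Re-apply the rule until the tree stops changing (the "fixed point").
def toFixedPoint(tree: TreeNode): TreeNode = {
  val next = transform(tree)(constantFold)
  if (next == tree) tree else toFixedPoint(next)
}

// (x + 0) + (3 + 3)  becomes  x + 6
toFixedPoint(Add(Add(Attribute("x"), Literal(0)), Add(Literal(3), Literal(3))))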
69. 69
Using Catalyst in Spark SQL
[Pipeline diagram: SQL AST / DataFrame -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Selected Physical Plan -> RDDs, with the Catalog used during Analysis and a Cost Model used to select a physical plan]
Analysis: analyzing a logical plan to resolve references
Logical Optimization: logical plan optimization
Physical Planning: physical planning
Code Generation: compile parts of the query to Java bytecode
70. Catalyst: Analysis
[Diagram: Analysis turns an unresolved logical plan into a resolved logical plan, using the Catalog]
• A relation may contain unresolved attribute references or relations
• Example: "SELECT col FROM sales"
• The type of col is unknown
• Even whether col is a valid column name is unknown (until we look up the table)
71. Catalyst: Analysis
[Diagram: Analysis turns an unresolved logical plan into a resolved logical plan, using the Catalog]
• An attribute is unresolved if:
  • Catalyst doesn't know its type
  • Catalyst has not matched it to an input table
• Catalyst will use rules and a Catalog object (which tracks all the tables in all data sources) to resolve these attributes
Step 1: Build the "unresolved logical plan"
Step 2: Apply rules
Analysis Rules
• Look up relations by name in the Catalog
• Map named attributes (like col) to the input
• Determine which attributes refer to the same value, to give them a unique ID (for later optimizations)
• Propagate and coerce types through expressions
  • We can't know the return type of 1 + col until we have resolved col
75. Catalyst: Physical Planning
75
• Spark SQL takes a logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine:
  1. mapPartitions()
  2. new ShuffledRDD
  3. zipPartitions()
• Currently, cost-based optimization is only used to select a join algorithm: broadcast join vs. traditional join
• The physical planner also performs rule-based physical optimizations, like pipelining projections or filters into one Spark map operation
• It can also push operations from the logical plan into data sources (predicate pushdown)
[Diagram: Optimized Logical Plan -> Physical Planning -> Physical Plans]
77. Catalyst: Code Generation
77
• Generates Java bytecode to run on each machine
• Catalyst relies on Janino to make code generation simple (FYI: it used quasiquotes previously, but now uses Janino)
[Diagram: Selected Physical Plan -> Code Generation -> RDDs]
This code-gen function converts an expression like (x+y) + 1 into a Scala AST:
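The generated code itself is not reproduced in this transcript. As a simplified stand-in, reusing the toy Literal/Attribute/Add classes from the earlier sketch and emitting plain source text rather than a real AST, the idea looks roughly like this:
def genCode(node: TreeNode): String = node match {
  case Literal(value)   => value.toString
  case Attribute(name)  => s"""row.get("$name")"""   // per the editor's notes: Attribute("x") becomes row.get("x")
  case Add(left, right) => s"${genCode(left)} + ${genCode(right)}"
}

// Add(Literal(1), Attribute("x"))  ==>  1 + row.get("x")
println(genCode(Add(Literal(1), Attribute("x"))))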
79. Seamlessly Integrated
Intermix DataFrame operations with
custom Python, Java, R, or Scala code
zipToCity = udf(lambda zipCode: <custom logic here>)
def add_demographics(events):
  u = sqlCtx.table("users")
  return (events
    .join(u, events.user_id == u.user_id)
    .withColumn("city", zipToCity(u.zip)))
Augments any DataFrame that contains user_id
79
80. Optimize Entire Pipelines
Optimization happens as late as possible, therefore
Spark SQL can optimize even across functions.
80
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events
.where(events.city == "San Francisco")
.select(events.timestamp)
.collect()
81. 81
def add_demographics(events):
  u = sqlCtx.table("users")                  # Load Hive table
  return (events
    .join(u, events.user_id == u.user_id)    # Join on user_id
    .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()
[Diagram: logical plan: a filter over a join of the events file and the users table. Joining all users is expensive; ideally we only join the relevant users. Physical plan: join of scan(events) with a filter over scan(users)]
81
82. 82
def add_demographics(events):
  u = sqlCtx.table("users")                  # Load partitioned Hive table
  return (events
    .join(u, events.user_id == u.user_id)    # Join on user_id
    .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column
events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()
[Diagram: the logical plan (a filter over a join of the events file and the users table) and the earlier physical plan (join of scan(events) with a filter over scan(users)), compared with the optimized physical plan with predicate pushdown and column pruning: a join of optimized scan(events) with optimized scan(users)]
82
83. Spark 1.5 – Speed / Robustness
Project Tungsten
– Tightly packed binary structures
– Fully-accounted memory with automatic spilling
– Reduced serialization costs
83
[Chart: average GC time per node (seconds) vs. relative data set size (1x-16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]
84. Spark 1.5 – Improved Function Library
100+ native functions with optimized codegen implementations
– String manipulation: concat, format_string, lower, lpad
– Date/Time: current_timestamp, date_format, date_add
– Math: sqrt, randn
– Other: monotonicallyIncreasingId, sparkPartitionId
84
from pyspark.sql.functions import *
yesterday = date_sub(current_date(), 1)
df2 = df.filter(df.created_at > yesterday)
import org.apache.spark.sql.functions._
val yesterday = date_sub(current_date(), 1)
val df2 = df.filter(df("created_at") > yesterday)
85. Window Functions
Before Spark 1.4, there were two kinds of functions in Spark SQL that could return a single value:
• Built-in functions or UDFs (e.g., round): take values from a single row as input and generate a single return value for every input row
• Aggregate functions (e.g., sum or max): operate on a group of rows and calculate a single return value for every group
New with Spark 1.4:
• Window functions (e.g., moving average, cumulative sum): operate on a group of rows while still returning a single value for every input row
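A hedged sketch of a moving average with the Spark 1.4 window-function API; the DataFrame df and its name/date/amount columns are assumptions for illustration.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Average of the current row and the two preceding rows, per name, ordered by date.
val w = Window.partitionBy("name").orderBy("date").rowsBetween(-2, 0)
val withMovingAvg = df.withColumn("moving_avg", avg(df("amount")).over(w))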
86.
87. Streaming DataFrames
Umbrella ticket to track what's needed to
make streaming DataFrame a reality:
https://issues.apache.org/jira/browse/SPARK-8360
Editor's Notes
This saturated both disk and network layers
Old Spark API (T&A) is based on Java/Python objects
- this makes it hard for the engine to store compactly (Java objects in memory have a lot of extra space for things like class metadata, pointers to various things, etc.)
- cannot understand semantics of user functions
- so if you run a map function over just one field of the data, it still has to read the entire object into memory. Spark doesn't know you only cared about one field.
DataFrames were inspired by previous distributed data frame efforts, including Adatao’s DDF and Ayasdi’s BigDF. However, the main difference from these projects is that DataFrames go through the Catalyst optimizer, enabling optimized execution similar to that of Spark SQL queries.
a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
I’d say that DataFrame is a result of transformation of any other RDD. Your input RDD might contains strings and numbers. But as a result of transformation you end up with RDD that contains GenericRowWithSchema, which is what DataFrame actually is. So, I’d say that DataFrame is just sort of wrapper around simple RDD, which provides some additional and pretty useful stuff.
To compute an average. I have a dataset that is a list of names and ages. Want to figure out the average age for a given name. So, age distribution for a name…
Head is non-deterministic, could change between jobs. Just the first partition that materialized returns results.
Spark SQL is the only project (1.4+) that can read from multiple versions of Hive. Spark 1.5 can read from 0.12 – 1.2.
A lot of the Hive functionality is useful even if you don't have a Hive installation! Spark will automatically create a local copy of the Hive metastore so users can use window functions, Hive UDFs, and create persistent tables.
To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive’s dependencies in the default Spark build.
The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is "hiveql", though "sql" is also available.
The following example shows how to construct DataFrames in Python. A similar API is available in Scala and Java.
Once built, DataFrames provide a domain-specific language for distributed data manipulation. Here is an example of using DataFrames to manipulate the demographic data of a large population of users:
You can also incorporate SQL while working with DataFrames, using Spark SQL. This example counts the number of users in the young DataFrame.
But its not the same as if you called .cache() on an RDD[Row], since we reencode the data into bytebuffers before calling caching so that there is less pressure on the GC.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
the full RDD API cannot be exposed on DataFrames. Per Michael, absolute freedom for users restricts the types of optimizations that we can do.
Finally, a Data Source for reading from JDBC has been added as a built-in source for Spark SQL. Using this library, Spark SQL can extract data from any existing relational database that supports JDBC. Examples include MySQL, Postgres, H2, and more. Reading data from one of these systems is as simple as creating a virtual table that points to the external table. Data from this table can then be easily read in and joined with any of the other sources that Spark SQL supports.
This functionality is a great improvement over Spark’s earlier support for JDBC (i.e.,JdbcRDD). Unlike the pure RDD implementation, this new DataSource supports automatically pushing down predicates, converts the data into a DataFrame that can be easily joined, and is accessible from Python, Java, and SQL in addition to Scala.
Twitter and Cloudera merged efforts in 2012 to develop a columnar format
Parquet is a column based storage format. It gets its name from the patterns in parquet flooring. Optimized use case for parquet is when you only need a subset of the total columns.
Avro is better if you typically scan/read all of the fields in a row in each query.
Typically, one of the most expensive parts of reading and writing data is (de)serialization. Parquet supports predicate push-down and schema projection to target specific columns in your data for filtering and reading — keeping the cost of deserialization to a minimum.
Parquet compresses better because columns have a fixed data type (like string, integer, Boolean, etc). it is easier to apply any encoding schemes on columnar data which could even be column specific such as delta encoding for integers and prefix/dictionary encoding for strings. Also, due to the homogeneity in data, there is a lot more redundancy and duplicates in the values in a given column. This allows better compression in comparison to data stored in row format.
The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
Ideal row group size: 512 MB – 1 GB. Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file. Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write).
Data page size: 8 KB recommended. Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers).
https://parquet.apache.org/documentation/latest/
Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
Column chunk: A chunk of the data for a particular column. These live in a particular row group and is guaranteed to be contiguous in the file.
Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which is interleaved in a column chunk.
Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.
Metadata is written after the data to allow for single pass writing.
Readers are expected to first read the file metadata to find all the column chunks they are interested in. The columns chunks should then be read sequentially.
There are three types of metadata: file metadata, column (chunk) metadata and page header metadata. All thrift structures are serialized using the TCompactProtocol.
First, organizations that store lots of data in parquet often find themselves evolving the schema over time by adding or removing columns. With this release we add a new feature that will scan the metadata for all files, merging the schemas to come up with a unified representation of the data. This functionality allows developers to read data where the schema has changed overtime, without the need to perform expensive manual conversions.
-
In Spark 1.4, we plan to provide an interface that will allow other formats, such as ORC, JSON and CSV, to take advantage of this partitioning functionality.
On the builder you can specify methods…
Like do you want to overwrite data already there?
Load or save or saveAsTable are actions.
Note that by default in Spark SQL, there is a parameter called spark.sql.shuffle.partitions, which sets the # of partitions in a DataFrame after a shuffle (in case the user didn't manually specify it). Currently, Spark does not do any automatic determination of partitions; it just uses the # in that parameter. Doing more automatic determination is on our roadmap though. You can change this parameter using: sqlContext.setConf(key, value).
1.6 = adaptive shuffle: look at the output of the map side, then pick the # of reducers. Matei and Yin's hack day project.
Case classes are used when creating classes that primarily hold data.
When your class is basically a data-holder, case classes simplify your
code and perform common work.
With case classes, unlike regular classes, we don’t have to use the
new keyword when creating an object.
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection (seen in green) and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.
Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit, // you can use custom classes that implement the Product interface.
- - - -
What is a case class (vs a normal class)?
Original purpose was used for matching, but it’s used for more now
Scala’s version of a java bean (java has classes primary for data (gettings and settings) and there’s classes mostly for operations
Case classes are mostly for data
Scala can do reflection to establish/infer the schema of the df (seen in green).
You’d want to be more robust about parsing CSV in real life.
peopleRDD.toDF uses (a) Scala implicits and (b) the type of the RDD (RDD[Person]) to infer the schema
Mention that a case class, in Scala, is basically a Scala bean: A container for data, augmented with useful things by the Scala compiler.
Catalyst is a powerful new optimization framework. The Catalyst framework allows the developers behind Spark SQL to rapidly add new optimizations, enabling us to build a faster system more quickly.
Unlike the eagerly evaluated data frames in R and Python, DataFrames in Spark have their execution automatically optimized by a query optimizer called Catalyst.
Before any computation on a DataFrame starts, the Catalyst optimizer compiles the operations that were used to build the DataFrame into a physical plan for execution.
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower-level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
Since the optimizer generates JVM bytecode for execution, Python users will experience the same high performance as Scala and Java users.
The above chart compares the runtime performance of running group-by-aggregation on 10 million integer pairs on a single machine (source code). Since both Scala and Python DataFrame operations are compiled into JVM bytecode for execution, there is little difference between the two languages, and both outperform the vanilla Python RDD variant by a factor of 5 and Scala RDD variant by a factor of 2.
At its core, Catalyst contains a general library for representing trees and applying rules to manipulate them.
A tree is just a Scala object.
- -
These classes can be used to build up trees; for example, the tree for the expression x+(1+2), would be represented in Scala code as follows: (See Scala code and Diagram)
Pattern matching is a feature of many functional languages that allows extracting values from potentially nested structures of algebraic data types.
-
The case keyword here is Scala’s standard pattern matching syntax, and can be used to match on the type of an object as well as give names to extracted values (c1 and c2 here).
-
The pattern matching expression that is passed to transform is a partial function, meaning that it only needs to match to a subset of all possible input trees.
This ability means that rules only need to reason about the trees where a given optimization applies and not those that do not match. Thus, rules do not need to be modified as new types of operators are added to the system.
Rules (and Scala pattern matching in general) can match…
-
In the example above, repeated application would constant-fold larger trees, such as (x+0)+(3+3)
Examples below show why you may want to run a rule multiple times
Running rules to fixed point means that each rule can be simple and self-contained, and yet still eventually have larger global effects on a tree.
-
In our experience, functional transformations on immutable trees make the whole optimizer very easy to reason about and debug. They also enable parallelization in the optimizer, although we do not yet exploit this.
In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost.
All other phases are purely rule-based. Each phase uses different types of tree nodes; Catalyst includes libraries of nodes for expressions, data types, and logical and physical operators.
-
Does Catalyst currently have the capability to generate multiple physical plans? You had mentioned at TtT last week that costing is done eagerly to prune branches that are not allowed (aka greedy algorithm).
-
Greedy Algorithm works by making the decision that seems most promising at any moment; it never reconsiders this decision, whatever situation may arise later.
A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.
Spark SQL begins with a relation to be computed, either from an abstract syntax tree (AST) returned by a SQL parser, or from a DataFrame object constructed using the API.
A syntax tree is a tree representation of the structure of the source code. Each node of the tree denotes a construct occurring in the source code.
The syntax is "abstract" in not representing every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
-
An abstract syntax tree for the following code for the Euclidean algorithm: while b ≠ 0 { if a > b then a := a − b else b := b − a }; return a
- -
A trivial [[Analyzer]] with an [[EmptyCatalog]] and [[EmptyFunctionRegistry]]. Used for testing when all relations are already filled in and the analyzer needs only to resolve attribute references.
Provides a logical query plan analyzer, which translates [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s into fully typed objects using information in a schema [[Catalog]] and a [[FunctionRegistry]].
Note, this is not cost based. Cost-based optimization is performed by generating multiple plans using rules, and then computing their costs.
The framework supports broader use of cost-based optimization, however, as costs can be estimated recursively for a whole tree using a rule. We thus intend to implement richer cost-based optimization in the future.
As a simple example, consider the Add, Attribute and Literal tree nodes introduced in Section 4.2, which allowed us to write expressions such as (x+y)+1.
Without code generation, such expressions would have to be interpreted for each row of data, by walking down a tree of Add, Attribute and Literal nodes. This introduces large amounts of branches and virtual function calls that slow down execution. With code generation, we can write a function to translate a specific expression tree to a Scala AST as follows:
The strings beginning with q are quasiquotes, meaning that although they look like strings, they are parsed by the Scala compiler at compile time and represent ASTs for the code within. Quasiquotes can have variables or other ASTs spliced into them, indicated using $ notation. For example, Literal(1) would become the Scala AST for 1, while Attribute("x") becomes row.get("x"). In the end, a tree like Add(Literal(1), Attribute("x")) becomes an AST for a Scala expression like 1+row.get("x").
Sometimes you want to call complex functions to do additional work inside of the SQL queries.
UDFs can be inlined in the DataFrame code
UDF zipToCity just invokes a lambda function that takes a zipCode and does some custom logic to figure out which city the zip code is located in.
I have a function called add demographics, which takes a data frame w/ a user ID and will automatically compute a bunch of demographic information.
So, we do a join based on UserID and then adds a new column with the .withColumn… the UDF results will add a new column.
This def returns a new DataFrame
All of this is lazy, so SparkSQL can do optimizations much later.
For this type of machine learning I’m doing, I may only need the ts column from San Francisco.
Note that add_demographics does not have extra functionality to filter down to just SF and ts.
So maybe add_demographics was by my co-worker and I just want to use it.
So we construct a logical query plan.
Since this planning is happening at the logical level, optimizations can even occur across function calls, as shown in the example below.
In this example, Spark SQL is able to push the filtering of users by their location through the join, greatly reducing its cost to execute. This optimization is possible even though the original author of the add_demographics function did not provide a parameter for specifying how to filter users!
-
Ideally we want to filter the users ahead of time based on the extra predicates, and only do the join on the relevant users.
Even cooler, if I want to optimize this later on…
So, here I changed to a partitioned hive table (users) and also used parquet instead of JSON (events).
Now with Parquet, SparkSQL notices there are new optimizations that it can now do
Idea of Project Tungsten is to reimagine the execution engine for SparkSQL
As a user, when you move to 1.5 you will see significant robustness and speed improvements
Before you had to resort to Hive UDFs or drop into SQL…
But now there are 100+ native functions added that at runtime Java bytecode will be constructed to evaluate whatever you need.
Pass a physical plan generated by Catalyst into Streaming