The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
Deep Dive: Spark DataFrames, SQL and Catalyst Optimizer - Sachin Aggarwal
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
Spark, the ultra-fast, general purpose big data computing platform provides some very flexible options for processing and accessing data. In a previous meetup we covered PySpark and the Schema RDD. In this session we reviewed and expanded on this, with an in-depth exploration of Spark SQL.
- Overview of Spark in the Hadoop ecosystem
- Deep dive into Spark SQL with step by steps on how to implement and use it
If you have questions about the presentation or want to learn more about our services, please visit our website: http://casertaconcepts.com/
A concentrated look at Apache Spark's library Spark SQL, including background information and numerous Scala code examples of using Spark SQL with CSV, JSON, and databases such as MySQL.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets and Streaming - by Michael Armbrust (Databricks)
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Jump Start into Apache® Spark™ and Databricks - Databricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of detail of how Spark distributes the data within the cluster. You'll also find out how to work around common errors and even handle the trickiest corner cases we've encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how that abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Enabling exploratory data science with Spark and R - Databricks
R is a favorite language of many data scientists. In addition to a language and runtime, R has a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode, most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported in R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. We will present an overview of the SparkR architecture, including how data and control are transferred between R and the JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R, and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.
You've seen the basic 2-stage example Spark Programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
Beyond SQL: Speeding up Spark with DataFrames - Databricks
In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.
Ayasdi presentation in Intel's pavilion at Strata 2015 (San Jose), highlighting Ayasdi's approach to analyzing large, complex data and our integration into the Hadoop ecosystem.
One key benefit of Hadoop is its ability to support multiple access frameworks, so users of all types always have access to the best tool for the job. These purpose-built frameworks can open up a variety of powerful use cases, but only if you know which one to use and when.
Join us to discuss the three "SQL-on-Hadoop" frameworks available in Cloudera's platform and learn:
- What audiences and use cases Hive, Impala, and SparkSQL are best for
- Why an interactive SQL tool like Impala is still essential for BI
- Future work planned for these "SQL-on-Hadoop" technologies
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
It was presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches to processing Big Data on a Spark cluster for real-time analytics, machine learning, and iterative BI, and also discuss the pros and cons of using Spark in the Azure cloud.
Event: #SE2016
Stage: IoT & BigData
Date: 2 September 2016
Speaker: Vitalii Bondarenko
Topic: HDInsight Spark. Advanced in-memory Big Data analytics with Microsoft Azure
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
Jump Start with Apache Spark 2.0 on Databricks - Anyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its three main themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Introducing Apache Spark's DataFrames and Dataset APIs workshop series - Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Jumpstart on Apache Spark 2.2 on Databricks - Databricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate; not for advanced Spark users.
Prerequisite: You will need a laptop with the Chrome or Firefox browser installed and at least 8 GB of RAM. Introductory or basic knowledge of Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks - Databricks
Apache Spark 2.0 and the subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its three main themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate; not for advanced Spark users.
Prerequisite: You will need a laptop with the Chrome or Firefox browser installed and at least 8 GB of RAM. Introductory or basic knowledge of Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Spark Summit EU 2015: Lessons from 300+ production users - Databricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect and how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL/DataFrames, and MLlib.
2. ● Located in Encinitas, CA & Austin, TX
● We work on a technology called Data Algebra
● We hold nine patents in this technology
● Create turnkey performance enhancement for db engines
● We’re working on a product called Algebraix Query Accelerator
● The first public release of the product focuses on Apache Spark
● The product will be available in Amazon Web Services and Azure
About Algebraix
3. About Me
● I’m a long time developer turned product person
● I love music (playing music and listening to music)
● I grew up in Palm Springs, CA
● Development: Full stack developer, UX developer, API dev, etc
● Product: worked in UX Lead, Director Product Innovation, VP Product
● Certs: AWS Solutions Architect, AWS Developer
● Education: UCI - Political Science, Chemistry
4. About You
● Does anyone here use:
○ Spark or Spark SQL? In Production?
○ Amazon Web Services?
○ SQL Databases?
○ NoSQL Databases?
○ Distributed Databases?
● What brings you in here?
○ Want to become a rockstar at work?
○ Trying to do analytics at a larger scale?
○ Want exposure to the new tools in the industry?
5. Goals of This Talk
For you to be able to get up and running quickly with real world data and produce
applications, while avoiding major pitfalls that most newcomers to Spark SQL
would encounter.
6. 1. Spark SQL use cases
2. Loading data: in the cloud vs locally, RDDs vs DataFrames
3. SQL vs the DataFrame API. What’s the difference?
4. Schemas: implicit vs explicit schemas, data types
5. Loading & saving results
6. What SQL functionality works and what doesn’t?
7. Using SQL for ETL
8. Working with JSON data
9. Reading and writing to external SQL databases
10. Testing your work in the real world
Spark SQL: 10 things you should know
7. Notes
>>>
When I write “>>>” it means that I’m talking to the
pyspark console and what follows immediately
afterward is the output.
How to read what
I’m writing
8. Notes
>>>
When I write >>> it means that I’m talking to the
pyspark console and what follows immediately
afterward is the output.
How to read what
I’m writing
>>> aBunchOfData.count()
901150
9. Notes
>>>
When I write >>> it means that I’m talking to the
pyspark console and what follows immediately
afterward is the output.
How to read what
I’m writing
>>> aBunchOfData.count()
901150
Otherwise, I’m just demonstrating pyspark code.
11. Notes
afterward is the output.
How to read what
I’m writing
>>> aBunchOfData.count()
901150
Otherwise, I’m just demonstrating pyspark code.
When I show you this guy ^^ it means we’re done
with this section, and feel free to ask questions
before we move on. =D
14. Spark SQL
Use Cases
What, when, why
Ad-hoc
querying of
data in files
ETL
capabilities
alongside
familiar
SQL
1
15. Spark SQL
Use Cases
What, when, why
Ad-hoc
querying of
data in files
ETL
capabilities
alongside
familiar
SQL
Live SQL
analytics
over
streaming
data
1
16. Spark SQL
Use Cases
What, when, why
Ad-hoc
querying of
data in files
Interaction
with
external
Databases
ETL
capabilities
alongside
familiar
SQL
Live SQL
analytics
over
streaming
data
1
19. Loading
Data
You can load data directly into a DataFrame, and
begin querying it relatively quickly. Otherwise
you’ll need to load data into an RDD and
transform it first.
Cloud vs local,
RDD vs DF
2
20. Loading
Data
You can load data directly into a DataFrame, and
begin querying it relatively quickly. Otherwise
you’ll need to load data into an RDD and
transform it first.
Cloud vs local,
RDD vs DF
# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile("file:/var/log/system.log")
allSysLogs = sc.textFile("file:/var/log/system.log*")
allLogs = sc.textFile("file:/var/log/*.log")
2
21. Loading
Data
Cloud vs local,
RDD vs DF
# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile(“file:/var/log/system.log”)
allSysLogs = sc.textFile(“file:/var/log/system.log*”)
allLogs = sc.textFile(“file:/var/log/*.log”)
# lets count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254
You can load data directly into a DataFrame, and
begin querying it relatively quickly. Otherwise
you’ll need to load data into an RDD and
transform it first.
2
22. Loading
Data
Cloud vs local,
RDD vs DF
# loading data into an RDD in Spark 2.0
sc = spark.sparkContext
oneSysLog = sc.textFile(“file:/var/log/system.log”)
allSysLogs = sc.textFile(“file:/var/log/system.log*”)
allLogs = sc.textFile(“file:/var/log/*.log”)
# lets count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254
That’s great, but you can’t query this. You’ll need
to convert the data to Rows, add a schema, and
convert it to a dataframe.
you’ll need to load data into an RDD and
transform it first.
2
23. Loading
Data
Cloud vs local,
RDD vs DF
allLogs = sc.textFile(“file:/var/log/*.log”)
# lets count the lines in each RDD
>>> oneSysLog.count()
8339
>>> allSysLogs.count()
47916
>>> allLogs.count()
546254
That’s great, but you can’t query this. You’ll need
to convert the data to Rows, add a schema, and
convert it to a dataframe.
# import Row, map the rdd, and create dataframe
from pyspark.sql import Row
sc = spark.sparkContext
allSysLogs = sc.textFile(“file:/var/log/system.log*”)
logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow))
logsDF = spark.createDataFrame(logsRDD)
2
24. Loading
Data
Cloud vs local,
RDD vs DF
>>> allLogs.count()
546254
That’s great, but you can’t query this. You’ll need
to convert the data to Rows, add a schema, and
convert it to a dataframe.
# import Row, map the rdd, and create dataframe
from pyspark.sql import Row
sc = spark.sparkContext
allSysLogs = sc.textFile(“file:/var/log/system.log*”)
logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow))
logsDF = spark.createDataFrame(logsRDD)
Once the data is converted to at least a
DataFrame with a schema, now you can talk SQL
to the data.
2
25. Loading
Data
Cloud vs local,
RDD vs DF
from pyspark.sql import Row
sc = spark.sparkContext
allSysLogs = sc.textFile(“file:/var/log/system.log*”)
logsRDD = allSysLogs.map(lambda logRow: Row(log=logRow))
logsDF = spark.createDataFrame(logsRDD)
Once the data is converted to at least a
DataFrame with a schema, now you can talk SQL
to the data.
# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView(“logs”)
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
2
26. Loading
Data
Cloud vs local,
RDD vs DF
Once the data is converted to at least a
DataFrame with a schema, now you can talk SQL
to the data.
# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView(“logs”)
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
But, you can also load certain types of data and
store it directly as a DataFrame. This allows you
to get to SQL quickly.
2
27. Loading
Data
Cloud vs local,
RDD vs DF
# write some SQL
logsDF = spark.createDataFrame(logsRDD)
logsDF.createOrReplaceTempView(“logs”)
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
But, you can also load certain types of data and
store it directly as a DataFrame. This allows you
to get to SQL quickly.
Both JSON and Parquet formats can be loaded
as a DataFrame straightaway because they
contain enough schema information to do so.
2
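For the JSON half of that claim, a minimal sketch (the file path is invented for illustration):
# load line-delimited JSON straight into a DF, and write some SQL
logsDF = spark.read.json("file:/logs.json")
logsDF.createOrReplaceTempView("logs")
spark.sql("SELECT * FROM logs LIMIT 1").show()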
28. Loading
Data
Cloud vs local,
RDD vs DF
# load parquet straight into DF, and write some SQL
logsDF = spark.read.parquet(“file:/logs.parquet”)
logsDF.createOrReplaceTempView(“logs”)
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
But, you can also load certain types of data and
store it directly as a DataFrame. This allows you
to get to SQL quickly.
Both JSON and Parquet formats can be loaded
as a DataFrame straightaway because they
contain enough schema information to do so.
2
29. Loading
Data
Cloud vs local,
RDD vs DF
# load parquet straight into DF, and write some SQL
logsDF = spark.read.parquet(“file:/logs.parquet”)
logsDF.createOrReplaceTempView(“logs”)
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| row|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
Both JSON and Parquet formats can be loaded
as a DataFrame straightaway because they
contain enough schema information to do so.
In fact, now they even have support for querying
parquet files directly! Easy peasy!
2
30. Loading
Data
Cloud vs local,
RDD vs DF
>>> spark.sql(“SELECT * FROM logs LIMIT 1”).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
In fact, now they even have support for querying
parquet files directly! Easy peasy!
# load parquet straight into DF, and write some SQL
>>> spark.sql(“””
SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1
”””).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
2
31. Loading
Data
Cloud vs local,
RDD vs DF
In fact, now they even have support for querying
parquet files directly! Easy peasy!
# load parquet straight into DF, and write some SQL
>>> spark.sql(“””
SELECT * FROM parquet.`path/to/logs.parquet` LIMIT 1
”””).show()
+--------------------+
| log|
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
That’s one aspect of loading data. The other
aspect is using the protocols for cloud storage
(i.e. s3://). In some cloud ecosystems, support for
their storage protocol comes installed already.
2
32. Loading
Data
Cloud vs local,
RDD vs DF
+--------------------+
|Jan 6 16:37:01 (...|
+--------------------+
That’s one aspect of loading data. The other
aspect is using the protocols for cloud storage
(i.e. s3://). In some cloud ecosystems, support for
their storage protocol comes installed already.
# i.e. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile(“s3://acme-co/logs/2016/12/”)
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.
2
33. Loading
Data
Cloud vs local,
RDD vs DF
aspect is using the protocols for cloud storage
(i.e. s3://). In some cloud ecosystems, support for
their storage protocol comes installed already.
# i.e. on AWS EMR, s3:// is installed already.
sc = spark.sparkContext
decemberLogs = sc.textFile(“s3://acme-co/logs/2016/12/”)
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.
Sometimes you actually need to provide support
for those protocols if your VM’s OS doesn’t have
it already.
2
34. Loading
Data
Cloud vs local,
RDD vs DF
# count the lines in all of the december 2016 logs in S3
>>> decemberLogs.count()
910125081250
# wow, such logs. Ur poplar.
Sometimes you actually need to provide support
for those protocols if your VM’s OS doesn’t have
it already.
my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 demo2.py
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.
2
35. Loading
Data
Cloud vs local,
RDD vs DF
Sometimes you actually need to provide support
for those protocols if your VM’s OS doesn’t have
it already.
my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.
Now you should have several ways to load data
to quickly start writing SQL with Apache Spark.
2
36. Loading
Data
Cloud vs local,
RDD vs DF
my-linux-shell$ pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1
>>> rdd = sc.textFile("s3a://acme-co/path/to/files")
rdd.count()
# note: "s3a" and not "s3" -- "s3" is specific to AWS EMR.
Now you should have several ways to load data
to quickly start writing SQL with Apache Spark.
2
50. Schemas
Inferred
vs
explicit
Schemas can be inferred, i.e. guessed, by spark.
With inferred schemas, you usually end up with a
bunch of strings and ints. If you have more
specific needs, supply your own schema.
4
51. Schemas
Inferred
vs
explicit
Schemas can be inferred, i.e. guessed, by spark.
With inferred schemas, you usually end up with a
bunch of strings and ints. If you have more
specific needs, supply your own schema.
# sample data - “people.txt”
1|Kristian|Algebraix Data|San Diego|CA
2|Pat|Algebraix Data|San Diego|CA
3|Lebron|Cleveland Cavaliers|Cleveland|OH
4|Brad|Self Employed|Hollywood|CA
4
52. Schemas
Inferred
vs
explicit
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile(“file:/people.txt”)
def mapper(line):
s = line.split(“|”)
return Row(id=s[0],name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)
With inferred schemas, you usually end up with a
bunch of strings and ints. If you have more
specific needs, supply your own schema.
# sample data - “people.txt”
1|Kristian|Algebraix Data|San Diego|CA
2|Pat|Algebraix Data|San Diego|CA
3|Lebron|Cleveland Cavaliers|Cleveland|OH
4|Brad|Self Employed|Hollywood|CA
4
53. Schemas
Inferred
vs
explicit
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile(“file:/people.txt”)
def mapper(line):
s = line.split(“|”)
return Row(id=s[0],name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)
3|Lebron|Cleveland Cavaliers|Cleveland|OH
4|Brad|Self Employed|Hollywood|CA
# we didn’t actually pass anything into that 2nd param.
# yet, behind the scenes, there’s still a schema.
>>> peopleDF.printSchema()
Root
|-- company: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- state: string (nullable = true)
4
54. Schemas
Inferred
vs
explicit
def mapper(line):
s = line.split(“|”)
return Row(id=s[0],name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
# full syntax: .createDataFrame(peopleRDD, schema)
# we didn’t actually pass anything into that 2nd param.
# yet, behind the scenes, there’s still a schema.
>>> peopleDF.printSchema()
Root
|-- company: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- state: string (nullable = true)
Spark SQL can certainly handle queries where
`id` is a string, but it should be an int.
4
55. Schemas
Inferred
vs
explicit
Spark SQL can certainly handle queries where
`id` is a string, but what if we don’t want it to be?
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile(“file:/people.txt”)
def mapper(line):
s = line.split(“|”)
return Row(id=int(s[0]),name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
Root
|-- company: string (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- state: string (nullable = true)
4
56. Schemas
Inferred
vs
explicit
# load as RDD and map it to a row with multiple fields
rdd = sc.textFile(“file:/people.txt”)
def mapper(line):
s = line.split(“|”)
return Row(id=int(s[0]),name=s[1],company=s[2],state=s[4])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD)
>>> peopleDF.printSchema()
Root
|-- company: string (nullable = true)
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- state: string (nullable = true)
You can actually provide a schema, too, which
will be more authoritative.
4
57. Schemas
Inferred
vs
explicit
You can actually provide a schema, too, which
will be more authoritative.
# load as RDD and map it to a row with multiple fields
import pyspark.sql.types as types
rdd = sc.textFile(“file:/people.txt”)
def mapper(line):
s = line.split(“|”)
return Row(id=s[0],name=s[1],company=s[2],state=s[4])
schema = types.StructType([
types.StructField(‘id',types.IntegerType(), False)
,types.StructField(‘name',types.StringType())
,types.StructField(‘company',types.StringType())
,types.StructField(‘state',types.StringType())
])
peopleRDD = rdd.map(mapper)
peopleDF = spark.createDataFrame(peopleRDD, schema)
4
61. # Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql(“””
SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01”
“””).show()
Schemas
available types
And what are the available types?
# http://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/types.html
__all__ = ["DataType", "NullType", "StringType",
"BinaryType", "BooleanType", "DateType", "TimestampType",
"DecimalType", "DoubleType", "FloatType", "ByteType",
"IntegerType", "LongType", "ShortType", "ArrayType",
"MapType", "StructField", "StructType"]
**Gotcha alert** Spark doesn’t seem to care
when you leave dates as strings.
4
62. # Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql(“””
SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01”
“””).show()
Schemas
available types
"BinaryType", "BooleanType", "DateType", "TimestampType",
"DecimalType", "DoubleType", "FloatType", "ByteType",
"IntegerType", "LongType", "ShortType", "ArrayType",
"MapType", "StructField", "StructType"]
**Gotcha alert** Spark doesn’t seem to care
when you leave dates as strings.
Now you know about inferred and explicit
schemas, and the available types you can use.
4
63. # Spark SQL handles this just fine as if they were
# legit date objects.
spark.sql(“””
SELECT * FROM NewHires n WHERE n.start_date > “2016-01-01”
“””).show()
Schemas
available types
**Gotcha alert** Spark doesn’t seem to care
when you leave dates as strings.
Now you know about inferred and explicit
schemas, and the available types you can use.
4
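If you do want real dates rather than strings, one option is to cast the column explicitly; a minimal sketch, assuming a NewHires DataFrame (hypothetical name) whose start_date column is currently a string:
# hypothetical: convert the string column into a proper DateType column
from pyspark.sql.functions import to_date
newHires = newHires.withColumn("start_date", to_date(newHires["start_date"]))
newHires.createOrReplaceTempView("NewHires")
newHires.printSchema()  # start_date should now show up as date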
66. Loading &
Saving
Results
Types,
performance,
& considerations
# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet",
format="parquet") # format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
Loading and saving is fairly straightforward.
Save your DataFrames in your desired format.
5
67. Loading &
Saving
Results
Types,
performance,
& considerations
# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet",
format="parquet") # format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
Loading and saving is fairly straightforward.
Save your DataFrames in your desired format.
When you read, some types preserve schema.
Parquet keeps the full schema, JSON has
inferrable schema, and JDBC pulls in schema.
5
68. # read from stored parquet
peopleDF = spark.read.parquet(“s3://acme-co/people.parquet”)
# read from stored JSON
peopleDF = spark.read.json(“s3://acme-co/people.json”)
Loading &
Saving
Results
Types,
performance,
& considerations
# picking up where we left off
peopleDF = spark.createDataFrame(peopleRDD, schema)
peopleDF.write.save("s3://acme-co/people.parquet",
format="parquet") # format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
When you read, some types preserve schema.
Parquet keeps the full schema, JSON has
inferrable schema, and JDBC pulls in schema.
5
69. # read from stored parquet
peopleDF = spark.read.parquet(“s3://acme-co/people.parquet”)
# read from stored JSON
peopleDF = spark.read.json(“s3://acme-co/people.json”)
Loading &
Saving
Results
Types,
performance,
& considerations
format="parquet") # format= defaults to parquet if omitted
# formats: json, parquet, jdbc, orc, libsvm, csv, text
When you read, some types preserve schema.
Parquet keeps the full schema, JSON has
inferrable schema, and JDBC pulls in schema.
5
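For the formats without an embedded schema, such as CSV, a minimal sketch (the path and options are illustrative):
# CSV carries no schema, so pass header/inferSchema hints when reading it back
peopleDF.write.csv("s3://acme-co/people.csv", header=True, mode="overwrite")
peopleCSV = spark.read.csv("s3://acme-co/people.csv", header=True, inferSchema=True)
peopleCSV.printSchema()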
71. ● Limited support for subqueries and various
other noticeable SQL functionalities
● Runs roughly half of the 99 TPC-DS
benchmark queries
● More SQL support in HiveContext
SQL
Function
Coverage
Spark 1.6
6
Spark 1.6
http://www.slideshare.net/databricks/apache-spark-20-faster-easier-and-smarter
72. SQL
Function
Coverage
Spark 2.0
6
Spark 2.0
In DataBricks’ Words
● SQL2003 support
● Runs all 99 of TPC-DS benchmark queries
● A native SQL parser that supports both
ANSI-SQL as well as Hive QL
● Native DDL command implementations
● Subquery support, including
○ Uncorrelated Scalar Subqueries
○ Correlated Scalar Subqueries
○ NOT IN predicate Subqueries (in
WHERE/HAVING clauses)
○ IN predicate subqueries (in
WHERE/HAVING clauses)
○ (NOT) EXISTS predicate subqueries
(in WHERE/HAVING clauses)
● View canonicalization support
● In addition, when building without Hive
support, Spark SQL should have almost all
the functionality as when building with Hive
support, with the exception of Hive
connectivity, Hive UDFs, and script
transforms.
http://www.slideshare.net/databricks/apache-spark-20-faster-easier-and-smarter
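As a concrete illustration of the new subquery support, a minimal sketch (the table names are made up):
# hypothetical tables: an IN-predicate subquery in a WHERE clause, supported in Spark 2.0
spark.sql("""
SELECT name FROM people
WHERE id IN (SELECT id FROM new_hires)
""").show()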
75. Using
Spark SQL
for ETL
Tips and tricks
7
These are some of the things we learned
to start doing after working with Spark for a while.
76. Using
Spark SQL
for ETL
Tips and tricks
7
Tip 1: In production, break your applications into
smaller apps as steps. I.e. “Pipeline pattern”
These are some of the things we learned
to start doing after working with Spark for a while.
77. Using
Spark SQL
for ETL
Tips and tricks
7
Tip 2: When tinkering locally, save a small
version of the dataset via spark and test against
that.
These are some of the things we learned
to start doing after working with Spark for a while.
Tip 1: In production, break your applications into
smaller apps as steps. I.e. “Pipeline pattern”
78. Using
Spark SQL
for ETL
Tips and tricks
7
These are some of the things we learned
to start doing after working with Spark for a while.
Tip 1: In production, break your applications into
smaller apps as steps. I.e. “Pipeline pattern”
Tip 2: When tinkering locally, save a small
version of the dataset via spark and test against
that.
Tip 3: If using EMR, create a cluster with your
desired steps, prove it works, then export a CLI
command to reproduce it, and run it in Data
Pipeline to start recurring pipelines / jobs.
79. Using
Spark SQL
for ETL
Tips and tricks
7
Tip 3: If using EMR, create a cluster with your
desired steps, prove it works, then export a CLI
command to reproduce it, and run it in Data
Pipeline to start recurring pipelines / jobs.
Tip 1: In production, break your applications into
smaller apps as steps. I.e. “Pipeline pattern”
Tip 2: When tinkering locally, save a small
version of the dataset via spark and test against
that.
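For Tip 2, a minimal sketch of carving off a small sample to test against (the DataFrame name and paths are placeholders):
# sample roughly 1% of the data and save it for fast local tests
smallDF = bigDF.sample(False, 0.01, seed=42)
smallDF.write.parquet("file:/tmp/small-sample.parquet")
# later, point your tests at the small copy
testDF = spark.read.parquet("file:/tmp/small-sample.parquet")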
81. Working
with JSON
Quick and easy
8
JSON data is most
easily read-in as
line-delimited JSON
objects*
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
82. Working
with JSON
Quick and easy
8
JSON data is most
easily read-in as
line-delimited JSON
objects*
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
Schema is
inferred upon
load. Unlike
other lazy
operations, this
will cause some
work to be
done.
83. Working
with JSON
Quick and easy
8
JSON data is most
easily read-in as
line-delimited JSON
objects*
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
Access arrays
with inline array
syntax
SELECT
col[1], col[3]
FROM json
Schema is
inferred upon
load. Unlike
other lazy
operations, this
will cause some
work to be
done.
84. Working
with JSON
Quick and easy
8
JSON data is most
easily read-in as
line-delimited JSON
objects*
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
If you want to
flatten your
JSON data, use
the explode
method
(works in both DF API
and SQL)
Access arrays
with inline array
syntax
SELECT
col[1], col[3]
FROM json
Schema is
inferred upon
load. Unlike
other lazy
operations, this
will cause some
work to be
done.
85. Working
with JSON
Quick and easy
8
JSON data must be
read in as line-delimited
JSON objects
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
If you want to
flatten your
JSON data, use
the explode
method
(works in both DF API
and SQL)
# json explode example
>>> spark.read.json("file:/json.explode.json").createOrReplaceTempView("json")
>>> spark.sql("SELECT * FROM json").show()
+----+----------------+
| x| y|
+----+----------------+
|row1| [1, 2, 3, 4, 5]|
|row2|[6, 7, 8, 9, 10]|
+----+----------------+
>>> spark.sql("SELECT x, explode(y) FROM json").show()
+----+---+
| x|col|
+----+---+
|row1| 1|
|row1| 2|
|row1| 3|
|row1| 4|
|row1| 5|
|row2| 6|
|row2| 7|
|row2| 8|
|row2| 9|
|row2| 10|
+----+---+
87. Working
with JSON
Quick and easy
8
JSON data is most
easily read-in as
line-delimited JSON
objects*
{“n”:”sarah”,”age”:29}
{“n”:”steve”,”age”:45}
If you want to
flatten your
JSON data, use
the explode
method
(works in both DF API
and SQL)
Access
nested-objects
with dot syntax
SELECT
field.subfield
FROM json
Access arrays
with inline array
syntax
SELECT
col[1], col[3]
FROM json
Schema is
inferred upon
load. Unlike
other lazy
operations, this
will cause some
work to be
done.
89. Working
with JSON
Quick and easy
8 If you want to
flatten your
JSON data, use
the explode
method
(works in both DF API
and SQL)
Access
nested-objects
with dot syntax
SELECT
field.subfield
FROM json
Access arrays
with inline array
syntax
SELECT
col[1], col[3]
FROM json
Schema is
inferred upon
load. Unlike
other lazy
operations, this
will cause some
work to be
done.
For multi-line JSON files,
you’ve got to do much more:
# a list of data from files.
files = sc.wholeTextFiles("data.json")
# each tuple is (path, jsonData)
rawJSON = files.map(lambda x: x[1])
# sanitize the data (requires `import re`)
cleanJSON = rawJSON.map(
lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE)
)
# finally, you can then read that in as "JSON"
spark.read.json(cleanJSON)
# PS -- the same goes for XML.
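A minimal sketch of the dot and array syntax above, against a hypothetical nested JSON file:
# hypothetical file: lines like {"name":"sarah","address":{"city":"San Diego","state":"CA"},"scores":[1,2,3]}
spark.read.json("file:/people.nested.json").createOrReplaceTempView("people_json")
spark.sql("""
SELECT name, address.city, scores[0]
FROM people_json
""").show()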
94. External
Databases
Reading and
Writing
9
To read from an external database, you’ve got to
have your JDBC connectors (jar) handy. In order
to pass a jar package into spark, you’d use the
--jars flag when starting pyspark or spark-submit.
95. External
Databases
Reading and
Writing
9
To read from an external database, you’ve got to
have your JDBC connectors (jar) handy. In order
to pass a jar package into spark, you’d use the
--jars flag when starting pyspark or spark-submit.
# starting pyspark with your JDBC connector jar
my-linux-shell$ pyspark
--jars /path/to/mysql-jdbc.jar
--packages
# note: you can also add the path to your jar in the
spark-defaults.conf file via these settings:
spark.driver.extraClassPath
spark.executor.extraClassPath
96. Once you’ve got your connector jars successfully
imported, now you can read an existing database
into your spark application or spark shell as a
dataframe.
External
Databases
Reading and
Writing
9
have your JDBC connectors (jar) handy. In order
to pass a jar package into spark, you’d use the
--jars flag when starting pyspark or spark-submit.
# starting pyspark with your JDBC connector jar
my-linux-shell$ pyspark
--jars /path/to/mysql-jdbc.jar
--packages
# note: you can also add the path to your jar in the
spark-defaults.conf file via these settings:
spark.driver.extraClassPath
spark.executor.extraClassPath
97. Once you’ve got your connector jars successfully
imported, now you can read an existing database
into your spark application or spark shell as a
dataframe.
External
Databases
Reading and
Writing
9
spark.driver.extraClassPath
spark.executor.extraClassPath
# line broken for readability
sqlURL = "jdbc:mysql://<db-host>:<port>
?user=<user>
&password=<pass>
&rewriteBatchedStatements=true
&continueBatchOnError=true"
df = spark.read.jdbc(url=sqlURL, table="<db>.<table>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
98. dataframe.
External
Databases
Reading and
Writing
9
# line broken for readability
sqlURL = "jdbc:mysql://<db-host>:<port>
?user=<user>
&password=<pass>
&rewriteBatchedStatements=true # omg use this.
&continueBatchOnError=true" # increases performance.
df = spark.read.jdbc(url=sqlURL, table="<db>.<table>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
If you’ve done some work and built created or
manipulated a dataframe, you can write it to a
database by using the spark.read.jdbc
method. Be prepared, it can a while.
99. External
Databases
Reading and
Writing
9
df = spark.read.jdbc(url=sqlURL, table="<db>.<table>")
df.createOrReplaceTempView("myDB")
spark.sql("SELECT * FROM myDB").show()
If you've done some work and built or
manipulated a dataframe, you can write it to a
database by using the df.write.jdbc
method. Be prepared, it can take a while.
Also, be warned, save modes in spark can be a
bit destructive. “Overwrite” doesn’t just overwrite
your data, it overwrites your schemas too.
Say goodbye to those precious indices.
100. External
Databases
Reading and
Writing
9 If you’ve done some work and built created or
manipulated a dataframe, you can write it to a
database by using the spark.read.jdbc
method. Be prepared, it can a while.
Also, be warned, save modes in spark can be a
bit destructive. “Overwrite” doesn’t just overwrite
your data, it overwrites your schemas too.
Say goodbye to those precious indices.
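A minimal sketch of the write side (connection details are the same placeholders as above), with an explicit save mode:
# "append" adds rows; "overwrite" drops and recreates the table (and its indices)
df.write.jdbc(url=sqlURL, table="<db>.<new_table>", mode="append")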
102. Real World
Testing
Things to consider
If testing locally,
do not load data
from S3 or other
similar types of
cloud storage.
10
103. Real World
Testing
Things to consider
If testing locally,
do not load data
from S3 or other
similar types of
cloud storage.
10
Construct your
applications as
much as you
can in advance.
Cloudy clusters
are expensive.
104. Real World
Testing
Things to consider
If testing locally,
do not load data
from S3 or other
similar types of
cloud storage.
10
Construct your
applications as
much as you
can in advance.
Cloudy clusters
are expensive.
you can test a
lot of your code
reliably with a
1-node cluster.
105. Real World
Testing
Things to consider
If testing locally,
do not load data
from S3 or other
similar types of
cloud storage.
10
Construct your
applications as
much as you
can in advance.
Cloudy clusters
are expensive.
you can test a
lot of your code
reliably with a
1-node cluster.
Get really
comfortable
using
.parallelize() to
create dummy
data.
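A minimal sketch of the .parallelize() idea (the rows here are invented dummy data):
from pyspark.sql import Row
# build a tiny in-memory DataFrame for tests instead of reading from cloud storage
dummyRDD = sc.parallelize([Row(id=1, name="test1"), Row(id=2, name="test2")])
dummyDF = spark.createDataFrame(dummyRDD)
dummyDF.createOrReplaceTempView("dummy")
spark.sql("SELECT count(*) FROM dummy").show()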
106. Real World
Testing
Things to consider
If testing locally,
do not load data
from S3 or other
similar types of
cloud storage.
10
Construct your
applications as
much as you
can in advance.
Cloudy clusters
are expensive.
you can test a
lot of your code
reliably with a
1-node cluster.
Get really
comfortable
using
.parallelize() to
create dummy
data.
If you’re using
big data, and
many nodes,
don’t use
.collect() unless
you intend to