1
© 2016, Conversant, LLC. All rights reserved.
SCALA + SQL = UNION OF TWO EQUALS
APACHE BIG DATA NORTH AMERICA 2017 PRESENTED BY:
JAYESH THAKRAR
SENIOR SOFTWARE ENGINEER
2
SCALA + SQL = UNION OF TWO EQUALS
1. What does it mean?
Why does it matter?
2. Is Scala == SQL?
Show me!
3. Final Notes...
3
Scala == SQL
What and Why?
4
JDBC / ODBC
Host Language: C, Java, Python, etc.
Embedded SQL
• SQL = Set-level processing
• Host Language = Row-level, iterative processing using a single iterator (like drinking from a firehose!)
5
SPARK FRAMEWORK
SCALA
R
Python
SQL
• Common practice: Scala, Python, R for data in / out; SQL for data processing
• Scala = Spark Scala API
• Scala fuses seamlessly with Spark DSL
• Can munge SQL result as a distributed collection
6
WHY USE SCALA API?
• It's cool!
• Compact code and often more intuitive & self-explanatory
• Additional help from IDEs when using Scala API
• Chained methods / pipelines help visualize processing flow
• Ability to mix-and-match Scala API and SQL
• Ability to do "in-line-processing" using lambda functions
• Scala implicits allow further extending Scala API
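The last three bullets can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Flight` case class (its fields mirror the flight schema shown later in this talk), the sample rows, the `FlightOps` implicit class and its `inYear` method are hypothetical names, not code from the presentation.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names mirror the flight schema used in this talk.
case class Flight(year: Int, month: Int, domestic_flights: Long,
                  international_flights: Long, total_flights: Long)

object MixAndMatch {
  // Hypothetical extension method, added to Dataset[Flight] through an implicit class.
  implicit class FlightOps(ds: Dataset[Flight]) {
    def inYear(y: Int): Dataset[Flight] = ds.filter((r: Flight) => r.year == y)
  }

  def domesticShares(spark: SparkSession): Seq[Double] = {
    import spark.implicits._
    // Made-up sample rows, for illustration only.
    val f = Seq(
      Flight(2015, 12, 700L, 100L, 800L),
      Flight(2016, 1, 600L, 200L, 800L),
      Flight(2016, 2, 900L, 100L, 1000L)
    ).toDS()

    f.createOrReplaceTempView("f")
    spark.sql("SELECT * FROM f WHERE total_flights > 0") // SQL step
      .as[Flight]                                        // back to a typed Dataset
      .inYear(2016)                                      // method added via the implicit class
      .map((r: Flight) => r.domestic_flights.toDouble / r.total_flights) // in-line lambda
      .collect().toSeq.sorted
  }
}
```

With a local session, `MixAndMatch.domesticShares(spark)` should return the domestic share for the two 2016 rows.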
7
GOALS OF THIS TALK
• Show Scala (API) can do the same as SQL
• Make using Scala API more comfortable
8
Is Scala == SQL?
Show me!
9
SQL AND SCALA API UNDER-THE-HOOD
Source: https://www.slideshare.net/databricks/deep-dive-into-catalyst-apache-spark-20s-optimizer?qid=8b68ce4b-5aae-4e53-adf1-5cf5259a78b6&v=&b=&from_search=1
10
SAMPLE DATA
Derived from: https://www.transtats.bts.gov/Data_Elements.aspx?Data=1
Flight Data Passenger Data
11
SAMPLE DATA SCHEMA
scala> f
res9: org.apache.spark.sql.Dataset[Flight] =
[year: int, month: int ... 3 more fields]
scala> f.printSchema
root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_flights: long (nullable = true)
|-- international_flights: long (nullable = true)
|-- total_flights: long (nullable = true)
scala> p
res11: org.apache.spark.sql.Dataset[Passenger] =
[year: int, month: int ... 3 more fields]
scala> p.printSchema
root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_passengers: long (nullable = true)
|-- international_passengers: long (nullable = true)
|-- total_passengers: long (nullable = true)
12
SQL: SELECT CLAUSE / PROJECTION
SELECT year FROM f
SELECT year AS yyyy
FROM f
f.select("year")
f.select($"year")
f.select(f("year"))
f.select(col("year"))
f.select($"year" as "yyyy")
f.withColumn("yyyy", $"year").
select($"yyyy")
13
SQL: COLUMNS
14
SQL: NEW COLUMNS, EXPRESSIONS
// arithmetic expression
f.select($"domestic_flights" / $"total_flights" as "flight_ratio")
// boolean expression
f.select($"domestic_flights" + $"international_flights" === $"total_flights")
// conditional expression
f.select($"domestic_flights" / $"international_flights" > 9.0 as "too_high")
// Adding a new column
f.withColumn("flight_ratio", $"domestic_flights" / $"international_flights")
15
SQL: CASE
SELECT year,
CASE
WHEN month = 1
THEN "Jan"
WHEN month = 2
THEN "Feb"
.....
ELSE "Error"
END
f.select($"year",
when($"month" === 1, "Jan").
when($"month" === 2, "Feb").
when($"month" === 3, "Mar").
.....
otherwise("Error")
)
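A complete, runnable variant of the `when` / `otherwise` chain might look like the sketch below. The sample rows and the `month_name` alias are assumptions for illustration, and only two months are mapped to keep it short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

object CaseExample {
  def monthNames(spark: SparkSession): Seq[(Int, String)] = {
    import spark.implicits._
    // Made-up (year, month) rows; month 13 exercises the otherwise branch.
    val f = Seq((2016, 1), (2016, 2), (2016, 13)).toDF("year", "month")

    f.select($"month",
        when($"month" === 1, "Jan").
          when($"month" === 2, "Feb").
          otherwise("Error") as "month_name").
      orderBy($"month").
      as[(Int, String)].
      collect().toSeq
  }
}
```

If `otherwise` is left out, rows that match no `when` branch get a null value, just as a SQL CASE without ELSE does.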
16
SQL: WHERE CLAUSE
WHERE month = 12
WHERE (month = 12 AND year > 2010)
OR (domestic_flights > 880000)
WHERE year IN (2004, 2007)
WHERE month IS NULL
WHERE month IS NOT NULL
f.where("month = 12") // SQL syntax
f.where($"month" === 12) // API syntax
f.where(
($"month" === 12 && $"year" > 2010)
|| f("domestic_flights") > 880000)
f.filter(f("year").isin(2004, 2007))
f.where($"month".isNull)
f.filter($"month".isNotNull)
17
SQL: AGGREGATE FUNCTIONS
SELECT SUM(domestic_flights),
AVG(domestic_flights),
COUNT(domestic_flights)
FROM f
SELECT year,
MIN(domestic_flights) as min,
MAX(domestic_flights) as max,
AVG(domestic_flights) as avg
FROM f
GROUP BY year
HAVING min > 750000
ORDER BY year
f.select( sum("domestic_flights"),
avg("domestic_flights"),
count("domestic_flights"))
f.groupBy($"year").
agg(min($"domestic_flights") as "min",
max($"domestic_flights") as "max",
avg($"domestic_flights") as "avg").
filter($"min" > 750000).
orderBy($"year")
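To make the HAVING mapping concrete: in the API, HAVING is simply a `filter` applied after `agg`. The sketch below runs that pattern on made-up rows; the data and the threshold of 750 are assumptions chosen so that one year is filtered out.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

object AggregateExample {
  def yearlyStats(spark: SparkSession): Seq[(Int, Long, Long, Double)] = {
    import spark.implicits._
    // Made-up (year, month, domestic_flights) rows.
    val f = Seq(
      (2015, 1, 700L), (2015, 2, 800L),
      (2016, 1, 900L), (2016, 2, 1000L)
    ).toDF("year", "month", "domestic_flights")

    f.groupBy($"year").
      agg(min($"domestic_flights") as "min",
          max($"domestic_flights") as "max",
          avg($"domestic_flights") as "avg").
      filter($"min" > 750).   // plays the role of HAVING min > 750
      orderBy($"year").
      as[(Int, Long, Long, Double)].
      collect().toSeq
  }
}
```

Here 2015 is dropped because its minimum (700) fails the filter, leaving only the 2016 row.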
18
SQL: JOIN – USING JOIN
scala> f.join(p, f("year") === p("year") && f("month") === p("month")).printSchema
root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_flights: long (nullable = true)
|-- international_flights: long (nullable = true)
|-- total_flights: long (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_passengers: long (nullable = true)
|-- international_passengers: long (nullable = true)
|-- total_passengers: long (nullable = true)
19
SQL: JOIN – USING JOINWITH
scala> f.joinWith(p, f("year") === p("year") && f("month") === p("month")).printSchema
root
|-- _1: struct (nullable = false)
| |-- year: integer (nullable = true)
| |-- month: integer (nullable = true)
| |-- domestic_flights: long (nullable = true)
| |-- international_flights: long (nullable = true)
| |-- total_flights: long (nullable = true)
|-- _2: struct (nullable = false)
| |-- year: integer (nullable = true)
| |-- month: integer (nullable = true)
| |-- domestic_passengers: long (nullable = true)
| |-- international_passengers: long (nullable = true)
| |-- total_passengers: long (nullable = true)
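Because `joinWith` keeps each side as a whole object (`_1`, `_2`), the result can be processed as typed pairs. A minimal sketch, assuming two reduced, hypothetical case classes (`FlightTotals`, `PassengerTotals`) and made-up rows rather than the talk's actual data:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, reduced record types for illustration.
case class FlightTotals(year: Int, month: Int, total_flights: Long)
case class PassengerTotals(year: Int, month: Int, total_passengers: Long)

object JoinWithExample {
  def passengersPerFlight(spark: SparkSession): Seq[(Int, Int, Double)] = {
    import spark.implicits._
    val f = Seq(FlightTotals(2016, 1, 100L), FlightTotals(2016, 2, 200L)).toDS()
    val p = Seq(PassengerTotals(2016, 1, 1000L), PassengerTotals(2016, 2, 3000L)).toDS()

    // joinWith yields Dataset[(FlightTotals, PassengerTotals)], so fields are
    // reached through ordinary Scala member access instead of column names.
    f.joinWith(p, f("year") === p("year") && f("month") === p("month")).
      map((pair: (FlightTotals, PassengerTotals)) => {
        val (fl, pa) = pair
        (fl.year, fl.month, pa.total_passengers.toDouble / fl.total_flights)
      }).
      collect().toSeq.sortBy(t => (t._1, t._2))
  }
}
```

With `join`, the same computation would refer to columns by name and return untyped rows.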
20
JOIN TYPES
f.joinWith(p, f("year") === p("year") && f("month") === p("month"), "inner")
Supported join types include:
'inner',
'outer',
'full',
'fullouter',
'leftouter',
'left',
'rightouter',
'right',
'leftsemi',
'leftanti'.
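The effect of a few of these join types can be compared side by side. The sketch below uses `join` with a list of key columns rather than `joinWith`, and made-up data in which only month 2 appears on both sides; both choices are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object JoinTypesExample {
  def monthsByJoinType(spark: SparkSession): Map[String, Seq[Int]] = {
    import spark.implicits._
    // Made-up rows: f has months 1 and 2, p has months 2 and 3.
    val f = Seq((2016, 1, 100L), (2016, 2, 200L)).toDF("year", "month", "total_flights")
    val p = Seq((2016, 2, 3000L), (2016, 3, 4000L)).toDF("year", "month", "total_passengers")

    Seq("inner", "leftouter", "fullouter", "leftsemi", "leftanti").map { joinType =>
      // Joining on a Seq of column names keeps a single year/month pair in the output.
      val months = f.join(p, Seq("year", "month"), joinType).
        select($"month").as[Int].collect().toSeq.sorted
      joinType -> months
    }.toMap
  }
}
```

For this data, `inner` and `leftsemi` keep only month 2, `leftanti` keeps only month 1, `leftouter` keeps months 1 and 2, and `fullouter` keeps all three.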
21
OPERATIONS ON JOIN RESULT
f.join(p, f("year") === p("year") &&
f("month") === p("month")).
groupBy(f("year")).
agg(collect_list("total_flights")
as "total_flight_list",
collect_list("total_passengers")
as "total_passengers").printSchema
root
|-- year: integer (nullable = true)
|-- total_flight_list: array (nullable = true)
| |-- element: long (containsNull = true)
|-- total_passengers: array (nullable = true)
| |-- element: long (containsNull = true)
f.joinWith(p, f("year") === p("year") &&
f("month") === p("month")).
groupBy($"_1.year").
agg(collect_list($"_1.total_flights")
as "total_flight_list",
collect_list($"_2.total_passengers")
as "total_passengers").printSchema
root
|-- year: integer (nullable = true)
|-- total_flight_list: array (nullable = true)
| |-- element: long (containsNull = true)
|-- total_passengers: array (nullable = true)
| |-- element: long (containsNull = true)
22
SQL: AGGREGATE FUNCTIONS
23
SQL: FUNCTIONS
• Aggregate functions
• Collection functions
• Date Time functions
• Math functions
• Misc. (hashing) functions
• Non-aggregate Functions
• Sorting functions
• String functions
• UDF
• Window functions
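As a small sketch of two of these categories, the example below combines a built-in string function (`lpad`) with a user-defined function. The UDF name `quarterLabel`, the column aliases and the sample rows are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._ // brings lpad, udf, etc. into scope

object FunctionsExample {
  def labels(spark: SparkSession): Seq[(Int, String, String)] = {
    import spark.implicits._
    // Made-up (year, month) rows.
    val f = Seq((2016, 1), (2016, 5), (2016, 12)).toDF("year", "month")

    // Hypothetical UDF that maps a month number to a quarter label.
    val quarterLabel = udf((month: Int) => s"Q${(month - 1) / 3 + 1}")

    f.select($"month",
        lpad($"month".cast("string"), 2, "0") as "mm", // built-in string function
        quarterLabel($"month") as "quarter").           // user-defined function
      orderBy($"month").
      as[(Int, String, String)].
      collect().toSeq
  }
}
```

Built-in functions are generally preferable where one exists, since the optimizer can see inside them but treats a UDF as a black box.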
Need to import org.apache.spark.sql.functions._ before any of these SQL functions can be used
24
SQL: SLIDING WINDOW
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"year").
orderBy($"year", $"month")
f.withColumn("month + 1",
lead($"total_flights", 1) over w).
withColumn("month + 2",
lead($"total_flights", 2) over w).
select($"year", $"month",
$"total_flights" as "current month",
$"month + 1", $"month + 2").
filter($"month".isin(1,4,7,10)).
orderBy($"year", $"month").show
Each row shows the three months of a quarter side by side
25
SQL: SLIDING WINDOW AGGREGATION
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"year").
orderBy($"month").
rowsBetween(-1, +1)
f.withColumn("3-month avg",
avg($"total_flights") over w).
select($"year", $"month",
$"total_flights", $"3-month avg").
orderBy($"year", $"month").show
3 months sliding average (previous, current, next month)
26
SQL: SLIDING WINDOW CUMULATIVE
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"year").
orderBy($"month").
rowsBetween(Window.unboundedPreceding, Window.currentRow)
f.withColumn("cumulative total flights",
sum($"total_flights") over w).
select($"year", $"month",
$"total_flights", $"cumulative total flights").
orderBy($"year", $"month").show
Running total within each year
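In keeping with the Scala == SQL theme, the same running total can be written both ways and compared. The sketch below assumes made-up rows and a window ordered by month within each year; the `running` alias is a hypothetical name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object RunningTotalExample {
  // Returns the running totals computed via the API and via SQL, in that order.
  def runningTotals(spark: SparkSession): (Seq[(Int, Long)], Seq[(Int, Long)]) = {
    import spark.implicits._
    // Made-up (year, month, total_flights) rows.
    val f = Seq((2016, 1, 100L), (2016, 2, 200L), (2016, 3, 300L)).
      toDF("year", "month", "total_flights")

    val w = Window.partitionBy($"year").orderBy($"month").
      rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val viaApi = f.select($"month", sum($"total_flights").over(w) as "running").
      orderBy($"month").as[(Int, Long)].collect().toSeq

    f.createOrReplaceTempView("f")
    val viaSql = spark.sql(
      """SELECT month,
        |       SUM(total_flights) OVER (PARTITION BY year ORDER BY month
        |         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
        |FROM f
        |ORDER BY month""".stripMargin).as[(Int, Long)].collect().toSeq

    (viaApi, viaSql)
  }
}
```

Both forms should produce identical rows, since each is compiled to the same kind of window plan.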
27
SQL: SLIDING WINDOW RANKING
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"year").
orderBy($"total_flights".desc)
f.withColumn("rank",
rank() over w).
select($"year", $"month",
$"total_flights", $"rank").
orderBy($"year", $"month").show
Rank within each year
28
SQL: OVERALL RANKING
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy().
orderBy($"total_flights".desc)
f.withColumn("rank", rank() over w).
select($"year", $"month", $"total_flights", $"rank").
orderBy($"year", $"month").show
Rank of each row in dataset
29
SQL: OVERALL RUNNING TOTAL
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy().
orderBy($"year", $"month").
rowsBetween(Window.unboundedPreceding, Window.currentRow)
f.withColumn("cumulative total flights",
sum($"total_flights") over w).
select($"year", $"month",
$"total_flights",
$"cumulative total flights").
orderBy($"year", $"month").show
30
SQL: WINDOWING – MORE INFO
• Aggregate functions
• Collection functions
• Date Time functions
• Math functions
• Misc. (hashing) functions
• Non-aggregate Functions
• Sorting functions
• String functions
• UDF
• Window functions
31
SQL: CUBES
f.cube("year", "month").
agg(sum("total_flights")).
filter($"month".isNull || $"year".isNull).
orderBy($"year", $"month").show()
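A cube adds subtotal rows for every combination of the grouping columns, with null standing for "all values" of a column. The sketch below shows which rows survive the null filter on a tiny made-up dataset; the data and the `total` alias are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CubeExample {
  // Each tuple is (year, month, total); None marks a subtotal over that column.
  def subtotals(spark: SparkSession): Set[(Option[Int], Option[Int], Long)] = {
    import spark.implicits._
    // Made-up (year, month, total_flights) rows.
    val f = Seq((2015, 1, 10L), (2015, 2, 20L), (2016, 1, 30L)).
      toDF("year", "month", "total_flights")

    f.cube("year", "month").
      agg(sum("total_flights") as "total").
      filter($"month".isNull || $"year".isNull). // keep only the subtotal rows
      as[(Option[Int], Option[Int], Long)].
      collect().toSet
  }
}
```

The result contains per-year totals, per-month totals across years, and one grand total; the per-(year, month) detail rows are filtered out.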
32
References
33
SQL UNDER-THE-HOOD
• 2016 Spark Summit: Structuring Apache Spark 2.0: SQL, DataFrames, Datasets and Streaming (Michael Armbrust)
http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
• 2016 Spark Summit: Deep Dive into Catalyst: Apache Spark 2.0's Optimizer (Yin Huai)
http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
34
Questions?
35
