Apache big-data-2017-scala-sql

1 © 2016, Conversant, LLC. All rights reserved.
SCALA + SQL = UNION OF TWO EQUALS
APACHE BIG DATA NORTH AMERICA 2017 PRESENTED BY:
JAYESH THAKRAR
SENIOR SOFTWARE ENGINEER

2
SCALA + SQL = UNION OF TWO EQUALS
1. What does it mean?
Why does it matter?
2. Is Scala == SQL?
Show me!
3. Final Notes...

4
JDBC / ODBC
Host Language: C, Java, Python, etc.
Embedded SQL
• SQL = Set-level processing
• Host Language = Row-level, iterative processing using a single iterator (like drinking from a firehose!)

5
SPARK FRAMEWORK
Scala, R, Python, SQL
• Common practice: Scala, Python, R for data in / out; SQL for data processing
• Scala = Spark Scala API
• Scala fuses seamlessly with Spark DSL
• Can munge SQL result as a distributed collection

6
WHY USE SCALA API?
• It's cool!
• Compact code and often more intuitive & self-explanatory
• Additional help from IDEs when using Scala API
• Chained methods / pipelines help visualize processing flow
• Ability to mix-and-match Scala API and SQL
• Ability to do "in-line-processing" using lambda functions
• Scala implicits allow further extending Scala API
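
The mix-and-match and lambda points above can be sketched as follows. This is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master; the toy rows and the threshold of 115 are made up for illustration, and only the column names come from the talk's flight dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mix-and-match").getOrCreate()
import spark.implicits._

// Toy rows (made-up values) using the flight dataset's column names.
val f = Seq(
  (2015, 1, 100L, 10L, 110L),
  (2015, 2, 120L, 12L, 132L),
  (2016, 1, 90L, 30L, 120L)
).toDF("year", "month", "domestic_flights", "international_flights", "total_flights")

// Register the DataFrame so it can be queried with SQL.
f.createOrReplaceTempView("f")

// SQL for the set-level part ...
val busy = spark.sql("SELECT year, month, total_flights FROM f WHERE total_flights > 115")

// ... then the typed Scala API with a lambda for row-level "in-line" processing.
val labels = busy.as[(Int, Int, Long)]
  .map { case (year, month, total) => s"$year-$month: $total flights" }
  .collect()
  .sorted
```

The SQL result is an ordinary Dataset, so any API call (or a plain Scala lambda) can follow it in the same pipeline.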

7
GOALS OF THIS TALK
• Show Scala (API) can do the same as SQL
• Make using Scala API more comfortable

9
SQL AND SCALA API UNDER-THE-HOOD
Source: https://www.slideshare.net/databricks/deep-dive-into-catalyst-apache-spark-20s-optimizer?qid=8b68ce4b-5aae-4e53-adf1-5cf5259a78b6&v=&b=&from_search=1

10
SAMPLE DATA
Derived from: https://www.transtats.bts.gov/Data_Elements.aspx?Data=1
Flight Data Passenger Data

11
SAMPLE DATA SCHEMA
scala> f
res9: org.apache.spark.sql.Dataset[Flight] =
[year: int, month: int ... 3 more fields]
scala> f.printSchema
root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_flights: long (nullable = true)
|-- international_flights: long (nullable = true)
|-- total_flights: long (nullable = true)
scala> p
res11: org.apache.spark.sql.Dataset[Passenger] =
[year: int, month: int ... 3 more fields]
scala> p.printSchema
root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_passengers: long (nullable = true)
|-- international_passengers: long (nullable = true)
|-- total_passengers: long (nullable = true)

12
SQL: SELECT CLAUSE / PROJECTION
SELECT year FROM f
f.select("year")
f.select($"year")
f.select(f("year"))
f.select(col("year"))
SELECT year AS yyyy
FROM f
f.select($"year" as "yyyy")
f.withColumn("yyyy", $"year").
select($"yyyy")

14
SQL: NEW COLUMNS, EXPRESSIONS
// arithmetic expression
f.select($"domestic_flights" / $"total_flights" as "flight_ratio")
// boolean expression
f.select($"domestic_flights" + $"international_flights" === $"total_flights")
// conditional expression
f.select($"domestic_flights" / $"international_flights" > 9.0 as "too_high")
// Adding a new column (withColumn takes the new column name first, then the expression)
f.withColumn("flight_ratio", $"domestic_flights" / $"international_flights")

15
SQL: CASE
SELECT year,
CASE
WHEN month = 1
THEN "Jan"
WHEN month = 2
THEN "Feb"
.....
ELSE "Error"
END
f.select($"year",
when($"month" === 1, "Jan").
when($"month" === 2, "Feb").
when($"month" === 3, "Mar").
.....
otherwise("Error")
)
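
A complete, runnable variant of the when / otherwise chain might look like the sketch below. It assumes Spark 2.x or later and a local master; the toy rows and the `month_name` alias are made up for illustration, and only two months are mapped so that the fallback branch is exercised.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("case-when").getOrCreate()
import spark.implicits._

// Made-up rows; month 7 is deliberately left unmapped to hit the fallback branch.
val f = Seq((2015, 1), (2015, 2), (2015, 7)).toDF("year", "month")

// Each when(...) is one WHEN ... THEN branch; otherwise(...) is the ELSE.
val named = f.select($"year", $"month",
  when($"month" === 1, "Jan").
  when($"month" === 2, "Feb").
  otherwise("Error") as "month_name")

val names = named.orderBy($"month").select("month_name").as[String].collect()
```

If `otherwise` is left out, unmatched rows get a null, just as a SQL CASE without ELSE does.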

16
SQL: WHERE CLAUSE
WHERE month = 12
WHERE (month = 12 AND year > 2010)
OR (domestic_flights > 880000)
WHERE year IN (2004, 2007)
WHERE month IS NULL
WHERE month IS NOT NULL
f.where("month = 12") // SQL syntax
f.where($"month" === 12) // API syntax
f.where(
($"month" === 12 && $"year" > 2010)
|| f("domestic_flights") > 880000)
f.filter(f("year").isin(2004, 2007))
f.where($"month".isNull)
f.filter($"month".isNotNull)

17
SQL: AGGREGATE FUNCTIONS
SELECT SUM(domestic_flights),
AVG(domestic_flights),
COUNT(domestic_flights)
FROM f
SELECT year,
MIN(domestic_flights) as min,
MAX(domestic_flights) as max,
AVG(domestic_flights) as avg
FROM f
GROUP BY year
HAVING min > 750000
ORDER BY year
f.select( sum("domestic_flights"),
avg("domestic_flights"),
count("domestic_flights"))
f.groupBy($"year").
agg(min($"domestic_flights") as "min",
max($"domestic_flights") as "max",
avg($"domestic_flights") as "avg").
filter($"min" > 750000).
orderBy($"year")
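
To check that the GROUP BY / HAVING query and the groupBy / agg / filter chain agree, both can be run on the same toy data. This is a sketch under assumptions: Spark 2.x or later, a local master, made-up rows, and a threshold of 95 chosen to fit them. The HAVING clause repeats the aggregate rather than using the alias, purely to keep the sketch unambiguous.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("aggregates").getOrCreate()
import spark.implicits._

// Made-up rows with the flight dataset's column names.
val f = Seq(
  (2015, 1, 100L), (2015, 2, 120L),
  (2016, 1, 90L), (2016, 2, 130L)
).toDF("year", "month", "domestic_flights")
f.createOrReplaceTempView("f")

val viaSql = spark.sql(
  """SELECT year,
    |       MIN(domestic_flights) AS min,
    |       MAX(domestic_flights) AS max
    |FROM f
    |GROUP BY year
    |HAVING MIN(domestic_flights) > 95
    |ORDER BY year""".stripMargin)

// HAVING becomes a filter placed after agg; WHERE would be a filter before groupBy.
val viaApi = f.groupBy($"year").
  agg(min($"domestic_flights") as "min",
      max($"domestic_flights") as "max").
  filter($"min" > 95).
  orderBy($"year")

val sqlRows = viaSql.collect().toSeq
val apiRows = viaApi.collect().toSeq
```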

18
SQL: JOIN – USING JOIN
scala> f.join(p, f("year") === p("year") && f("month") === p("month")).printSchema
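root
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_flights: long (nullable = true)
|-- international_flights: long (nullable = true)
|-- total_flights: long (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- domestic_passengers: long (nullable = true)
|-- international_passengers: long (nullable = true)
|-- total_passengers: long (nullable = true)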

19
SQL: JOIN – USING JOINWITH
scala> f.joinWith(p, f("year") === p("year") && f("month") === p("month")).printSchema
root
|-- _1: struct (nullable = false)
| |-- year: integer (nullable = true)
| |-- month: integer (nullable = true)
| |-- domestic_flights: long (nullable = true)
| |-- international_flights: long (nullable = true)
| |-- total_flights: long (nullable = true)
|-- _2: struct (nullable = false)
| |-- year: integer (nullable = true)
| |-- month: integer (nullable = true)
| |-- domestic_passengers: long (nullable = true)
| |-- international_passengers: long (nullable = true)
| |-- total_passengers: long (nullable = true)

20
JOIN TYPES
f.joinWith(p, f("year") === p("year") && f("month") === p("month"), "inner")
Supported join types include:
'inner',
'outer',
'full',
'fullouter',
'leftouter',
'left',
'rightouter',
'right',
'leftsemi',
'leftanti'.
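
The effect of the join type string can be seen on a small example. The sketch below uses `join` (not `joinWith`) with three of the listed types; it assumes Spark 2.x or later and a local master, and the rows are made up so that one flight row has no matching passenger row.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-types").getOrCreate()
import spark.implicits._

// Made-up rows: 2016-01 exists in f but not in p.
val f = Seq((2015, 1, 110L), (2015, 2, 132L), (2016, 1, 120L))
  .toDF("year", "month", "total_flights")
val p = Seq((2015, 1, 9000L), (2015, 2, 9500L))
  .toDF("year", "month", "total_passengers")

val cond = f("year") === p("year") && f("month") === p("month")

val inner = f.join(p, cond, "inner")         // only rows present on both sides
val leftOuter = f.join(p, cond, "leftouter") // all f rows, nulls where p is missing
val leftAnti = f.join(p, cond, "leftanti")   // f rows with no match in p

val unmatched = leftAnti.as[(Int, Int, Long)].collect()
```

A left anti join returns only the left side's columns, which is why the result can be read back as a three-field tuple.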

21
OPERATIONS ON JOIN RESULT
f.join(p, f("year") === p("year") &&
f("month") === p("month")).
groupBy(f("year")).
agg(collect_list("total_flights")
as "total_flight_list",
collect_list("total_passengers")
as "total_passengers").printSchema
root
|-- total_flight_list: array (nullable = true)
| |-- element: long (containsNull = true)
|-- total_passengers: array (nullable = true)
f.joinWith(p, f("year") === p("year") &&
f("month") === p("month")).
groupBy($"_1.year").
agg(collect_list($"_1.total_flights")
as "total_flight_list",
collect_list($"_2.total_passengers")
as "total_passengers").printSchema
root
|-- total_flight_list: array (nullable = true)
|-- total_passengers: array (nullable = true)

23
SQL: FUNCTIONS
• Aggregate functions
• Collection functions
• Date Time functions
• Math functions
• Misc. (hashing) functions
• Non-aggregate Functions
• Sorting functions
• String functions
• UDF
• Window functions
Need to import org.apache.spark.sql.functions._ before any of these SQL functions can be used
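
As an illustration of the UDF entry in the list above, a user-defined function can be wrapped with `udf` and then used like any built-in column function. This is a minimal sketch, assuming Spark 2.x or later and a local master; the name `quarterOf`, the `quarter` column, and the rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._ // brings udf, col, sum, avg, ... into scope

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Made-up rows.
val f = Seq((2015, 1), (2015, 5), (2015, 12)).toDF("year", "month")

// Hypothetical UDF: map a month number (1-12) to its calendar quarter (1-4).
val quarterOf = udf((month: Int) => (month - 1) / 3 + 1)

val withQuarter = f.withColumn("quarter", quarterOf($"month"))
val quarters = withQuarter.orderBy($"month").select("quarter").as[Int].collect()
```

Built-in functions are generally preferable where one exists, since the optimizer can see inside them but treats a UDF as a black box.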

24
SQL: SLIDING WINDOW
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"year").
orderBy($"year", $"month")
f.withColumn("month + 1",
lead($"total_flights", 1) over w).
withColumn("month + 2",
lead($"total_flights", 2) over w).
select($"year", $"month",
$"total_flights" as "current month",
$"month + 1", $"month + 2").
filter($"month".isin(1,4,7,10)).
orderBy($"year", $"month").show
3 months of quarterly data in a row

25
SQL: SLIDING WINDOW AGGREGATION
val w = Window.partitionBy($"year").
orderBy($"year", $"month").
rowsBetween(-1, +1)
f.withColumn("3-month avg",
avg($"total_flights") over w).
select($"year", $"month",
$"total_flights", $"3-month avg")
3-month sliding average (previous, current, next month)

26
SQL: SLIDING WINDOW CUMULATIVE
val w = Window.partitionBy($"year").
orderBy($"year", $"month").
rowsBetween(Long.MinValue, 0)
f.withColumn("cumulative total flights",
sum($"total_flights") over w).
select($"year", $"month",
$"total_flights", $"cumulative total flights")
Running total within each year
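
The sliding average and the running total differ only in the window frame, which a small runnable example makes visible. This is a sketch under assumptions: Spark 2.1 or later (for the named frame constants `Window.unboundedPreceding` and `Window.currentRow`, used here in place of `Long.MinValue` and `0`), a local master, and made-up rows; the column names `avg3` and `running` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("window-frames").getOrCreate()
import spark.implicits._

// Made-up rows: three months of 2015 and one month of 2016.
val f = Seq((2015, 1, 10L), (2015, 2, 20L), (2015, 3, 30L), (2016, 1, 5L))
  .toDF("year", "month", "total_flights")

val byYear = Window.partitionBy($"year").orderBy($"month")

// Frame of previous, current, and next row within the partition.
val sliding = byYear.rowsBetween(-1, 1)
// Frame from the start of the partition up to the current row.
val running = byYear.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val result = f.
  withColumn("avg3", avg($"total_flights") over sliding).
  withColumn("running", sum($"total_flights") over running).
  orderBy($"year", $"month")

val averages = result.select("avg3").as[Double].collect()
val runningTotals = result.select("running").as[Long].collect()
```

At partition edges the sliding frame simply has fewer rows, so January's average covers only January and February.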

27
SQL: SLIDING WINDOW RANKING
val w = Window.partitionBy($"year").
orderBy($"total_flights".desc)
f.withColumn("rank",
rank over w).
select($"year", $"month",
$"total_flights", $"rank")
Rank within each year

28
SQL: OVERALL RANKING
val w = Window.partitionBy().
orderBy($"total_flights".desc)
f.withColumn("rank", rank over w).
select($"year", $"month", $"total_flights", $"rank")
Rank of each row in dataset

29
SQL: OVERALL RUNNING TOTAL
val w = Window.partitionBy().
orderBy($"year", $"month").
rowsBetween(Long.MinValue, 0)
f.withColumn("cumulative total flights",
sum($"total_flights") over w).
select($"year", $"month",
$"total_flights",
$"cumulative total flights")

30
SQL: WINDOWING – MORE INFO
• Aggregate functions
• Collection functions
• Date Time functions
• Math functions
• Misc. (hashing) functions
• Non-aggregate Functions
• Sorting functions
• String functions
• UDF
• Window functions

31
SQL: CUBES
f.cube("year", "month").
agg(sum("total_flights")).
filter($"month".isNull || $"year".isNull).
orderBy($"year", $"month").show()
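
For comparison with SQL, the same cube can be written with GROUP BY ... WITH CUBE. The sketch below runs both forms on toy data; it assumes Spark 2.x or later and a local master, the rows are made up, and the `total` alias is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical local session; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cube-sketch").getOrCreate()
import spark.implicits._

// Made-up rows: two years with two months each.
val f = Seq((2015, 1, 10L), (2015, 2, 20L), (2016, 1, 30L), (2016, 2, 40L))
  .toDF("year", "month", "total_flights")
f.createOrReplaceTempView("f")

// cube produces every (year, month) group plus per-year, per-month, and grand totals.
val viaApi = f.cube("year", "month").agg(sum("total_flights") as "total")
val viaSql = spark.sql(
  "SELECT year, month, SUM(total_flights) AS total FROM f GROUP BY year, month WITH CUBE")

// Keep only the subtotal rows; a null marks the dimension that was rolled up.
val subtotals = viaApi.filter($"month".isNull || $"year".isNull)
val grandTotal = viaApi.filter($"year".isNull && $"month".isNull)
  .select("total").as[Long].head()
```

The null test identifies subtotal rows only when the grouping columns themselves contain no nulls, which holds for this toy data.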

33
SQL UNDER-THE-HOOD
• 2016 Spark Summit: Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming (Michael Armbrust)
http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming-62871797
• 2016 Spark Summit: Deep Dive Into Catalyst: Apache Spark 2.0's Optimizer (Yin Huai)
http://www.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120
