Comprehensive View on Intervals in Apache Spark 3.2
Maxim Gekk, Software Engineer @ Databricks
An overview of intervals in Apache Spark before version 3.2 and the changes coming in future releases, including conformance to the ANSI SQL standard. We discuss existing issues of the interval APIs and how they are solved by the new types: year-month interval and day-time interval. I show how to use the interval API of Spark SQL and PySpark, how to avoid potential problems, how to construct intervals from external types, and how to save/load intervals via Spark's built-in datasources.
  1. Comprehensive View on Intervals in Apache Spark 3.2. Maxim Gekk, Software Engineer @ Databricks
  2. About me: Databricks Software Engineer, Apache Spark committer, @MaxGekk
  3. Agenda
     ▪ Overview of the new interval types in Spark 3.2
     ▪ Limitations of CalendarIntervalType
     ▪ Year-Month Interval
     ▪ Day-Time Interval
  4. SPARK-27790: Support ANSI SQL INTERVAL types
     • Spark SQL 3.2 introduces two new Catalyst types: the year-month interval and the day-time interval.
     • CalendarIntervalType is no longer recommended and will be deprecated.
  5. CalendarIntervalType is a combination of (months, days, microseconds)
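A minimal plain-Python illustration (no Spark needed) of why a value mixing months with days and microseconds cannot be totally ordered: the length of "1 month" in days depends on the date it is applied to, so a month component and a day component are not mutually comparable. The helper below is purely illustrative, not Spark code.

```python
from datetime import date

def add_one_month(d: date) -> date:
    # Naive month increment, assuming the day-of-month stays valid
    # in the target month (true for the 1st of any month).
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return d.replace(year=year, month=month)

# "1 month" spans 28 days starting from Feb 1, but 31 days from Mar 1,
# so there is no single answer to "is 1 month bigger than 30 days?"
print((add_one_month(date(2021, 2, 1)) - date(2021, 2, 1)).days)  # 28
print((add_one_month(date(2021, 3, 1)) - date(2021, 3, 1)).days)  # 31
```

This ambiguity is exactly what splitting the type into a pure year-month interval and a pure day-time interval removes.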
  6. Problems of CalendarIntervalType: not comparable
  7. Problems of CalendarIntervalType: unordered
  8. Problems of CalendarIntervalType:
     I. Cannot be persisted to any external storage
     II. Inefficient memory usage: ~12 bytes per value
     III. Incompatible with the SQL standard
  9. SQL standard interval types
  10. Year-Month Interval = (YEAR, MONTH); Day-Time Interval = (DAY, HOUR, MINUTE, SECOND)
  11. New Catalyst types in Apache Spark 3.2
      • YearMonthIntervalType
        ▪ Precision: months
        ▪ Comparable and orderable
        ▪ Value size: 4 bytes
        ▪ Minimum value: INTERVAL '-178956970-8' YEAR TO MONTH
        ▪ Maximum value: INTERVAL '178956970-7' YEAR TO MONTH
      • DayTimeIntervalType
        ▪ Precision: microseconds
        ▪ Comparable and orderable
        ▪ Value size: 8 bytes
        ▪ Minimum value: INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND
        ▪ Maximum value: INTERVAL '106751991 04:00:54.775807' DAY TO SECOND
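A quick plain-Python sanity check of the bounds above, assuming (as the 4-byte and 8-byte value sizes suggest) that a year-month interval stores months as a signed 32-bit int and a day-time interval stores microseconds as a signed 64-bit long:

```python
INT32_MAX = 2**31 - 1   # total months in the largest year-month interval
INT64_MAX = 2**63 - 1   # total microseconds in the largest day-time interval

years, months = divmod(INT32_MAX, 12)
print(f"{years}-{months}")                  # 178956970-7

days, rem = divmod(INT64_MAX, 86_400_000_000)   # microseconds per day
hours, rem = divmod(rem, 3_600_000_000)         # microseconds per hour
minutes, rem = divmod(rem, 60_000_000)          # microseconds per minute
print(f"{days} {hours:02}:{minutes:02}:{rem / 1_000_000:.6f}")
# 106751991 04:00:54.775807
```

The decomposition reproduces exactly the literal bounds on the slide, which is why the two types fit in 4 and 8 bytes respectively.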
  12. Creation of interval columns
      • From interval strings
        ▪ Interval literals: INTERVAL '1-1' YEAR TO MONTH, INTERVAL '1 02:03:04' DAY TO SECOND
        ▪ Casting a string to an interval type: $"col".cast(YearMonthIntervalType)
      • From external types
        ▪ Parallelize collections of java.time.Period: Seq(Period.ofDays(10)).toDS
        ▪ From collections of java.time.Duration: Seq(Duration.ofDays(10)).toDS
      • From integral fields
        ▪ Function-constructor of interval types: make_interval(1, 2), make_interval(1, 2, 3, 4, 5.123)
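To make the 'Y-M' literal format concrete, here is a hypothetical parser (the function name and regex are my own illustration, not Spark's parser) showing how a string such as '1-1' maps to a single signed month count, the internal unit of the year-month interval type:

```python
import re

def parse_year_month(literal: str) -> int:
    """Parse a "[-]Y-M" string into a signed total number of months."""
    m = re.fullmatch(r"(-?)(\d+)-(\d+)", literal)
    if m is None:
        raise ValueError(f"not a year-month literal: {literal!r}")
    sign = -1 if m.group(1) else 1
    return sign * (int(m.group(2)) * 12 + int(m.group(3)))

print(parse_year_month("1-1"))           # 13 months
print(parse_year_month("-178956970-8"))  # -2147483648, i.e. 32-bit Int.MinValue
```

Note how the minimum literal from the previous slide lands exactly on the smallest 32-bit integer.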
  13. Operations involving datetimes and intervals: arithmetic operations involving values of type datetime or interval obey the natural rules associated with dates and times and yield valid datetime or interval results according to the Gregorian calendar.
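A hedged plain-Python sketch of one such "natural rule": adding a year-month interval to a date rolls the months over and clamps the day of month to the target month's length. The clamping convention shown here is a common one and an assumption on my part, not lifted from Spark's source.

```python
import calendar
from datetime import date

def add_months(d: date, n: int) -> date:
    # Convert to an absolute month count, shift, then convert back.
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    month = month0 + 1
    # Clamp the day so e.g. Jan 31 + 1 month stays a valid date.
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(add_months(date(2021, 1, 31), 1))  # 2021-02-28 (day clamped)
print(add_months(date(2020, 1, 31), 1))  # 2020-02-29 (leap year)
```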
  14. Interval operations in Apache Spark 3.2
      ▪ YearMonthIntervalType [* | /] NumericType = YearMonthIntervalType
      ▪ YearMonthIntervalType [+ | -] YearMonthIntervalType = YearMonthIntervalType
      ▪ DayTimeIntervalType [+ | - | * | /] NumericType = DayTimeIntervalType
      ▪ TimestampType - TimestampType = DayTimeIntervalType
      ▪ DateType - DateType = DayTimeIntervalType
      ▪ DateType [+ | -] YearMonthIntervalType = DateType
      ▪ TimestampType [+ | -] YearMonthIntervalType = TimestampType
      ▪ DateType [+ | -] DayTimeIntervalType = TimestampType
      ▪ TimestampType [+ | -] DayTimeIntervalType = TimestampType
  15. date + day-time interval: [SPARK-35051][SQL] Support add/subtract of a day-time interval to/from a date
  16. spark.sql.legacy.interval.enabled
      • When set to true, Spark SQL uses the mixed legacy interval type CalendarIntervalType instead of the ANSI-compliant interval types YearMonthIntervalType and DayTimeIntervalType.
      • It impacts:
        ▪ Date and timestamp subtraction
        ▪ Parsing of ANSI interval literals: INTERVAL '1 02:03:04' DAY TO SECOND
  17. Daylight saving time
  18. External Java types
      ▪ java.time.Period: models a quantity or amount of time in terms of years, months and days. Spark takes the years and months fields only.
      ▪ java.time.Duration: models a quantity or amount of time in terms of seconds and nanoseconds. Spark casts the nanoseconds to microseconds.
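A small sketch of the narrowing described above for java.time.Duration, expressed in plain Python. It assumes simple integer division of nanoseconds down to whole microseconds; the exact rounding behavior in Spark is not shown on the slide, so treat this as an approximation.

```python
def duration_to_micros(seconds: int, nanos: int) -> int:
    """Collapse a (seconds, nanoseconds) pair into total microseconds,
    truncating sub-microsecond precision."""
    return seconds * 1_000_000 + nanos // 1_000

# 1 s + 999,999,999 ns: the last three nanosecond digits are lost.
print(duration_to_micros(1, 999_999_999))  # 1999999
```

The takeaway is that DayTimeIntervalType's microsecond precision cannot represent the full nanosecond resolution of java.time.Duration.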
  19. ANSI intervals in UDF/UDAF: Day-Time Interval, Year-Month Interval
  20. Specification of interval types in schemas
      • Year-Month Interval type:
        CREATE TABLE tbl (id INT, delay INTERVAL YEAR TO MONTH)
      • Day-Time Interval type:
        CREATE TABLE tbl (len INT, tout INTERVAL DAY TO SECOND)
  21. SPARK-27790: Support ANSI SQL INTERVAL types
      Milestone 1 – Spark interval equivalency: the new interval types meet or exceed all functionality of the existing SQL interval.
      Milestone 2 – Persistence: ability to create tables of interval type; ability to write to common file formats such as Parquet and JSON; INSERT, SELECT, UPDATE, MERGE; discovery.
      Milestone 3 – Client support: JDBC support; Hive Thrift server.
      Milestone 4 – PySpark and SparkR integration: Python UDFs can take and return intervals; DataFrame support.
  22. Feedback: your feedback is important to us. Don't forget to rate and review the sessions.