Introduction to Dataset API
Overcoming limitations of Dataframes
https://github.com/shashankgowdal/introduction_to_dataset
● Shashank L
● Big data consultant and trainer at datamantra.io
● www.shashankgowda.com
Agenda
● History of Spark APIs
● Limitations of Dataframes
● Dataset
● Encoders
● Dataset hierarchy
● Performance
● Roadmap
RDD API (2011)
● Distributed collection for JVM objects
● Immutable and Fault tolerant
● Processing structured and unstructured data
● Functional transformations
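A minimal sketch of this style (assuming a SparkContext sc in scope; the file name is illustrative):
// Lazy, functional transformations over a distributed collection of JVM objects
val lines = sc.textFile("events.log") // RDD[String]
val errors = lines.filter(_.contains("ERROR"))
val counts = errors.map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _) // aggregated per key across the cluster
counts.collect().foreach(println)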
Limitations of RDD API
● No schema associated
● Optimization must be done by the user
● Reading from multiple sources is difficult
● Combining multiple sources is difficult
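For example, with no schema, even a simple CSV source needs hand-rolled parsing before anything can be combined (a hypothetical sketch; the file layout is an assumption):
// Every field is an unparsed string; the (name, age) layout exists only by convention
val csv = sc.textFile("people.csv").map(_.split(","))
val byName = csv.map(cols => (cols(0), cols(1).toInt))
// A wrong index or type fails only at runtime, and none of this is visible to an optimizer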
DataFrame API (2013)
● Distributed collection for Row objects
● Immutable and Fault tolerant
● Processing structured data
● Optimization from Catalyst optimizer
● Data source API
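A sketch of the same ideas with DataFrames (assuming a SQLContext sqlContext; file and column names are illustrative):
val people = sqlContext.read.json("people.json") // data source API infers the schema
val dept = sqlContext.read.parquet("departments.parquet")
people.join(dept, people("deptId") === dept("id")) // Catalyst optimizes the whole plan
  .groupBy(dept("name"))
  .count()
  .show()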
Limitations of Dataframe
● No compile-time type safety
● Cannot operate directly on domain objects
● No functional programming API
Compile time safety
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 1000").show()
Compiles fine, but throws a runtime exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'salary' given input columns age, name;
Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))
//Create RDD[Person]
val personDF = sqlContext.createDataFrame(personRDD)
//Create dataframe from a RDD[Person]
personDF.rdd
//We get back RDD[Row] and not RDD[Person]
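Getting domain objects back out requires manual, unchecked Row conversion (a sketch; field positions are assumed):
val people = personDF.rdd.map { row =>
  Person(row.getString(0), row.getInt(1)) // positions and types are not verified by the compiler
}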
Dataset
An extension of the DataFrame API that provides a type-safe, object-oriented programming interface
Dataset API
● Type-safe: Operate on domain objects with compiled
lambda functions
● Fast: Code generated encoders for fast serialization
● Interoperable: Easily convert between DataFrame and Dataset
without boilerplate code
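A minimal sketch of the round trip (Spark 1.6 style; assumes a Person case class matching the JSON):
import sqlContext.implicits._
case class Person(name: String, age: Long)
val df = sqlContext.read.json("people.json") // DataFrame
val ds = df.as[Person] // DataFrame -> Dataset[Person]
val back = ds.toDF() // Dataset[Person] -> DataFrame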
Encoders
● An encoder converts a JVM object into a Dataset row
● Code generated encoders for fast serialization
[Diagram: JVM object → Encoder → Dataset row]
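A sketch of where the encoder enters (Spark 1.6 style; the implicit Encoder[Person] comes from sqlContext.implicits._):
import sqlContext.implicits._ // brings code generated encoders into scope
case class Person(name: String, age: Long)
val ds = sqlContext.createDataset(Seq(Person("A", 10), Person("B", 20)))
// The encoder serializes each Person directly into Tungsten's binary row
// format, avoiding generic Java/Kryo serialization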
Compile time safety check
case class Person(name: String, age: Long)
val dataframe = sqlContext.read.json("people.json")
val ds : Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.age > 25)
ds.filter(p => p.salary > 12500)
//error: value salary is not a member of Person
Operating on domain objects
val personRDD = sc.makeRDD(Seq(Person("A",10), Person("B",20)))
//Create RDD[Person]
val personDS = sqlContext.createDataset(personRDD)
//Create Dataset from a RDD
personDS.rdd
//We get back RDD[Person], not the RDD[Row] a DataFrame would give
Functional programming
case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
val ds : Dataset[Person] = dataframe.as[Person]
// Compute histogram of age by name
val hist = ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
Dataset hierarchy
[Stack diagram: SQL and DataFrame (& Dataset) APIs, exposed to Python, R, and Scala/Java, all running on Tungsten execution]
Hands on
● Creating dataset
○ From Collections
○ From File
● Comparison with RDD
○ Operations
○ Distributed Wordcount
● Semistructured data
○ Downcast
○ Upcast
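A few of these, sketched in Spark 1.6 style (the file name is illustrative):
import sqlContext.implicits._
val nums = sqlContext.createDataset(1 to 10) // Dataset from a collection
val lines = sqlContext.read.text("hamlet.txt").as[String] // Dataset from a file
// Distributed word count with RDD-like operators on a Dataset
val counts = lines.flatMap(_.split(" "))
  .filter(_ != "")
  .groupBy(_.toLowerCase)
  .count() // Dataset[(String, Long)]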
Performance
[Benchmark charts: execution performance, caching memory usage, serialization performance]
Roadmap
● Dataset, the name itself may change
● Performance optimizations
● Unification of DataFrames with Dataset
● Public API for Encoders
● Support for most of the RDD operators on Dataset
Unification of DataFrames with Dataset
class Dataset[T](
    val sqlContext: SQLContext,
    val queryExecution: QueryExecution)(
    implicit val encoder: Encoder[T])

class DataFrame(
    sqlContext: SQLContext,
    queryExecution: QueryExecution)
  extends Dataset[Row](sqlContext, queryExecution)(new RowEncoder)
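(This roadmap item eventually landed: in Spark 2.x, DataFrame became a plain type alias, type DataFrame = Dataset[Row].)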
References
● https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
● https://issues.apache.org/jira/browse/SPARK-9999
● https://goo.gl/Wqc561 - API design
