Spark SQL is a module for structured data processing on Spark. It integrates relational processing with Spark's functional programming API and allows SQL queries to be executed over data sources via the Spark execution engine. Spark SQL includes components such as a SQL parser, the Catalyst optimizer, and the Spark execution engine for queries. It supports HiveQL queries, SQL queries, and APIs in Scala, Java, and Python.
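As a hedged illustration, a minimal end-to-end query might look like the following sketch (Spark 1.x-era API, matching the internals covered in these slides; the input file "people.json" and its fields are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: run a SQL query over a JSON source through the Spark engine.
val sc = new SparkContext(new SparkConf().setAppName("SqlExample").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val people = sqlContext.jsonFile("people.json")  // hypothetical input; schema is inferred
people.registerTempTable("people")               // expose the data as a SQL table

// The query text is parsed, optimized by Catalyst, and executed as RDD operations.
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)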
13. Row API
trait Row extends Seq[Any] with Serializable {
  def apply(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getFloat(i: Int): Float
  def getBoolean(i: Int): Boolean
  def getShort(i: Int): Short
  def getByte(i: Int): Byte
  def getString(i: Int): String
  def getAs[T](i: Int): T
}
The Row class is the key data structure, widely used both internally and externally in Spark SQL.
def getAs[T] is used for non-primitive data types.
Field values are represented as native language data types.
Field types are represented as DataTypes, described in the previous slide.
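As a minimal sketch of these accessors (assuming the public Row factory, Row(...), available from Spark 1.3; the values are hypothetical):

import org.apache.spark.sql.Row

// Hypothetical row holding an Int, a String, a Double, and a null field.
val row = Row(1, "alice", 95.5, null)

val id     = row.getInt(0)     // primitive accessor
val name   = row.getString(1)
val score  = row.getDouble(2)
val isNull = row.isNullAt(3)   // true: check before calling a primitive getter

// getAs[T] covers non-primitive (or generic) field types.
val nameAgain = row.getAs[String](1)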
17. Spark Plan (Physical Plan)
SparkPlan is the root class of Spark Plan operators (the physical plan operators for Spark).
Spark Plan operators:
Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash, etc.
Aggregate: Aggregate
BasicOperators: Distinct, Except, Filter, Limit, Project, Sort, Union, etc.
Shuffle: AddExchange, Exchange
Commands: CacheTableCommand, DescribeCommand, ExplainCommand, etc.
...
Spark Strategy (SparkPlanner)
Maps the optimized logical plan to a Spark Plan, as in the toy sketch below.
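A self-contained toy sketch of this mapping idea (all names below are hypothetical stand-ins, not Catalyst's real classes): a strategy pattern-matches logical operators and emits physical counterparts, planning children recursively.

// Toy logical and physical operator trees (hypothetical names).
sealed trait LogicalOp
case class LogicalFilter(pred: String, child: LogicalOp) extends LogicalOp
case class LogicalScan(table: String) extends LogicalOp

sealed trait PhysicalOp
case class PhysicalFilter(pred: String, child: PhysicalOp) extends PhysicalOp
case class PhysicalScan(table: String) extends PhysicalOp

// The "planner": map each logical operator to a physical one.
def plan(op: LogicalOp): PhysicalOp = op match {
  case LogicalFilter(p, c) => PhysicalFilter(p, plan(c))
  case LogicalScan(t)      => PhysicalScan(t)
}

// plan(LogicalFilter("age > 21", LogicalScan("people")))
//   == PhysicalFilter("age > 21", PhysicalScan("people"))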
abstract class SparkPlan {
  def children: Seq[SparkPlan]

  /** Specifies how data is partitioned across different nodes in the cluster. */
  def outputPartitioning: Partitioning = UnknownPartitioning(0)

  /** Specifies any partition requirements on the input data for this operator. */
  def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)

  /** Runs this operator, producing the result as an RDD of Rows. */
  def execute(): RDD[Row]
}
[Diagram] Spark execution pipeline: Optimized Logical Plan → Spark Plan → RDD
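To watch this pipeline on a real query, each planning phase can be inspected through queryExecution (a developer API in Spark 1.x; explain(extended) is the DataFrame form from Spark 1.3; the "people" table is the hypothetical one registered earlier):

// Inspect each planning phase of a query (continuing the earlier sqlContext).
val query = sqlContext.sql("SELECT name FROM people WHERE age > 21")

println(query.queryExecution.optimizedPlan)  // optimized logical plan
println(query.queryExecution.sparkPlan)      // physical plan chosen by SparkPlanner
println(query.queryExecution.executedPlan)   // physical plan after exchanges are added

query.explain(true)  // prints the parsed, analyzed, optimized, and physical plans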