Spark SQL with Scala Code Examples


A focused look at Spark SQL, Apache Spark's library for structured data, including background information and Scala code examples that use Spark SQL with CSV, JSON, and databases such as MySQL.

  1. 1. Spark SQL Code Examples
  2. 2. Background
     • Spark SQL is Spark's module for working with structured data.
     • Spark SQL lets you query structured data inside Spark programs, using either SQL or the DataFrame API. It is usable from Java, Scala, Python and R.
     • It grew out of the Shark project at Berkeley.
  3. 3. Assumptions
     These slides and examples assume you already have at least a basic understanding of Spark constructs such as RDDs, actions, and transformations.
  5. 5. Introduction
     • DataFrames are distributed collections of data built on top of Resilient Distributed Datasets (RDDs).
     • A DataFrame is composed of Row objects accompanied by a schema that describes the data type of each column.
     • A DataFrame may be considered similar to a table in a traditional relational database.
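The Row-plus-schema idea can be sketched in a few lines. This is a minimal illustration, assuming a Spark 1.x spark-shell session where `sc` and `sqlContext` are predefined; the `Person` case class and its two rows are hypothetical and do not come from the slides.

```scala
// Run inside spark-shell (Spark 1.x), where sc and sqlContext already exist.
import sqlContext.implicits._

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

// Turn an RDD of case-class instances into a DataFrame; the schema
// (column names and types) is derived from the case-class fields.
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 45))).toDF()

// Print the schema that accompanies the rows.
people.printSchema()

// Each record comes back as an org.apache.spark.sql.Row.
val firstRow = people.first()
val firstName = firstRow.getString(0) // value of the "name" column
```

Reading a value by position, as `getString(0)` does here, depends on the column order given by the schema, which is why the two always travel together.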
  6. 6. Spark SQL with CSV
     1. $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
     2. scala> val baby_names = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("baby_names.csv")
     3. scala> baby_names.registerTempTable("names")
     4. scala> val distinctYears = sqlContext.sql("select distinct Year from names")
     5. scala> distinctYears.collect.foreach(println)
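The distinct-year query can be written either as SQL or with the DataFrame API. The sketch below compares the two on a tiny in-memory DataFrame so that it runs without the CSV file. It assumes a Spark 1.x spark-shell session; the sample rows, the `First_Name` column, and the `names_demo` table name are made up for illustration, and only the `Year` column appears in the slides.

```scala
// Run inside spark-shell (Spark 1.x), where sqlContext already exists.
import sqlContext.implicits._

// Hypothetical stand-in for the baby names data.
val namesDemo = Seq(
  (2013, "Sophia"),
  (2013, "Emma"),
  (2014, "Emma")
).toDF("Year", "First_Name")

// SQL route: register a temporary table, then query it.
namesDemo.registerTempTable("names_demo")
val yearsViaSql = sqlContext.sql("select distinct Year from names_demo")

// DataFrame API route: the same query without a SQL string.
val yearsViaApi = namesDemo.select("Year").distinct

yearsViaSql.collect.foreach(println)
yearsViaApi.collect.foreach(println)
```

Both routes should return the same set of years; which one to use is largely a matter of taste and of whether the query text needs to be built dynamically.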
  7. 7. Spark SQL with JSON (slide 1 of 2)
     JSON used in the following examples:
     {"first_name":"James", "last_name":"Butterburg", "address": {"street": "6649 N Blue Gum St", "city": "New Orleans","state": "LA", "zip": "70116" }}
     {"first_name":"Josephine", "last_name":"Darakjy", "address": {"street": "4 B Blue Ridge Blvd", "city": "Brighton","state": "MI", "zip": "48116" }}
     {"first_name":"Art", "last_name":"Chemel", "address": {"street": "8 W Cerritos Ave #54", "city": "Bridgeport","state": "NJ", "zip": "08014" }}
  8. 8. Spark SQL with JSON (slide 2 of 2)
     1. $SPARK_HOME/bin/spark-shell
     2. scala> val customers = sqlContext.jsonFile("customers.json")
     3. scala> customers.registerTempTable("customers")
     4. scala> val firstCityState = sqlContext.sql("SELECT first_name, address.city, address.state FROM customers")
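Nested JSON fields such as `address.city` can also be reached through the DataFrame API. The following is a minimal sketch, assuming a spark-shell session on Spark 1.4 or later, where `sqlContext.read.json` accepts an RDD of JSON strings. It builds the DataFrame from the three sample records in memory instead of reading `customers.json`; the name `customersDemo` is hypothetical.

```scala
// Run inside spark-shell (Spark 1.4+), where sc and sqlContext already exist.

// The three sample records, one JSON document per string.
val records = Seq(
  """{"first_name":"James", "last_name":"Butterburg", "address": {"street": "6649 N Blue Gum St", "city": "New Orleans","state": "LA", "zip": "70116" }}""",
  """{"first_name":"Josephine", "last_name":"Darakjy", "address": {"street": "4 B Blue Ridge Blvd", "city": "Brighton","state": "MI", "zip": "48116" }}""",
  """{"first_name":"Art", "last_name":"Chemel", "address": {"street": "8 W Cerritos Ave #54", "city": "Bridgeport","state": "NJ", "zip": "08014" }}"""
)

// Spark infers the schema, including the nested "address" struct.
val customersDemo = sqlContext.read.json(sc.parallelize(records))
customersDemo.printSchema()

// Dotted names select fields inside the nested struct.
val cityState = customersDemo.select("first_name", "address.city", "address.state")
cityState.collect.foreach(println)
```

Printing the schema first is a quick way to check how Spark interpreted the nesting before writing queries against it.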
  9. 9. Spark SQL with JDBC MySQL (slide 1 of 2)
     Requirements
     1. MySQL instance
     2. MySQL JDBC driver
  10. 10. Spark SQL with JDBC MySQL (slide 2 of 2)
      1. $SPARK_HOME/bin/spark-shell --jars mysql-connector-java-5.1.26.jar
      2. scala> val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/sparksql").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "baby_names").option("user", "root").option("password", "root").load()
      3. scala> dataframe_mysql.registerTempTable("names")
      4. scala> dataframe_mysql.sqlContext.sql("select * from names").collect.foreach(println)
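The same JDBC load can be expressed with `DataFrameReader.jdbc` and a `java.util.Properties` object, which keeps the connection settings in one place. This is a sketch under assumptions: it targets Spark 1.4 or later in a spark-shell session, reuses the URL, table, and credentials from the slide, and wraps the call in a hypothetical helper named `loadTable` so that nothing tries to reach MySQL until the helper is invoked.

```scala
// Run inside spark-shell (Spark 1.4+), started with the MySQL driver jar on the classpath.
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SQLContext}

// Connection settings copied from the slide; adjust them for your own instance.
val connectionProps = new Properties()
connectionProps.setProperty("user", "root")
connectionProps.setProperty("password", "root")
connectionProps.setProperty("driver", "com.mysql.jdbc.Driver")

// Hypothetical helper: no connection is opened until this is called.
def loadTable(ctx: SQLContext, table: String): DataFrame =
  ctx.read.jdbc("jdbc:mysql://localhost/sparksql", table, connectionProps)

// Usage (requires a running MySQL instance with the sparksql database):
// val names = loadTable(sqlContext, "baby_names")
// names.printSchema()
```

Hard-coded credentials are acceptable for a local demo like this one; anything beyond that would normally read them from configuration or the environment.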
