A Step to Programming with Apache Spark
Rahul Kumar
Trainee - Software Consultant
Knoldus Software LLP
Building Spark :
1. Pre-built Spark
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
2. Source Code
http://mirror.fibergrid.in/apache/spark/spark-1.6.2/spark-1.6.2.tgz
Go to the SPARK_HOME directory.
Execute : mvn -Dhadoop.version=1.2.1 -Phadoop-1 -DskipTests clean package
To start Spark,
go to SPARK_HOME/bin
Execute ./spark-shell
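Once the shell starts, a SparkContext is already available as sc. A minimal sanity check (a sketch, assuming the default local master):
// sc is pre-created by spark-shell
val nums = sc.parallelize(1 to 100)   // distribute a local range as an RDD
println(nums.sum())                   // action – should print 5050.0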
Spark Features :
● The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
● Spark is not a modified version of Hadoop, because it has its own cluster management.
● Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark has its own cluster-management computation, it uses Hadoop for storage purposes only.
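As a quick sketch of the in-memory idea (data.txt is just the example file used later in these slides): caching keeps an RDD in cluster memory so repeated actions avoid recomputation.
val lines = sc.textFile("data.txt").cache()   // mark the RDD to be kept in memory
println(lines.count())                        // first action reads the file and fills the cache
println(lines.count())                        // repeated action is served from memory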
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
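A minimal driver-program skeleton along these lines (the app name and local[2] master mirror the DataFrame example later in the deck):
import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  def main(args: Array[String]): Unit = {
    // The driver program owns the SparkContext, which coordinates the executors
    val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // ... define RDDs and run jobs through sc here ...
    sc.stop()
  }
}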
RDD :
● Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark.
● It is an immutable distributed collection of objects.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
● There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in external storage (e.g. a file).
● e.g. val data = Array(1, 2, 3, 4, 5)
  val distData = sc.parallelize(data)
● val distFile = sc.textFile("data.txt")
  distFile: RDD[String] = MappedRDD@1d4cee08
(Diagram: RDD in Spark vs. HDFS in Hadoop)
● RDDs support two types of operations:
✔ Transformations, which create a new dataset from an existing one, and
✔ Actions, which return a value to the driver program after running a computation on the dataset.
● For example,
✔ map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
✔ reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
● All transformations in Spark are lazy, in that they do not compute their results right away.
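A small sketch of a lazy transformation followed by an action (assuming the spark-shell sc):
val data = sc.parallelize(Array(1, 2, 3, 4, 5))
val squares = data.map(x => x * x)      // transformation – nothing is computed yet
val total = squares.reduce(_ + _)       // action – triggers the job and returns 55 to the driver
println(total)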
DataFrame :
● A DataFrame is equivalent to a relational table in Spark SQL.
● Steps to create a DataFrame :
✔ Create a SparkContext object :
– val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
– val sc = new SparkContext(conf)
✔ Create a SQLContext object (org.apache.spark.sql.SQLContext) :
– val sqlContext = new SQLContext(sc)
✔ Read data from files :
– val df = sqlContext.read.json("src/main/scala/emp.json")
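Once the JSON file is read, the resulting DataFrame can be inspected directly (a sketch; emp.json and its contents are whatever the project provides):
df.printSchema()   // schema inferred from the JSON records
df.show()          // first 20 rows in tabular form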
DataFrame and RDD :
● A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
● A DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
● An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a black box of data, which cannot be optimized because the operations that can be performed against it are not as constrained.
● However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
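A sketch of both conversions in the spark-shell (the Person case class is a made-up example; toDF needs the implicits of the active SQLContext):
// DataFrame -> RDD of Rows
val rowRdd = df.rdd

// RDD -> DataFrame, when the RDD has a tabular shape
import sqlContext.implicits._
case class Person(name: String, age: Int)
val peopleDF = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25))).toDF()
peopleDF.show()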
DataFrame Transformations :
● def orderBy(sortExprs: Column*): DataFrame
● def select(cols: Column*): DataFrame
● def show(): Unit (strictly an action – it prints the rows)
● def filter(conditionExpr: String): DataFrame
● def groupBy(cols: Column*): GroupedData
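A usage sketch chaining a few of these (the column names age, name and dept are assumptions about emp.json):
import org.apache.spark.sql.functions.col

df.filter("age > 30")                   // SQL-style condition
  .select(col("name"), col("age"))      // project two columns
  .orderBy(col("age").desc)             // sort descending
  .show()

df.groupBy(col("dept")).count().show()  // rows per department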
DataFrame Actions :
● def collect(): Array[Row]
● def collectAsList(): List[Row]
● def count(): Long
● def head(): Row
● def head(n: Int): Array[Row]
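A usage sketch of the actions (again against the emp.json DataFrame; collect pulls everything back to the driver, so it suits small results only):
import org.apache.spark.sql.Row

val all: Array[Row] = df.collect()   // every row, back on the driver
println(df.count())                  // number of rows
val first: Row = df.head()           // first row
val firstFive = df.head(5)           // first five rows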
Hive :
● Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
● It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● It stores the schema in a database and the processed data in HDFS.
● It provides an SQL-type language for querying called HiveQL or HQL.
● It is designed for OLAP.
Spark-Hive :
● Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.
● Using HiveContext, you can create and find tables in the Hive metastore and write queries on them using HiveQL.
● Users who do not have an existing Hive deployment can still create a HiveContext.
● When not configured by hive-site.xml, the context automatically creates a metastore called metastore_db and a folder called warehouse in the current directory.
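A sketch of HiveContext in use (assumes Spark was built with Hive support; the employee table and its columns are made up, and without hive-site.xml this runs against the local metastore_db described above):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING)")
hiveContext.sql("SELECT * FROM employee").show()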
Spark-Hive : (continued)
➢ Spark SQL supports queries written using HiveQL.
➢ It's an SQL-like language that produces queries that are converted to Spark jobs.
➢ HiveQL is more mature and supports more complex queries than Spark SQL.
To construct a Spark SQL query,
1) first create a SQLContext instance,
val sqlContext = new SQLContext(sc)
2) then submit the queries by calling the sql method on the SQLContext instance.
val res = sqlContext.sql("select * from employee")
To construct a HiveQL query,
1) first create a new HiveContext instance,
val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
2) then submit the queries by calling the sql method on the HiveContext instance.
val res = hiveContext.sql("select * from employee")
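A self-contained sketch that makes the SQLContext query work without a Hive deployment, by registering the emp.json DataFrame as a temporary table (the table name employee matches the query above):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("src/main/scala/emp.json")
df.registerTempTable("employee")                       // visible only to this SQLContext
val res = sqlContext.sql("select * from employee")
res.show()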
References
● http://spark.apache.org/docs/latest/sql-programming-guide.html
● http://www.tutorialspoint.com/spark_sql/spark_introduction.htm
● https://cwiki.apache.org