Spark

Features of Apache Spark
Speed
Supports multiple languages
Advanced Analytics

Iterative Operations on Spark RDD

Interactive Operations on Spark RDD

Initializing a SparkContext in Scala
 import org.apache.spark.SparkConf
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 val conf = new SparkConf().setMaster("local").setAppName("My App")
 val sc = new SparkContext(conf)
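
Once a context exists, it can be used to build and reuse an RDD. The following is a minimal sketch, not taken from the slides: the values and the name `numbers` are illustrative, and `SparkContext.getOrCreate` is used instead of `new SparkContext` so the snippet also works when a context is already running.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = SparkContext.getOrCreate(conf)

// Distribute a small local collection as an RDD and mark it for in-memory reuse.
val numbers = sc.parallelize(1 to 10).cache()

// The first action computes the RDD and caches it.
val total = numbers.reduce(_ + _)

// The second action reuses the cached partitions instead of recomputing them.
val evens = numbers.filter(_ % 2 == 0).count()

sc.stop()
```

Caching matters when the same RDD feeds several actions, which is the usual pattern in iterative and interactive workloads.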

Limitations of Hive
 Hive uses MapReduce, which performs poorly on small and medium-sized datasets (<200 GB).
 No resume capability.
 Hive cannot drop encrypted databases.
(Spark SQL, which runs on top of Spark, was built to overcome these limitations of Apache Hive.)

Features of Spark SQL
Integrated
Unified Data Access
Hive Compatibility
Standard Connectivity
Scalability

Spark SQL – Data Sources
The DataFrame interface lets Spark SQL work with many different data sources.
A DataFrame can be operated on like a normal RDD, and it can also be
registered as a temporary table. Registering a DataFrame as a
table allows you to run SQL queries over its data.
Spark SQL supports several types of data sources,
some of which are listed below:
 JSON Datasets
 Hive Tables
 Parquet Files
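
A minimal sketch of the JSON case, assuming Spark 2.2 or later (where `SparkSession` is the entry point and `spark.read.json` accepts a `Dataset[String]`). The records, the view name `people`, and the query are illustrative and not from the slides.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("DataSourcesSketch")
  .getOrCreate()
import spark.implicits._

// A tiny in-memory JSON dataset: one JSON object per string (hypothetical records).
val jsonLines = Seq(
  """{"name":"Ann","age":31}""",
  """{"name":"Bob","age":25}"""
).toDS()

// Load the JSON into a DataFrame; the schema is inferred from the records.
val people = spark.read.json(jsonLines)

// Register the DataFrame as a temporary view so SQL can refer to it by name.
people.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name FROM people WHERE age > 30")
adults.show()
```

Parquet files load in the same way through `spark.read.parquet`, and Hive tables are available through `spark.table` when the session is built with `enableHiveSupport()`.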

Data Analysis Flow Diagram
Fig: Spark SQL Flow Chart

Switching from SparkContext to SQL

Importing All Packages (to Convert an RDD to a Dataset)

Reading the File and Checking the Schema

scala> df.registerTempTable("Terror")
The DataFrame has to be registered as a temp table before SQL queries can be run on it.
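
The steps above (entering Spark SQL, importing the conversions, reading the file, checking the schema, registering the table) can be sketched end to end as follows. This assumes Spark 2.x or later, where `SparkSession` replaces the separate SQL context and `createOrReplaceTempView` replaces the deprecated `registerTempTable`. The original data file is not available here, so the sketch writes a tiny stand-in CSV; its column names (`gname`, `country_txt`, `motive`) and rows are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// SparkSession is the Spark SQL entry point; it wraps a SparkContext.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("TerrorSketch")
  .getOrCreate()
// Implicit conversions (e.g. toDF/toDS on RDDs and local collections).
import spark.implicits._

// Write a small stand-in CSV so the sketch runs without the original file.
val csvPath = Files.createTempFile("terror", ".csv")
Files.write(
  csvPath,
  "gname,country_txt,motive\nGroup A,India,Political\nGroup B,India,Unknown\n".getBytes("UTF-8"))

// Read the file, taking column names from the header row.
val df = spark.read.option("header", "true").csv(csvPath.toString)

// Check the schema before querying.
df.printSchema()

// Register the DataFrame under the table name used in the slides.
df.createOrReplaceTempView("Terror")

val rowCount = spark.sql("SELECT COUNT(*) FROM Terror").first().getLong(0)
```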

Classify attacks on the basis of gang name.

Classify attacks on India by motive.
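
The two classifications could be expressed as SQL over the registered table roughly as below. This is a sketch under assumptions: the slides do not show the actual column names or queries, so `gname`, `country_txt`, and `motive` are hypothetical, and the rows are made-up stand-ins for the real data.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("TerrorQueries")
  .getOrCreate()
import spark.implicits._

// Hypothetical rows: (group name, country, motive).
val attacks = Seq(
  ("Group A", "India", "Political"),
  ("Group A", "India", "Political"),
  ("Group B", "India", "Unknown"),
  ("Group B", "Nepal", "Political")
).toDF("gname", "country_txt", "motive")
attacks.createOrReplaceTempView("Terror")

// Number of attacks per group, most active first (ties broken by name).
val byGroup = spark.sql(
  """SELECT gname, COUNT(*) AS attacks
    |FROM Terror
    |GROUP BY gname
    |ORDER BY attacks DESC, gname""".stripMargin)

// Motives behind attacks on India, with a count for each motive.
val motivesInIndia = spark.sql(
  """SELECT motive, COUNT(*) AS attacks
    |FROM Terror
    |WHERE country_txt = 'India'
    |GROUP BY motive
    |ORDER BY attacks DESC""".stripMargin)

byGroup.show()
motivesInIndia.show()
```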

References
 Useful Links on Spark SQL
 Spark SQL Wiki - Wikipedia reference for Spark SQL.
 https://data-flair.training/blogs/spark-sql-tutorial/
 Useful Books on Spark SQL
 O'Reilly - Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia
 O'Reilly - Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
