


![Slide: Data collections in Spark](https://image.slidesharecdn.com/javanturav4-gettingstartedwithapachesparkdinkosrko-170217082127/75/Javantura-v4-Getting-started-with-Apache-Spark-Dinko-Srkoc-4-2048.jpg)

**Data collections in Spark.** Collections are immutable, distributed, partitioned across nodes, and operated on in parallel:

- Resilient Distributed Dataset (RDD)
  - The basic abstraction
  - Low-level API
  - Suited to unstructured data (media, streams of text)
- Dataset/DataFrame
  - `Dataset[T]` is the typed API; `DataFrame` (a.k.a. `Dataset[Row]`) is the untyped API
  - High-level expressions: filters/maps, aggregations, averages, SQL queries, columnar access
  - Optimizations (see the sketch below for both APIs in action)
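As a rough, self-contained illustration of the two APIs the slide contrasts (this is a generic sketch, not code from the talk), the following Scala program exercises both: a low-level RDD word count over an assumed local file `input.txt`, and a typed `Dataset` built from a hypothetical `Person` case class, queried with high-level columnar expressions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Case class at top level so Spark's implicit encoders can find it
case class Person(name: String, age: Long)

object CollectionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collections-demo")
      .master("local[*]") // run locally on all available cores
      .getOrCreate()
    import spark.implicits._

    // RDD: low-level API, suited to unstructured data such as free text
    val lines  = spark.sparkContext.textFile("input.txt") // hypothetical input path
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // word count, expressed element by element

    // Dataset[T] (typed) / DataFrame (untyped): high-level, optimized API
    val people = Seq(Person("Ana", 34), Person("Ivo", 28)).toDS()
    people.filter($"age" > 30).show() // columnar filter expression
    people.agg(avg($"age")).show()    // aggregation, e.g. an average

    counts.take(5).foreach(println)
    spark.stop()
  }
}
```

The same word count could be written against a DataFrame as well; the practical difference is that RDD operations are opaque functions to Spark, while Dataset/DataFrame expressions are visible to the query optimizer.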


This document provides an introduction to Apache Spark and Scala. Apache Spark is a general-purpose cluster computing system that is faster than Hadoop MapReduce and runs both locally and in the cloud, with high-level APIs for Scala, Python, Java, and R. The document outlines Spark's core components, including Spark SQL, MLlib, GraphX, and Spark Streaming, and describes Spark's two main data collections: RDDs for unstructured data and DataFrames/Datasets for structured data. Finally, it previews the demonstrations to be covered: the Spark shell, a notebook, streaming, and deploying a mini project to Google Cloud Dataproc.
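To give a flavour of what such a streaming demo typically looks like (a generic sketch, not the talk's actual code), the classic DStream word count reads lines from a local socket, here an assumed `localhost:9999`, which can be fed from a terminal with `nc -lk 9999`:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads, one for the receiver, one for processing
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999) // assumed host/port
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print() // print each batch's word counts to the console

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Packaged as a jar, the same kind of job can be submitted to a managed cluster such as Google Cloud Dataproc instead of a local master.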




