4. Spark Features
• Spark is a general computation engine that uses distributed
memory to perform fault-tolerant computations across a cluster
• Speed
• Ease of use
• Analytics
• Environments that require:
• Large datasets
• Low-latency processing
• Spark can perform iterative computations at scale (in memory),
which opens up the possibility of executing machine learning
algorithms much faster than with Hadoop MapReduce (disk-based) [2] [4].
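The in-memory advantage for iterative algorithms can be sketched as follows. This is a minimal sketch assuming the Spark shell (which provides a SparkContext named `sc`); the input file `numbers.txt` is a hypothetical dataset of one number per line.

```scala
// Load once and cache in distributed memory; each iteration below
// re-reads the cached partitions from memory rather than from disk,
// which is where the speedup over disk-based Hadoop MapReduce comes from.
val nums = sc.textFile("numbers.txt").map(_.toDouble).cache()

var w = 0.0
for (i <- 1 to 10) {
  // simple gradient step toward the mean of the dataset
  val grad = nums.map(x => w - x).sum() / nums.count()
  w -= 0.5 * grad
}
```

In Hadoop MapReduce, each of these ten iterations would be a separate job that re-reads its input from HDFS; with `cache()`, Spark pays the load cost once.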
ITV-DS, Applied Computing Group.
Sergio Viademonte, PhD.
5. Spark Features
• Computational engine responsible for:
• Scheduling
• Distributing
• Monitoring
applications consisting of many computational tasks across a
computational cluster.
• From an engineering perspective, Spark hides the complexity of:
• distributed systems programming
• network communication
• and fault tolerance.
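To illustrate how much complexity is hidden, here is a distributed word count, again assuming the Spark shell's `sc`; the output path `counts` is a hypothetical directory name.

```scala
// A cluster-wide word count in a few lines. Spark schedules the tasks,
// ships them to worker nodes, moves data over the network for the
// shuffle in reduceByKey, and re-executes failed tasks -- none of that
// machinery appears in the user code.
val counts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("counts")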
9. Spark Ecosystem
§ Resilient Distributed Datasets (RDDs)
• Spark's main programming abstraction for working with data.
• RDDs represent a fault-tolerant collection of elements
distributed across many compute nodes that can be
manipulated in parallel.
• Spark Core provides many APIs for building and manipulating
these collections.
• All work is expressed as:
• creating new RDDs
• transforming existing RDDs – transformations return pointers to new RDDs
• calling actions on RDDs – actions return values to the driver
Ex: val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
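Continuing the example above, the transformation/action distinction can be seen directly in the shell. The filter predicate below is an illustrative choice, not from the slide:

```scala
// A transformation returns a new RDD; no data is read yet (evaluation is lazy).
val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// An action triggers the actual computation and returns a value
// (here a Long) to the driver program.
val n = linesWithSpark.count()
```

Nothing touches `README.md` until `count()` runs; chains of transformations are only materialized when an action asks for a result.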