This document is a transcript of "Introduction to Spark", a presentation given by Javier Santos at the Scala Programming @ Madrid meetup in May 2015. It introduces key Spark concepts, including Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, and key-value RDDs, with examples of common operations such as map, filter, join, and combineByKey. It also covers Spark's execution model (drivers, executors, jobs, and stages) and deployment in local mode or on a standalone cluster.
1. Introduction to Spark
Javier Santos
May 11, 2015
Scala Programming @ Madrid
2. About me
Javier Santos @jpaniego
"There are two ways to write error-free programs; only the third one works."
Alan J. Perlis
3. Disclaimer I
● I'm not a Spark evangelist
4. Disclaimer II
● I'm not an expert
5. Disclaimer III
● I'm not buying the beers
6. What's Spark?
val mySeq: Seq[Int] = 1 to 5
val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum
val result: Int = action(mySeq) // 15
7. What's Spark?
val mySeq: Seq[Int] = 1 to Int.MaxValue
val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum
val result: Int = action(mySeq) // 1073741824... after a long wait, and with an Int overflow
8. What's Spark?
Computing problems:
● Time: waiting for action results over tons of records
● Space: how do we allocate terabytes of data?
9. downloadmoreram.com
11. What's Spark?
val mySeq: Seq[Int] = 1 to 99
// split into three chunks:
val mySeq_1 = 1 to 33
val mySeq_2 = 34 to 66
val mySeq_3 = 67 to 99
12. What's Spark?
val action: Seq[Int] => Int = (seq: Seq[Int]) => seq.sum
// the same action is applied to each chunk independently:
val mySeq_1 = 1 to 33 // action
val mySeq_2 = 34 to 66 // action
val mySeq_3 = 67 to 99 // action
13. What's Spark?
// each chunk yields a partial result...
val mySeq_1 = 1 to 33 // 561
val mySeq_2 = 34 to 66 // 1650
val mySeq_3 = 67 to 99 // 2739
// ...and the partial results are merged:
val result: Int = 4950
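The same idea can be sketched in a few lines of plain Scala (a local illustration, not the Spark API): each chunk is summed independently, and the partial results are merged afterwards.

val parts = List(1 to 33, 34 to 66, 67 to 99)
val partials = parts.map(_.sum) // List(561, 1650, 2739): each sum could run on a different machine
val result = partials.sum       // 4950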
14. What's Spark?
● General engine for large-scale data processing.
● "In-memory" data allocation.
● Target: data and computing parallelism.
● Scala, Python and Java APIs.
● First developed at the AMPLab at UC Berkeley.
● Backed by Databricks.
● Latest stable version: 1.3.1 (at the time of this talk).
15. Modules
16. Modules
17. MapReduce model
● Programming model for parallel computation over big data sets distributed among groups of machines.
● Popularized by Google (2004) for large-scale computations such as PageRank.
● Based on two classic functional programming operations: map and reduce.
● Hadoop
○ One of the first open-source implementations.
○ Many of its early contributions came from Yahoo.
18. MapReduce model
● Phases: Map → Combiner → Partitioner → Reduce
19. MapReduce model
● Example (Map): within each partition, every input record is mapped to a (model, 1) pair.
"123;SamsungGalaxyIII;14:05;300;34612345678"
"124;LGNexus4;16:05;121;+34613455678"
"126;LGNexus4;12:05;23;+3463624678"
"131;NokiaLumia;14:05;300;+34613246778"
"125;MotorolaG2;14:05;300;+34612345678"
"127;NokiaLumia;14:05;300;+34612345678"
"130;NokiaLumia;14:05;300;+34612345678"
→
(SamsungGalaxyIII, 1)
(LGNexus4, 1)
(LGNexus4, 1)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 1)
(NokiaLumia, 1)
20. MapReduce model
● Example (Combiner, a local aggregator): pairs with the same key are pre-aggregated within each partition.
(SamsungGalaxyIII, 1)
(LGNexus4, 1)
(LGNexus4, 1)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 1)
(NokiaLumia, 1)
→
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 2)
21. MapReduce model
● Example (Partitioner): pairs are routed so that all values for a given key end up together.
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 1)
(MotorolaG2, 1)
(NokiaLumia, 2)
→
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 3)
(MotorolaG2, 1)
22. MapReduce model
● Example (Reduce): a reduce function folds the pairs down to the final answer, here the most frequent model.
(SamsungGalaxyIII, 1)
(LGNexus4, 2)
(NokiaLumia, 3)
(MotorolaG2, 1)
→ (NokiaLumia, 3)
ReduceFunction:
f(t1: (String, Int), t2: (String, Int)) =
  if (t1._2 > t2._2) t1 else t2
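The whole pipeline can also be sketched with plain Scala collections (a local illustration of the model, not the Hadoop API), using a few of the sample records above:

val records = List(
  "123;SamsungGalaxyIII;14:05;300;34612345678",
  "124;LGNexus4;16:05;121;+34613455678",
  "126;LGNexus4;12:05;23;+3463624678",
  "131;NokiaLumia;14:05;300;+34613246778")

val mapped = records.map(r => (r.split(";")(1), 1))          // Map: (model, 1) per record
val counts = mapped.groupBy(_._1).mapValues(_.map(_._2).sum) // Combiner/Partitioner: counts per model
val mostFrequent = counts.maxBy(_._2)                        // Reduce: keep the most frequent model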
23. So... what about Spark?
24. Deployment terms
(diagram: the Driver talks to the Master of a Spark cluster, which coordinates several Workers)
25. Deployment types
● Local
○ master = "local[N]" (N = number of cores)
○ The Spark master is launched in the same process.
○ Workers are launched in the same process.
● Standalone
○ master = "spark://master-url"
○ The Spark master is launched on a cluster machine.
○ The application JAR is submitted to the worker nodes.
○ Alternative resource managers: YARN, Mesos.
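The master URL is what you pass when building the context. A minimal sketch (the application name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-intro-demo") // hypothetical application name
  .setMaster("local[4]")          // local mode with 4 cores; use "spark://master-url:7077" for standalone
val sc = new SparkContext(conf)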
26. Execution terms
(diagram: the Driver, holding the SparkContext, coordinates several Executors)
27. Execution terms: SparkContext & Executor
● SparkContext
○ The connection to the Spark cluster.
○ Required for building SQLContext, StreamingContext, ...
○ Isolated: two SparkContexts cannot exchange data (without external help).
● Executor
○ Individual execution unit.
○ Roughly 1 core ~ 2 executors.
○ Each application has its own executors.
28. Job terms
(diagram: a Job is divided into Stages, and each Stage into Tasks)
29. Modules
30. RDD
● Resilient Distributed Dataset: the basic abstraction in Spark.
● Think of it as a huge collection, partitioned and distributed across different machines.
● Lazily evaluated.
● Basically composed of:
○ A list of partitions
○ A function for computing each split
○ A list of dependencies on other RDDs
31. RDD
● Basic example:
val myRdd: RDD[String] = sc.parallelize(
  "this is a sample string".split(" ").toList)
val anotherRdd: RDD[(Int, String)] = sc.parallelize(
  List(
    0 -> "A value for key '0'",
    1 -> "Another value"))
32. RDD
● An RDD may be created from:
○ A file / set of files:
sc.textFile("myFile")
sc.textFile("file1,file2")
○ A bunch of in-memory data:
sc.parallelize(List(1,2,3))
○ Another RDD:
myRdd.map(_.toString)
33. RDD - Partitions
● Partition: a chunk of an RDD.
● A worker node may hold one or more partitions of an RDD.
● Choosing a well-performing partitioning is important.
● Repartitioning
○ Implies shuffling all RDD data among the worker nodes.
○ That means tons of network traffic among all worker nodes!
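For instance (a minimal sketch):

val rdd = sc.parallelize(1 to 100, 4) // explicitly ask for 4 partitions
println(rdd.partitions.size)          // 4
val wider = rdd.repartition(8)        // full shuffle of all data across the workers
val narrower = rdd.coalesce(2)        // fewer partitions, avoiding a full shuffle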
34. RDD - Lineage
● An RDD may be built from another one.
● Base RDD: has no parent.
● Example:
val base: RDD[Int] = sc.parallelize(List(1,2,3,4))
val even: RDD[Int] = base.filter(_ % 2 == 0)
● DAG (Directed Acyclic Graph): a generalization of the MapReduce model. It manages RDD lineages (source, mutations, ...).
35. RDD - Lineage
36. RDD - Transformations
● A transformation applies a change to some RDD, returning a new one.
● It's not immediately evaluated.
● Think of it like a "call-by-name" function.
● It adds a new node to the lineage DAG.
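Laziness is easy to observe (a small sketch; the println is only there to show when evaluation happens):

val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map { n => println(s"computing $n"); n * 2 } // nothing printed yet: only the DAG grows
doubled.count() // the action triggers the computation; the printlns now run on the executors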
37. RDD - Transformations
● Most frequently used:
○ map
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myStringRDD: RDD[String] = myRDD.map(_.toString)
○ flatMap
// Note: a Scala Int can never be null, so nullable values need java.lang.Integer
val myRDD: RDD[Integer] = sc.parallelize(List[Integer](1, null, 3, null))
val myNotNullInts: RDD[Int] = myRDD.flatMap(n => Option(n).map(_.intValue))
38. RDD - Transformations
● Most frequently used:
○ filter
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4))
val myEvenRDD: RDD[Int] = myRDD.filter(_ % 2 == 0)
○ union
val odd: RDD[Int] = sc.parallelize(List(1,3,5))
val even: RDD[Int] = sc.parallelize(List(2,4,6))
val all: RDD[Int] = odd.union(even)
39. RDD - Actions
● An action launches the RDD evaluation,
○ returning some data to the driver,
○ or persisting it to some external storage system.
● It's evaluated partially on each worker node, and the results are merged in the driver.
● There are mechanisms (like persist() or cache()) that make it cheaper to compute the same RDD several times.
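For instance (a sketch; "events.log" is a made-up path):

val errors = sc.textFile("events.log").filter(_.contains("ERROR")).cache()
val total = errors.count()  // first action: reads the file, filters it, and caches the result
val sample = errors.take(5) // second action: served from the cached partitions, no re-read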
40. RDD - Actions
● Examples:
○ count
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val size: Long = myRdd.count()
Counting implies processing the whole RDD.
○ take
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val List(first, second) = myRdd.take(2).toList
○ collect
val myRdd: RDD[Int] = sc.parallelize(List(1,2,3))
val data: Array[Int] = myRdd.collect()
Beware! Executing 'collect' on big collections may exhaust the driver's memory.
41. Demo 1: Most retweeted
● Most-retweeted example.
● A bunch of tweets (1000 JSON records).
● Find out which tweet was retweeted the most.
42. Key-Value RDDs
● A particular case of RDD[T] where T = (U, V).
● Allows grouping, combining and aggregating values by key.
● In Scala you only need to
import org.apache.spark.SparkContext._
● In Java, you have to use the JavaPairRDD class.
43. Key-Value RDDs
● keyBy: generates a PairRDD
val myRDD: RDD[Int] = sc.parallelize(List(1,2,3,4,5))
val kvRDD: RDD[(String, Int)] =
  myRDD.keyBy(n => if (n % 2 == 0) "even" else "odd")
// "odd" -> 1, "even" -> 2, "odd" -> 3, "even" -> 4, "odd" -> 5
44. Key-Value RDDs
● keys: gets the keys from a PairRDD
val myRDD: RDD[(String, Int)] =
  sc.parallelize(List("odd" -> 1, "even" -> 2, "odd" -> 3))
val keysRDD: RDD[String] = myRDD.keys
// "odd", "even", "odd"
● values: gets the values from a PairRDD
45. Key-Value RDDs
● mapValues: maps the PairRDD values, leaving the keys untouched
val myRDD: RDD[(String, Int)] =
  sc.parallelize(List("odd" -> 1, "even" -> 2, "odd" -> 3))
val mapRDD: RDD[(String, String)] = myRDD.mapValues(_.toString)
// "odd" -> "1", "even" -> "2", "odd" -> "3"
● flatMapValues: flatMaps the PairRDD values
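flatMapValues in a nutshell (a sketch with made-up data):

val csv: RDD[(String, String)] = sc.parallelize(List("a" -> "1,2", "b" -> "3"))
val exploded: RDD[(String, String)] = csv.flatMapValues(_.split(","))
// "a" -> "1", "a" -> "2", "b" -> "3": the key is repeated for each produced value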
46. Key-Value RDDs
● join: returns a new RDD with both RDDs joined by key.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.join(d).collect
res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
47. Key-Value RDDs - combineByKey
● combineByKey:
def combineByKey[C](
  createCombiner: V => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C): RDD[(K, C)]
● Think of it as something-like-but-not a foldLeft over each partition.
48. Key-Value RDDs - combineByKey
● It's composed of:
○ createCombiner (V => C): defines how to turn an initial V value of the RDD into the type used for aggregating values (C). This is called the combiner.
○ mergeValue ((C, V) => C): defines how to aggregate an initial V value into our combiner type C, returning a new combiner.
○ mergeCombiners ((C, C) => C): defines how to merge two combiners into a new one.
49. Key-Value RDDs - combineByKey
● Example:
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(
  List(_),
  (x: List[String], y: String) => y :: x,
  (x: List[String], y: List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
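A classic use case is a per-key average, where the combiner carries a running (sum, count) pair (a sketch with made-up data):

val scores = sc.parallelize(List("a" -> 1.0, "b" -> 4.0, "a" -> 3.0))
val avgByKey = scores
  .combineByKey(
    (v: Double) => (v, 1),                                              // createCombiner: first value for a key
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold in one more value
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge partition results
  .mapValues { case (sum, count) => sum / count }
// avgByKey.collect => Array((a,2.0), (b,4.0))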
50. Key-Value RDDs - aggregateByKey
● aggregateByKey: aggregates the values of each key, using the given combine functions and a neutral "zero value".
● Beware! The zero value is evaluated in each partition.
def aggregateByKey[U: ClassTag](zeroValue: U)(
  seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)] =
  combineByKey(
    (v: V) => seqOp(zeroValue, v),
    seqOp,
    combOp)
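For instance, a per-key sum (a minimal sketch):

val pairs = sc.parallelize(List("a" -> 1, "b" -> 2, "a" -> 3))
val sums = pairs.aggregateByKey(0)(_ + _, _ + _) // zero value 0; plain addition within and across partitions
// sums.collect => Array((a,4), (b,2))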
51. Key-Value RDDs - groupByKey
● groupByKey: groups all values with the same key into an Iterable of values.
def groupByKey(): RDD[(K, Iterable[V])] =
  combineByKey[List[V]](
    (v: V) => List(v),
    (list: List[V], v: V) => list :+ v,
    (l1: List[V], l2: List[V]) => l1 ++ l2)
Note: groupByKey actually uses a CompactBuffer instead of a List.
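Usage looks like this (a minimal sketch):

val pairs = sc.parallelize(List("a" -> 1, "b" -> 2, "a" -> 3))
val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()
// "a" -> values 1 and 3, "b" -> value 2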
52. Modules
54. SparkSQL
● Successor of Shark (Berkeley)
○ An alternative to Hive + Hadoop
● Query language
○ HiveQL
○ Future: full SQL92 support
● SQLContext
val sqlContext = new SQLContext(sparkContext)
val hiveContext = new HiveContext(sparkContext)
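A taste of the Spark 1.3-era API (a sketch; the file name and the retweet_count field are made up):

val tweets = sqlContext.jsonFile("tweets.json") // jsonFile was the 1.x API for loading JSON
tweets.registerTempTable("tweets")
val mostRetweeted = sqlContext.sql(
  "SELECT text, retweet_count FROM tweets ORDER BY retweet_count DESC LIMIT 1")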
55. SparkSQL
● Different datasources
○ JSON
○ Parquet
○ CSV
● New implementations
○ Cassandra
○ ElasticSearch
○ MongoDB
● Unified API in Spark 1.3.0:
val students: DataFrame = sqlContext.load(
  "students", "org.apache.spark.sql.parquet", Map(...))
56. SparkSQL - DataFrame
● DataFrame = RDD[org.apache.spark.sql.Row] + schema
● Rows hold the column values; the schema describes the column names and types.
● This allows unifying multiple datasources under the same API: e.g., we could join two tables, one stored in MongoDB and the other in ElasticSearch.
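The result is a uniform, column-oriented API regardless of where the data lives (a sketch; the students table and its columns are made up):

val students: DataFrame = sqlContext.table("students") // hypothetical registered table
students
  .select("name", "grade")
  .filter(students("grade") > 7) // columns are addressed by name, thanks to the schema
  .show()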
57. Demo 2: Most retweeted (SparkSQL)
● Most-retweeted example (SparkSQL).
● The same bunch of tweets (1000 JSON records).
● Find out which tweet was retweeted the most, using SparkSQL.
58. Who uses it?
59. Let's talk about...
● Spark Streaming
● Machine learning (MLlib)
● GraphX vs Neo4j
● Spark vs Hadoop
● ?