High-level API
DataFrame abstraction
Intended to be used by data scientists without Spark knowledge (see the sketch below)
But Spark expertise is needed to achieve good performance
Great performance
Parallel and scalable
In-memory caching and computing
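As an illustration of the high-level API, a minimal sketch assuming Spark 2.3+ (where ClusteringEvaluator defaults to the squared-Euclidean Silhouette); the data path is the sample file shipped with Spark and is purely illustrative:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("silhouette-demo").getOrCreate()

// Any DataFrame with a vector "features" column works here.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Cluster the data, then score the clustering: no Spark internals needed.
val predictions = new KMeans().setK(3).setSeed(1L).fit(dataset).transform(dataset)
val evaluator = new ClusteringEvaluator()  // default metric: Silhouette (squared Euclidean)
println(s"Silhouette = ${evaluator.evaluate(predictions)}")
```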
With the previous equation, each point's Silhouette coefficient can be computed without computing its distance to all the other points
We precompute the values needed for each cluster (i.e. we precompute our state)
For each cluster we need to compute two constants and one vector (recapped below)
We can assume the number of clusters is rather small
Hence, our shared state is small
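For reference, a sketch of the expansion this relies on, for the squared Euclidean distance (the notation here is mine and may differ from the equation earlier in the deck). The two constants per cluster are the point count N_Γ and the sum of the squared norms; the vector is the element-wise sum of the points:

```latex
\frac{1}{N_\Gamma} \sum_{Y \in \Gamma} \lVert X - Y \rVert^2
  = \lVert X \rVert^2
  - \frac{2}{N_\Gamma}\, X^{\top} \underbrace{\sum_{Y \in \Gamma} Y}_{\text{the vector}}
  + \frac{1}{N_\Gamma} \underbrace{\sum_{Y \in \Gamma} \lVert Y \rVert^2}_{\text{a constant}}
```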
For each point, we evaluate the above formula for every cluster and, with these average distances, we compute the point's Silhouette coefficient
Finally, the average of all the Silhouette coefficients is computed (see the sketch below)
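A minimal sketch of these two steps (illustrative code of mine, not Spark's actual internals; `ClusterState`, `computeClusterStates`, and `meanSilhouette` are hypothetical names; assumes a DataFrame with `prediction` and `features` columns, at least two clusters, and squared Euclidean distances):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}

// Per-cluster state: two constants (count, sum of squared norms) and one vector.
case class ClusterState(count: Long, sqNormSum: Double, pointSum: Array[Double])

// Pass 1: one aggregation over the whole dataset, O(N D / W).
def computeClusterStates(predictions: DataFrame): Map[Int, ClusterState] =
  predictions.select("prediction", "features").rdd
    .map { case Row(c: Int, v: Vector) =>
      val arr = v.toArray
      c -> ClusterState(1L, arr.map(x => x * x).sum, arr)
    }
    .reduceByKey { (a, b) =>
      ClusterState(a.count + b.count, a.sqNormSum + b.sqNormSum,
        a.pointSum.zip(b.pointSum).map { case (x, y) => x + y })
    }
    .collectAsMap().toMap

// Pass 2: per-point Silhouette from the broadcast state, then the mean, O(N C D / W).
def meanSilhouette(predictions: DataFrame,
                   state: Broadcast[Map[Int, ClusterState]]): Double =
  predictions.select("prediction", "features").rdd
    .map { case Row(c: Int, v: Vector) =>
      val arr = v.toArray
      val sqNorm = arr.map(x => x * x).sum
      // Average squared distance from this point to cluster k, in O(D).
      def avgDist(k: Int): Double = {
        val s = state.value(k)
        val dot = arr.zip(s.pointSum).map { case (x, y) => x * y }.sum
        sqNorm - 2.0 * dot / s.count + s.sqNormSum / s.count
      }
      val own = state.value(c)
      if (own.count == 1) 0.0  // convention for singleton clusters
      else {
        // Rescale to exclude the point itself from its own cluster's average.
        val a = avgDist(c) * own.count / (own.count - 1)
        val b = state.value.keys.filter(_ != c).map(avgDist).min
        (b - a) / math.max(a, b)
      }
    }
    .mean()
```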
Thus, the computational complexity of the needed steps (N points, D features, C clusters, W parallel workers) is:
O(N D / W) to precompute the cluster state: a one-pass aggregation over the entire dataset
O(N C D / W) to compute, for each point, the average distance to every cluster
O(N / W) to average the coefficients: a one-pass aggregation over the entire dataset
We need two passes over the dataset:
one to precompute the state
one to compute the coefficients and their average (tied together in the snippet below)
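Tying the two passes together (reusing `spark`, `predictions`, and the hypothetical helpers from the sketches above):

```scala
val state  = computeClusterStates(predictions)    // pass 1: one aggregation over the dataset
val bState = spark.sparkContext.broadcast(state)  // the small shared state, one entry per cluster
val score  = meanSilhouette(predictions, bState)  // pass 2: per-point coefficients and their mean
```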
The comparison is fair: no parallelism is exploited, so the speedup comes only from the lower computational complexity.
This holds for small datasets; our implementation also makes it possible to compute the Silhouette over larger ones.