Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group

Talk given at The Amsterdam Applied Machine Learning meetup group.

Published in: Technology, Education

  1. Apache Spark for applied machine learning. Friso van Vollenhoven (@fzk, frisovanvollenhoven@godatadriven.com), GoDataDriven, proudly part of the Xebia Group.
  2. This talk is about tools.
  3. Resilient Distributed Dataset
     • Immutable set of records (e.g. tuples)
     • Distributed across a cluster of workers
     • Stored in RAM or on disk (partially)
     • Built through transformations
     • Automatically rebuilt on failure
     • Possibly replicated
  4. Operations
     • Operate on RDDs
     • Create a new RDD
     • Or materialise the RDD and return data
     • Transformations: map, filter, groupBy, etc.
     • Actions: count, collect, reduce, save, etc.
  5. The good parts
     • Language bindings for Java, Scala and Python
     • Works interactively from a shell: Scala + IPython (notebook)
     • Plays nice with Hadoop
     • Deploy on top of YARN cluster manager
     • Read data from HDFS
     • Hadoop-like fault tolerance
  6. The better part? https://github.com/Bridgewater/scala-notebook
  7. https://github.com/Sotera/spark-distributed-louvain-modularity
  8. GoDataDriven. We’re hiring / Questions? / Thank you! Friso van Vollenhoven (@fzk, frisovanvollenhoven@godatadriven.com)
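
The split between transformations and actions on the "Resilient Distributed Dataset" and "Operations" slides can be illustrated with a toy, single-process model in plain Python. This is a sketch under stated assumptions, not Spark's API: the `ToyRDD` class, its `lineage` attribute and `from_records` constructor are hypothetical names invented for this example, and nothing here is distributed. It only shows that transformations describe a new dataset without doing work, that actions materialise the data, and that a dataset can be rebuilt by replaying the steps that define it.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute, lineage):
        self._compute = compute  # zero-argument function that yields the records
        self.lineage = lineage   # names of the steps that build this dataset

    @classmethod
    def from_records(cls, records):
        data = tuple(records)    # immutable snapshot of the source records
        return cls(lambda: iter(data), ["source"])

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()),
                      self.lineage + ["map"])

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)),
                      self.lineage + ["filter"])

    # Actions: replay the whole chain and return plain Python data.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return fold(fn, self._compute())


calls = []  # records every invocation of square, to make laziness visible


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.from_records(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # transformations alone ran nothing
result = even_squares.collect()    # first action: the chain runs once
calls_after_collect = len(calls)
total = even_squares.reduce(lambda a, b: a + b)  # second action: chain replayed
calls_after_reduce = len(calls)
```

In this toy model every action recomputes the dataset from its source by replaying the recorded steps, which is a simplified picture of the "automatically rebuilt on failure" point. Real Spark additionally partitions the records across workers and can keep intermediate results in RAM or on disk, which this sketch does not attempt.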