SlideShare a Scribd company logo
Marko Velic PhD
Data Science Department
Styria Medijski Servisi d.o.o.
marko.velic@styria.hr
UNSUPERVISED LEARNING
(WITH SPARK)
CONTENTS
 Distances
• Eucledian
• Manhattan
• Mahalanobis
• Cosine Similarity
 Clustering
• K-Means
• Example (Spark)
 Examples from Styria practice (not Spark – for now)
10.03.2016 2
MACHINE LEARNING
10.03.2016 3
UNSUPERVISED LEARNING
 Opservations are not assigned to classes
 Computer program is not ‘supervised’
throughout the learning process
 Usually the task is to find ‘meaningful’
groups within data
 Decision is made based on distances i.e.
similarities among data points
10.03.2016 4
DISTANCES
10.03.2016 5
• To decide upon the groups we have to introduce
similarity measure or contrary – a distance measure
• Pitagora’s theorem – Euclidean distance
• dist((2, -1), (-2, 2))= √((2 - (-2))² + ((-1) - 2)²) = √((2 + 2)² + (-1 -
2)²) = √((4)² + (-3)²) = √(16 + 9) = √25 = 5
DISTANCES & APPROACHES
10.03.2016 6
Source:
http://en.wikipedia.org/wiki/Man
hattan_distance
 Manhattan/Cityblock/Taxicab
• dist((x, y), (a, b)) = |x - a| + |y - b|
 Normalization!
 Mahalanobis – considers variance
• “multidimensional z-score”
 Cosine similarity
 Autoencoders – ‘unsupervised’ neural nets
 Non-unsupervised but based on distances
• ReliefF measure, KNN classifier ... etc...
K-MEANS
7
Simplified:
1. Randomly place
centroids
2. Find the closest
3. Put centroid in the
middle
4. GOTO 2
Image source:
http://www.javabeat.net/2011/05/k-means-
clustering-algorithms-in-mahout/
DEMO (SPARK!)
 K-means clustering of photos (ie.
their vector representations)
 Convolutional neural network as
a supervised model and its
outputs as features for
unsupervised models
 Vector representations after the
pooling layers, after every
convolutional layer (Caffe)
 Clustering in Spark
8
T-SNE CLUSTER VISUALIZATION
9
SEMI-MANUAL CLUSTERING OF PHOTOS
10Gruping photos based in visual features, Enes Deumić, Styria Data Science Team
SEMI-MANUAL CLUSTERING OF PHOTOS
11Gruping photos based in visual features, Enes Deumić, Styria Data Science Team
NATURAL LANGUAGE PROCESSING
10.03.2016 12
T-sne concept visualization; vecernji.hr, Styria Data Science Team
AUTOMATIC (LEARNED) HIERARCHIES
13
Hierarchical clustering, Florijan Stamenković, Styria Data Science Team
VISUAL SEARCH EXAMPLE
14
CONCLUSION
 Distances
• Eucledian
• Manhattan
• Mahalanobis
• Cosine Similarity
 Clustering
• K-Means
 We can nicely combine supervised and unsupervised
features
 SparkNet: Training Deep Networks in Spark
http://arxiv.org/pdf/1511.06051v4.pdf
 https://news.developer.nvidia.com/caffe-on-spark-for-
deep-learning-from-yahoo/
10.03.2016 15
THANK YOU!
CONCLUSION
10.03.2016 17

More Related Content

Similar to Unsupervised learning with Spark

Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Sergey Karayev
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 
Lalal
LalalLalal
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
Databricks
 
A comprehensive survey of contemporary
A comprehensive survey of contemporaryA comprehensive survey of contemporary
A comprehensive survey of contemporary
prjpublications
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
Giorgio Carbone
 
image_segmentation_ppt.pptx
image_segmentation_ppt.pptximage_segmentation_ppt.pptx
image_segmentation_ppt.pptx
fgdg12
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
Pier Luca Lanzi
 
Poggi analytics - clustering - 1
Poggi   analytics - clustering - 1Poggi   analytics - clustering - 1
Poggi analytics - clustering - 1
Gaston Liberman
 
Deep Learning AtoC with Image Perspective
Deep Learning AtoC with Image PerspectiveDeep Learning AtoC with Image Perspective
Deep Learning AtoC with Image Perspective
Dong Heon Cho
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
Md Abul Hayat
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
Albert Y. C. Chen
 
Digital image classification22oct
Digital image classification22octDigital image classification22oct
Digital image classification22oct
Aleemuddin Abbasi
 
Mathematics online: some common algorithms
Mathematics online: some common algorithmsMathematics online: some common algorithms
Mathematics online: some common algorithms
Mark Moriarty
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
REHMAT ULLAH
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
YaswanthHariKumarVud
 
Introduction talk to Computer Vision
Introduction talk to Computer Vision Introduction talk to Computer Vision
Introduction talk to Computer Vision
Chen Sagiv
 
My MS defense
My MS defenseMy MS defense
My MS defense
Alexander Shyrokov
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
Jeff Hammerbacher
 
Facilitating Data Curation: a Solution Developed in the Toxicology Domain
Facilitating Data Curation: a Solution Developed in the Toxicology DomainFacilitating Data Curation: a Solution Developed in the Toxicology Domain
Facilitating Data Curation: a Solution Developed in the Toxicology Domain
Christophe Debruyne
 

Similar to Unsupervised learning with Spark (20)

Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Lalal
LalalLalal
Lalal
 
Astronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache SparkAstronomical Data Processing on the LSST Scale with Apache Spark
Astronomical Data Processing on the LSST Scale with Apache Spark
 
A comprehensive survey of contemporary
A comprehensive survey of contemporaryA comprehensive survey of contemporary
A comprehensive survey of contemporary
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
image_segmentation_ppt.pptx
image_segmentation_ppt.pptximage_segmentation_ppt.pptx
image_segmentation_ppt.pptx
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Poggi analytics - clustering - 1
Poggi   analytics - clustering - 1Poggi   analytics - clustering - 1
Poggi analytics - clustering - 1
 
Deep Learning AtoC with Image Perspective
Deep Learning AtoC with Image PerspectiveDeep Learning AtoC with Image Perspective
Deep Learning AtoC with Image Perspective
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Digital image classification22oct
Digital image classification22octDigital image classification22oct
Digital image classification22oct
 
Mathematics online: some common algorithms
Mathematics online: some common algorithmsMathematics online: some common algorithms
Mathematics online: some common algorithms
 
Artificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKSArtificial intelligence NEURAL NETWORKS
Artificial intelligence NEURAL NETWORKS
 
Density based clustering
Density based clusteringDensity based clustering
Density based clustering
 
Introduction talk to Computer Vision
Introduction talk to Computer Vision Introduction talk to Computer Vision
Introduction talk to Computer Vision
 
My MS defense
My MS defenseMy MS defense
My MS defense
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Facilitating Data Curation: a Solution Developed in the Toxicology Domain
Facilitating Data Curation: a Solution Developed in the Toxicology DomainFacilitating Data Curation: a Solution Developed in the Toxicology Domain
Facilitating Data Curation: a Solution Developed in the Toxicology Domain
 

Recently uploaded

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 

Recently uploaded (20)

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 

Unsupervised learning with Spark

  • 1. Marko Velic PhD Data Science Department Styria Medijski Servisi d.o.o. marko.velic@styria.hr UNSUPERVISED LEARNING (WITH SPARK)
  • 2. CONTENTS  Distances • Eucledian • Manhattan • Mahalanobis • Cosine Similarity  Clustering • K-Means • Example (Spark)  Examples from Styria practice (not Spark – for now) 10.03.2016 2
  • 4. UNSUPERVISED LEARNING  Opservations are not assigned to classes  Computer program is not ‘supervised’ throughout the learning process  Usually the task is to find ‘meaningful’ groups within data  Decision is made based on distances i.e. similarities among data points 10.03.2016 4
  • 5. DISTANCES 10.03.2016 5 • To decide upon the groups we have to introduce similarity measure or contrary – a distance measure • Pitagora’s theorem – Euclidean distance • dist((2, -1), (-2, 2))= √((2 - (-2))² + ((-1) - 2)²) = √((2 + 2)² + (-1 - 2)²) = √((4)² + (-3)²) = √(16 + 9) = √25 = 5
  • 6. DISTANCES & APPROACHES 10.03.2016 6 Source: http://en.wikipedia.org/wiki/Man hattan_distance  Manhattan/Cityblock/Taxicab • dist((x, y), (a, b)) = |x - a| + |y - b|  Normalization!  Mahalanobis – considers variance • “multidimensional z-score”  Cosine similarity  Autoencoders – ‘unsupervised’ neural nets  Non-unsupervised but based on distances • ReliefF measure, KNN classifier ... etc...
  • 7. K-MEANS 7 Simplified: 1. Randomly place centroids 2. Find the closest 3. Put centroid in the middle 4. GOTO 2 Image source: http://www.javabeat.net/2011/05/k-means- clustering-algorithms-in-mahout/
  • 8. DEMO (SPARK!)  K-means clustering of photos (ie. their vector representations)  Convolutional neural network as a supervised model and its outputs as features for unsupervised models  Vector representations after the pooling layers, after every convolutional layer (Caffe)  Clustering in Spark 8
  • 10. SEMI-MANUAL CLUSTERING OF PHOTOS 10Gruping photos based in visual features, Enes Deumić, Styria Data Science Team
  • 11. SEMI-MANUAL CLUSTERING OF PHOTOS 11Gruping photos based in visual features, Enes Deumić, Styria Data Science Team
  • 12. NATURAL LANGUAGE PROCESSING 10.03.2016 12 T-sne concept visualization; vecernji.hr, Styria Data Science Team
  • 13. AUTOMATIC (LEARNED) HIERARCHIES 13 Hierarchical clustering, Florijan Stamenković, Styria Data Science Team
  • 15. CONCLUSION  Distances • Eucledian • Manhattan • Mahalanobis • Cosine Similarity  Clustering • K-Means  We can nicely combine supervised and unsupervised features  SparkNet: Training Deep Networks in Spark http://arxiv.org/pdf/1511.06051v4.pdf  https://news.developer.nvidia.com/caffe-on-spark-for- deep-learning-from-yahoo/ 10.03.2016 15