SlideShare a Scribd company logo
Hussein AL-NATSHEH, CIAPPLE
Jordan Open Source Association
Amman, Aug 22, 2015
Hussein AL-NATSHEH ©2015
 From supervised to unsupervised learning
 Identifying a ML problem?
 Recommended python libraries
 Data preparation
 Case study; Author name disambiguation of CERN digital library (use case)
 References, recommended readings and courses
 A flat partition (set of clusters or segments)
 A hierarchical tree or taxonomy (a set of nested partitions)
 Hard or soft (or fuzzy) memberships to clusters
 K-Means, Expectation Maximization
 Hierarchal Agglomerative Clustering (HAC)
 Density-based spatial clustering of applications with noise (DBSCAN)
 Graph-based clustering (Modularity, Weighted communities, Clique percolation)
 Unsupervised Neural Networks (Adaptive Resonance Theory)
Source: http://scikit-learn.org/stable/modules/clustering.html#clustering
Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
 Graph methods
 Single link
 Complete link
 Average
 Weighted
 Geometric methods
 Ward
 Centroid
 Median
Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
 Some distance functions
 Euclidean distance
 Jaccard Similarity
 Cosine Similarity
 Edit distance
 Distance estimator
 Model an estimator to be used with a predict probability function
 Better for merging feature importances
 Dissimilarity (affinity) matrix
Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
 Single Link method
import scipy.cluster.hierarchy as hac
def __init__(self,method="single",affinity="euclidean“)
self.linkage_ = hac.linkage(X,
method=self.method,
metric=self.affinity)
Image source: Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
scipy.cluster.hierarchy.dendrogram()
Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
 Flat cutting forming partitioned clustering from HAC
 Semi-supervised clustering
 DBSCAN
labels = hac.fcluster(self.linkage_,threshold)
1 and 5 must be in the same cluster  n_clust = 2
Image Source: Matlab
 Classification, recognition and completion
 Decision trees
 Supervised Neural Networks
 Graph models: Hidden Markov Model, Bayesian Nets
 Optimization
 Genetic Algorithms
 Numerical Optimization and regression
 Smart Control
 Fuzzy rules
 Clustering, grouping and segmentation
 Unsupervised Neural Networks
 HAC, K-Means, DBSCAN, Graph-based clustering
 Simulation
 Multi-agent systems
 Relations
 Graph networks
 Collaborative filtering
 Association rules and frequent sets
 Text mining: Topic and concept extraction  Unsupervised: Clustering, Graph and
ontologies. Semi-supervised
 Pattern recognition: Character, speech and image recognition  Supervised: Classification
 Cross selling and personalized recommendation  Collaborative filtering, Association
rules, frequent sets
 Churn management and customer retention  classification and clustering
 Games  search for optimal solution and multi-agent systems
 Financials: risk management in insurance fraud detection  Graph, classification,
clustering
 Security: Intrusion detection, spam filtering  Supervised: Classification
 Social networks: friends suggestion  Graph
 Bioinformatics (DNA)
 High energy physics (Higgs)
 Computer vision, Robotics …
 General statistics
 Correlations
 Plotting
 Understanding the source and the business behind the data
 Well-understanding of the requirements and the goal of the project
 Feature extraction (What is a good feature?)
 Categories vs. continuous features vs. mixed
 Features Correlation
 Dimensionality reduction (the less the better for generalization)
 Missing values
 Normalization
 Outliers
 Feature selection
 Sampling
 Unbalanced dataset
 Scipy
 Numpy (vectorized data operations)
 Matplotlib (plotting)
 Scikit-Learn
 Pandas (data frames)
 NetworkX (Graphs analysis)
 Pylearn2 (deep learning and neural networks)
 BEARD (entity recognition) – to be presented within the use case
 fit
 predict
 fit_transform
 Transformers
 Pipelines
 Features union
 Feature importances
 @property
 Joblib and parallel computing
CERN Digital Library
Source: http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data
Please meetYangYang,YangYang, andYangYang!
Source: Laura Rueda Garcia, CERN
Beard is a Python library of machine learning tools for Bibliographic Entity
Automatic Recognition and Disambiguation.
 Signature: Unique occurrence of author name and publication id
 We have over:
 1.07 million publications
 9.27 millions signatures
 1.2 million claimed signatures (Ground-truth)
 Pair of signature (sig1, sig2, {0,1})
 Need for deviding into blocks
 Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
 BEARD: https://github.com/inveniosoftware/beard
 Scikit-learn: http://scikit-learn.org
 Hussein, AL-NATSHEH. "Bibliographic Entity Automatic Recognition and
Disambiguation.“, CERN, 2015
https://preprints.cern.ch/record/2036112/files/CERN-THESIS-2015-098.pdf
 Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel,Vlad Niculae et al. "API design for machine learning software:
experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
http://arxiv.org/pdf/1309.0238.pdf
Practice on:
 https://ch.linkedin.com/in/natsheh
 https://twitter.com/hnatsheh
 h.natsheh@ciapple.com Thanks for attending!

More Related Content

What's hot

Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
Editor IJARCET
 
proposal_pura
proposal_puraproposal_pura
proposal_pura
Erick Lin
 
Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its application
Hưng Đặng
 
SCALING THE HTM SPATIAL POOLER
SCALING THE HTM SPATIAL POOLERSCALING THE HTM SPATIAL POOLER
SCALING THE HTM SPATIAL POOLER
ijaia
 

What's hot (20)

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
Bhadale group of companies ai neural networks and algorithms catalogue
Bhadale group of companies ai neural networks and algorithms catalogueBhadale group of companies ai neural networks and algorithms catalogue
Bhadale group of companies ai neural networks and algorithms catalogue
 
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
A Novel Classification via Clustering Method for Anomaly Based Network Intrus...
 
G44093135
G44093135G44093135
G44093135
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
 
proposal_pura
proposal_puraproposal_pura
proposal_pura
 
Artificial neural networks and its application
Artificial neural networks and its applicationArtificial neural networks and its application
Artificial neural networks and its application
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Comparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data AnalysisComparison Between Clustering Algorithms for Microarray Data Analysis
Comparison Between Clustering Algorithms for Microarray Data Analysis
 
20170412 om patri pres 153pdf
20170412 om patri pres 153pdf20170412 om patri pres 153pdf
20170412 om patri pres 153pdf
 
Proximity aware local-recoding anonymization with map reduce for scalable big...
Proximity aware local-recoding anonymization with map reduce for scalable big...Proximity aware local-recoding anonymization with map reduce for scalable big...
Proximity aware local-recoding anonymization with map reduce for scalable big...
 
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
The Art and Power of Data-Driven Modeling: Statistical and Machine Learning A...
 
Multispectral image analysis using random
Multispectral image analysis using randomMultispectral image analysis using random
Multispectral image analysis using random
 
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODELAN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
AN EFFECTIVE SEMANTIC ENCRYPTED RELATIONAL DATA USING K-NN MODEL
 
A0310112
A0310112A0310112
A0310112
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware Detection
 
Handwritten bangla-digit-recognition-using-deep-learning
Handwritten bangla-digit-recognition-using-deep-learningHandwritten bangla-digit-recognition-using-deep-learning
Handwritten bangla-digit-recognition-using-deep-learning
 
Image recognition
Image recognitionImage recognition
Image recognition
 
SCALING THE HTM SPATIAL POOLER
SCALING THE HTM SPATIAL POOLERSCALING THE HTM SPATIAL POOLER
SCALING THE HTM SPATIAL POOLER
 

Viewers also liked

Social values of open source - By Issa Mahasneh
Social values of open source - By Issa MahasnehSocial values of open source - By Issa Mahasneh
Social values of open source - By Issa Mahasneh
Jordan Open Source Association
 
"The JavaScript Design Patterns You have to Know" by Rashad Majali
"The JavaScript Design Patterns You have to Know" by Rashad Majali"The JavaScript Design Patterns You have to Know" by Rashad Majali
"The JavaScript Design Patterns You have to Know" by Rashad Majali
Jordan Open Source Association
 

Viewers also liked (17)

A taste of Functional Programming
A taste of Functional ProgrammingA taste of Functional Programming
A taste of Functional Programming
 
JOSA TechTalk: Introduction to docker
JOSA TechTalk: Introduction to dockerJOSA TechTalk: Introduction to docker
JOSA TechTalk: Introduction to docker
 
D programming language
D programming languageD programming language
D programming language
 
Drupal - Crash Introduction
Drupal - Crash IntroductionDrupal - Crash Introduction
Drupal - Crash Introduction
 
Social values of open source - By Issa Mahasneh
Social values of open source - By Issa MahasnehSocial values of open source - By Issa Mahasneh
Social values of open source - By Issa Mahasneh
 
"The JavaScript Design Patterns You have to Know" by Rashad Majali
"The JavaScript Design Patterns You have to Know" by Rashad Majali"The JavaScript Design Patterns You have to Know" by Rashad Majali
"The JavaScript Design Patterns You have to Know" by Rashad Majali
 
Material Design by Sultan Shalakhti @ AMMxDROID
Material Design by Sultan Shalakhti @ AMMxDROIDMaterial Design by Sultan Shalakhti @ AMMxDROID
Material Design by Sultan Shalakhti @ AMMxDROID
 
Material Design (The Technical Essentials) by Mohammad Aljobairi @AMMxDROID
Material Design (The Technical Essentials) by Mohammad Aljobairi @AMMxDROIDMaterial Design (The Technical Essentials) by Mohammad Aljobairi @AMMxDROID
Material Design (The Technical Essentials) by Mohammad Aljobairi @AMMxDROID
 
JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts JOSA TechTalk: Realtime monitoring and alerts
JOSA TechTalk: Realtime monitoring and alerts
 
Intro to the Principles of Graphic Design
Intro to the Principles of Graphic DesignIntro to the Principles of Graphic Design
Intro to the Principles of Graphic Design
 
JOSA TechTalks - Compilers, Transpilers, and Why You Should Care
JOSA TechTalks - Compilers, Transpilers, and Why You Should CareJOSA TechTalks - Compilers, Transpilers, and Why You Should Care
JOSA TechTalks - Compilers, Transpilers, and Why You Should Care
 
JOSA TechTalk: Taking Docker to Production
JOSA TechTalk: Taking Docker to ProductionJOSA TechTalk: Taking Docker to Production
JOSA TechTalk: Taking Docker to Production
 
JOSA TechTalks - Big Data on Hadoop
JOSA TechTalks - Big Data on HadoopJOSA TechTalks - Big Data on Hadoop
JOSA TechTalks - Big Data on Hadoop
 
Intro to Graphic Design Elements
Intro to Graphic Design ElementsIntro to Graphic Design Elements
Intro to Graphic Design Elements
 
JOSA TechTalk: Introduction to Supervised Learning
JOSA TechTalk: Introduction to Supervised LearningJOSA TechTalk: Introduction to Supervised Learning
JOSA TechTalk: Introduction to Supervised Learning
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
Web app architecture
Web app architectureWeb app architecture
Web app architecture
 

Similar to JOSA TechTalks - Machine Learning in Practice

Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Mining
butest
 

Similar to JOSA TechTalks - Machine Learning in Practice (20)

Machine learning in computer security
Machine learning in computer securityMachine learning in computer security
Machine learning in computer security
 
Data clustring
Data clustring Data clustring
Data clustring
 
Deep Learning and Watson Studio
Deep Learning and Watson StudioDeep Learning and Watson Studio
Deep Learning and Watson Studio
 
Deep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and HypeDeep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and Hype
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...
 
Performance Evaluation of Classifiers used for Identification of Encryption A...
Performance Evaluation of Classifiers used for Identification of Encryption A...Performance Evaluation of Classifiers used for Identification of Encryption A...
Performance Evaluation of Classifiers used for Identification of Encryption A...
 
Data Clustering Using Swarm Intelligence Algorithms An Overview
Data Clustering Using  Swarm Intelligence Algorithms  An OverviewData Clustering Using  Swarm Intelligence Algorithms  An Overview
Data Clustering Using Swarm Intelligence Algorithms An Overview
 
Introduction to parallel iterative deep learning on hadoop’s next​ generation...
Introduction to parallel iterative deep learning on hadoop’s next​ generation...Introduction to parallel iterative deep learning on hadoop’s next​ generation...
Introduction to parallel iterative deep learning on hadoop’s next​ generation...
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
cluster.pptx
cluster.pptxcluster.pptx
cluster.pptx
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence dataData Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep Learning
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 
GTSRB Traffic Sign recognition using machine learning
GTSRB Traffic Sign recognition using machine learningGTSRB Traffic Sign recognition using machine learning
GTSRB Traffic Sign recognition using machine learning
 
Paper id 26201478
Paper id 26201478Paper id 26201478
Paper id 26201478
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Mining
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 

More from Jordan Open Source Association

More from Jordan Open Source Association (13)

JOSA TechTalks - Data Oriented Architecture
JOSA TechTalks - Data Oriented ArchitectureJOSA TechTalks - Data Oriented Architecture
JOSA TechTalks - Data Oriented Architecture
 
JOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured DataJOSA TechTalks - Machine Learning on Graph-Structured Data
JOSA TechTalks - Machine Learning on Graph-Structured Data
 
OpenSooq Mobile Infrastructure @ Scale
OpenSooq Mobile Infrastructure @ ScaleOpenSooq Mobile Infrastructure @ Scale
OpenSooq Mobile Infrastructure @ Scale
 
Data-Driven Digital Transformation
Data-Driven Digital TransformationData-Driven Digital Transformation
Data-Driven Digital Transformation
 
Data Science in Action
Data Science in ActionData Science in Action
Data Science in Action
 
Processing Arabic Text
Processing Arabic TextProcessing Arabic Text
Processing Arabic Text
 
JOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your CostsJOSA TechTalks - Downgrade your Costs
JOSA TechTalks - Downgrade your Costs
 
JOSA TechTalks - Docker in Production
JOSA TechTalks - Docker in ProductionJOSA TechTalks - Docker in Production
JOSA TechTalks - Docker in Production
 
JOSA TechTalks - Word Embedding and Word2Vec Explained
JOSA TechTalks - Word Embedding and Word2Vec ExplainedJOSA TechTalks - Word Embedding and Word2Vec Explained
JOSA TechTalks - Word Embedding and Word2Vec Explained
 
JOSA TechTalks - Better Web Apps with React and Redux
JOSA TechTalks - Better Web Apps with React and ReduxJOSA TechTalks - Better Web Apps with React and Redux
JOSA TechTalks - Better Web Apps with React and Redux
 
JOSA TechTalks - RESTful API Concepts and Best Practices
JOSA TechTalks - RESTful API Concepts and Best PracticesJOSA TechTalks - RESTful API Concepts and Best Practices
JOSA TechTalks - RESTful API Concepts and Best Practices
 
Letter to FCC
Letter to FCCLetter to FCC
Letter to FCC
 
Open Government Data
Open Government DataOpen Government Data
Open Government Data
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

JOSA TechTalks - Machine Learning in Practice

  • 1. Hussein AL-NATSHEH, CIAPPLE Jordan Open Source Association Amman, Aug 22, 2015 Hussein AL-NATSHEH ©2015
  • 2.  From supervised to unsupervised learning  Identifying a ML problem?  Recommended python libraries  Data preparation  Case study; Author name disambiguation of CERN digital library (use case)  References, recommended readings and courses
  • 3.
  • 4.
  • 5.
  • 6.  A flat partition (set of clusters or segments)  A hierarchical tree or taxonomy (a set of nested partitions)  Hard or soft (or fuzzy) memberships to clusters
  • 7.  K-Means, Expectation Maximization  Hierarchal Agglomerative Clustering (HAC)  Density-based spatial clustering of applications with noise (DBSCAN)  Graph-based clustering (Modularity, Weighted communities, Clique percolation)  Unsupervised Neural Networks (Adaptive Resonance Theory)
  • 9. Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 10.  Graph methods  Single link  Complete link  Average  Weighted  Geometric methods  Ward  Centroid  Median Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 11.  Some distance functions  Euclidean distance  Jaccard Similarity  Cosine Similarity  Edit distance  Distance estimator  Model an estimator to be used with a predict probability function  Better for merging feature importances  Dissimilarity (affinity) matrix
  • 12. Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 13.  Single Link method import scipy.cluster.hierarchy as hac def __init__(self,method="single",affinity="euclidean“) self.linkage_ = hac.linkage(X, method=self.method, metric=self.affinity) Image source: Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 14. Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 15. scipy.cluster.hierarchy.dendrogram() Source:Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014
  • 16.  Flat cutting forming partitioned clustering from HAC  Semi-supervised clustering  DBSCAN labels = hac.fcluster(self.linkage_,threshold)
  • 17. 1 and 5 must be in the same cluster  n_clust = 2 Image Source: Matlab
  • 18.
  • 19.  Classification, recognition and completion  Decision trees  Supervised Neural Networks  Graph models: Hidden Markov Model, Bayesian Nets  Optimization  Genetic Algorithms  Numerical Optimization and regression  Smart Control  Fuzzy rules
  • 20.  Clustering, grouping and segmentation  Unsupervised Neural Networks  HAC, K-Means, DBSCAN, Graph-based clustering  Simulation  Multi-agent systems  Relations  Graph networks  Collaborative filtering  Association rules and frequent sets
  • 21.  Text mining: Topic and concept extraction  Unsupervised: Clustering, Graph and ontologies. Semi-supervised  Pattern recognition: Character, speech and image recognition  Supervised: Classification  Cross selling and personalized recommendation  Collaborative filtering, Association rules, frequent sets  Churn management and customer retention  classification and clustering  Games  search for optimal solution and multi-agent systems
  • 22.  Financials: risk management in insurance fraud detection  Graph, classification, clustering  Security: Intrusion detection, spam filtering  Supervised: Classification  Social networks: friends suggestion  Graph  Bioinformatics (DNA)  High energy physics (Higgs)  Computer vision, Robotics …
  • 23.  General statistics  Correlations  Plotting  Understanding the source and the business behind the data  Well-understanding of the requirements and the goal of the project
  • 24.
  • 25.  Feature extraction (What is a good feature?)  Categories vs. continuous features vs. mixed  Features Correlation  Dimensionality reduction (the less the better for generalization)  Missing values  Normalization  Outliers  Feature selection  Sampling  Unbalanced dataset
  • 26.
  • 27.  Scipy  Numpy (vectorized data operations)  Matplotlib (plotting)  Scikit-Learn  Pandas (data frames)  NetworkX (Graphs analysis)  Pylearn2 (deep learning and neural networks)  BEARD (entity recognition) – to be presented within the use case
  • 28.  fit  predict  fit_transform  Transformers  Pipelines  Features union  Feature importances  @property  Joblib and parallel computing
  • 32. Beard is a Python library of machine learning tools for Bibliographic Entity Automatic Recognition and Disambiguation.
  • 33.  Signature: Unique occurrence of author name and publication id  We have over:  1.07 million publications  9.27 millions signatures  1.2 million claimed signatures (Ground-truth)  Pair of signature (sig1, sig2, {0,1})  Need for deviding into blocks
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.  Data Clustering- Part 2, M2 DMKM Slides, Julien Ah-Pine, Universit´e Lyon 2, 2014  BEARD: https://github.com/inveniosoftware/beard  Scikit-learn: http://scikit-learn.org  Hussein, AL-NATSHEH. "Bibliographic Entity Automatic Recognition and Disambiguation.“, CERN, 2015 https://preprints.cern.ch/record/2036112/files/CERN-THESIS-2015-098.pdf  Buitinck, Lars, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel,Vlad Niculae et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013). http://arxiv.org/pdf/1309.0238.pdf