Developing Machine Learning Applications
Geoff Holmes, University of Waikato




Outline
• What application development have we done?
• What lessons have we learned?
• What is needed in terms of the future of machine learning
  application development?




Applications – a taxonomy

• UCI data sets – very much like our early agricultural data
• Competition data – usually larger than above, often
  difficult
• Signal control applications (often involve reinforcement
  learning) – eg autonomous helicopters, vehicles, learning
  the signature of a great pianist, learning to sail, learning to
  drive racing cars faster, learning to play soccer (often
  linked to robotics)
• Key to success = objective measurement – eg Human
  Computer Interaction, Speech and Image Recognition,
  Computer Games, etc.
WEKA – Waikato Environment for Knowledge Analysis
• Machine Learning at Waikato started in 1993
• Build an interface to enable several ML
  methods to be compared on the same data
• Explore datasets of importance to the
  agricultural sector in NZ
   • Apple bruising, Venison bruising, Bull behaviour, Grass
     grubs, Pasture production, Pea seed colour, Slugs,
     Squash harvest, Wasp nests, White clover persistence
   • Cow culling
   • Datasets very much of the “bring out your dead” variety
WEKA – unscientific study from Google Scholar
• For the query “WEKA applications”
   • Bioinformatics
   • Grid Computing
   • Medicine
   • Business and Finance
   • Computer Networks
   • Education



Early lessons learned

• Using WEKA is good but only static solutions are possible
• Datasets need to be large enough to yield significant and
  meaningful results
• Datasets involving human judgement tend to be unreliable




Scientific Equipment Application Methodology

• Obtain samples and reference data from existing
  technology (eg wet chemistry) – establish targets Y.
• Process the same samples using a proxy (eg NIR) – new X
• Construct new dataset with new X and Y
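
The three steps above can be sketched as a simple join on sample ID. This is only an illustration under assumptions: the function name `build_dataset`, the dictionary layout and the toy values are hypothetical and do not come from the talk.

```python
def build_dataset(reference, spectra):
    """Pair proxy measurements (X) with reference targets (Y) by sample ID.

    reference: dict sample_id -> reference value (e.g. from wet chemistry)
    spectra:   dict sample_id -> list of proxy readings (e.g. an NIR spectrum)
    Samples missing from either source are skipped.
    """
    ids = sorted(set(reference) & set(spectra))
    X = [spectra[i] for i in ids]
    Y = [reference[i] for i in ids]
    return ids, X, Y


# Hypothetical toy data: made-up sample IDs and four-point "spectra".
reference = {"s1": 2.1, "s2": 3.4, "s3": 1.8}
spectra = {
    "s1": [0.10, 0.12, 0.15, 0.11],
    "s2": [0.20, 0.25, 0.27, 0.21],
    "s4": [0.05, 0.06, 0.07, 0.05],  # no reference value, so it is dropped
}

ids, X, Y = build_dataset(reference, spectra)
print(ids, Y)
```

Only samples measured by both the existing technology and the proxy end up in the new dataset, which is then ready for any regression learner.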




Near Infrared Spectroscopy (NIR)
• Once the concept was proven, we needed a system to support
  commercial use (ie alongside the LIMS)
• Developed S2 (with WEKA interface):
   • Used continuously at Hill Laboratories and BLGG
     (Holland) since around 2005 – never gone wrong!
• So far it is the best application of the technology that we
  have ever come across.
• Faster than wet chemistry
• Predictions can be more accurate
• Large cost savings – multiple analyses per sample
S2




NIR – lessons learned

• Very lightweight input/output solution using dropbox
  methodology was successful as it is transparent and
  seamless alongside a LIMS.
• Instrument data is extremely reliable
• In this industry, compliance is important, which implies
  that a single algorithm is better than choosing the best
  method per dataset.
• As data is abundant, models are rebuilt from time to time.
• No facility for users to develop new applications.
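
One way to read the "dropbox methodology" is as a pair of watched folders: the instrument side drops files into an input folder and predictions appear in an output folder. The sketch below assumes that reading. The file format (one number per line), the function names and the placeholder model are all hypothetical and are not taken from S2.

```python
from pathlib import Path
import tempfile


def process_dropbox(in_dir, out_dir, predict):
    """Process every pending *.txt file in in_dir once.

    Each input file holds one number per line (a stand-in for a spectrum).
    The prediction is written to out_dir under the same file name, and the
    input file is removed so it is not processed twice.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    done = []
    for path in sorted(in_dir.glob("*.txt")):
        values = [float(token) for token in path.read_text().split()]
        (out_dir / path.name).write_text(f"{predict(values):.4f}\n")
        path.unlink()
        done.append(path.name)
    return done


def mean_model(values):
    # Placeholder for the single, fixed model applied to every sample.
    return sum(values) / len(values)


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    inbox, outbox = Path(tmp, "in"), Path(tmp, "out")
    inbox.mkdir()
    outbox.mkdir()
    (inbox / "sample1.txt").write_text("1.0\n2.0\n3.0\n")
    processed = process_dropbox(inbox, outbox, mean_model)
    result = (outbox / "sample1.txt").read_text().strip()
    remaining = list(inbox.iterdir())

print(processed, result)
```

The surrounding system only needs to write and read files, so the learner stays invisible to it. In a deployment the function would be called on a schedule or from a file-system watcher.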

Gas Chromatography Mass Spectrometry (GCMS)

• Analytical instrument that combines the features of gas
  chromatography and mass spectrometry to identify
  different substances within a test sample
• Typical Applications
   • Environmental monitoring
   • Food and beverage analysis
   • Criminal forensics (CSI!)
   • Drugs/explosives detection


Example Chromatogram (PAH) – ion counts




MS fingerprints




Machine Learning Approach
• Chromatograms are pre-processed to extract features
• Dataset constructed combining pre-processed
  chromatograms with analyst checked compound
  concentrations
• Learn the relationship between pre-processed
  chromatograms and compound concentrations:
   • extensive pre-processing of data
   • parallel processing – 5000 × 300 values per instance
     (NIR = 1000 values per instance)
   • pre-processing varies among compounds
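
The slides do not say which features are extracted, so the following is only a plausible sketch of one pre-processing step. It treats a chromatogram as a list of scans, each holding ion counts per m/z channel, collapses each scan to its total ion count, and averages the resulting trace into a fixed number of bins. The function names and the toy matrix are assumptions.

```python
def total_ion_current(scans):
    """Sum ion counts over all m/z channels for each scan."""
    return [sum(scan) for scan in scans]


def bin_means(signal, n_bins):
    """Reduce a signal to n_bins values by averaging contiguous segments."""
    n = len(signal)
    if not 0 < n_bins <= n:
        raise ValueError("n_bins must be between 1 and len(signal)")
    features = []
    for b in range(n_bins):
        # Integer arithmetic keeps the segment boundaries exact.
        start, stop = b * n // n_bins, (b + 1) * n // n_bins
        segment = signal[start:stop]
        features.append(sum(segment) / len(segment))
    return features


# Toy chromatogram: 6 scans x 3 m/z channels (a real one is far larger).
scans = [[1, 0, 2], [0, 3, 1], [5, 5, 0], [2, 2, 2], [0, 0, 1], [1, 1, 1]]
tic = total_ion_current(scans)
features = bin_means(tic, 3)
print(tic, features)
```

A reduction of this kind turns the very large raw instance into a short feature vector that can be paired with the analyst-checked concentration.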
Process Requirements




Solution = ADAMS (Advanced DAta Mining System)
• get database IDs of chromatograms
• load chromatograms from DB
• identify and reject outliers
• obtain calibration set information, check correctness of set
• align with calibration chromatogram, check correlation
• compound-specific outlier detection
• generate artificial chromatogram with peaks of compound and spike
  compound
• generate output for WEKA
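
As an illustration of the "align with calibration chromatogram, check correlation" step, the sketch below shifts a signal by the lag that maximises its Pearson correlation with a calibration signal, then applies an acceptance threshold. This is a generic technique chosen for the example, not necessarily what ADAMS does. The function names, the zero padding and the 0.9 threshold are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def shift(signal, lag):
    """Shift right by lag samples (left if lag is negative), padding with zeros."""
    n = len(signal)
    if lag >= 0:
        return [0.0] * lag + signal[: n - lag]
    return signal[-lag:] + [0.0] * (-lag)


def align(signal, calibration, max_lag):
    """Return the shifted signal and the lag that best matches the calibration."""
    best = max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: pearson(shift(signal, lag), calibration),
    )
    return shift(signal, best), best


calibration = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0]
sample = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]  # same peak, two scans late

aligned, lag = align(sample, calibration, max_lag=3)
correlation = pearson(aligned, calibration)
accepted = correlation >= 0.9  # hypothetical acceptance threshold
print(lag, accepted)
```

A chromatogram whose best achievable correlation stays below the threshold would be rejected before it reaches the learner.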
Limitations and future directions
• What we have seen so far
  works with data resident in
  memory (RAM) all the time
• This implies a limit can easily
  be reached, esp in
  applications like GCMS.
• We would like to be able to
  learn from potentially infinite
  data sources but with finite
  memory (RAM).
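
To make the finite-memory idea concrete, here is a minimal sketch of a learner that sees each instance once and keeps only its weights. It is a plain stochastic-gradient linear regressor evaluated test-then-train, written for illustration. It does not use MOA's API, and the class name, learning rate and synthetic stream are assumptions.

```python
import random


class OnlineLinearRegressor:
    """Linear model updated one instance at a time by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def learn(self, x, y):
        # Gradient step on the squared error of this single instance.
        error = self.predict(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


def stream(n, seed=0):
    """Generator standing in for an unbounded source: y = 2*x0 - x1 + 0.5."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2 * x[0] - x[1] + 0.5


model = OnlineLinearRegressor(n_features=2)
n_seen, sum_abs_error = 0, 0.0
for x, y in stream(5000):
    sum_abs_error += abs(model.predict(x) - y)  # test on the instance first
    model.learn(x, y)                           # then train on it and discard it
    n_seen += 1

mean_abs_error = sum_abs_error / n_seen
print(model.w, model.b, mean_abs_error)
```

Memory use is constant in the number of instances: the loop holds one instance at a time plus the weights and two running counters.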




Solution = MOA (Massive Online Analysis)




Future Directions

• Investigate how to get users to deploy their own DM
  solutions
• Implement incremental pre-processing techniques (Joao
  has already started!), eg incremental outlier detection.
• Implement incremental algs esp. for regression.
• Encourage work on abstention classifiers, uncertainty
  associated with point predictions etc.
• Meta-mine which units of a workflow are useful in tandem
• Investigate fusion: ADAMS with MOA, data
  (image+features), tasks (multiview, multitask, transfer)
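
Incremental outlier detection, mentioned above, can be sketched with running statistics. The example keeps a running mean and variance (Welford's update) and flags a value that lies more than a set number of standard deviations from the mean. It is one simple possibility, not the method referred to in the talk. The class name, threshold and warm-up length are assumptions.

```python
import math


class IncrementalOutlierDetector:
    """Flags values far from a running mean without storing past values."""

    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.threshold = threshold
        self.warmup = warmup

    def update(self, x):
        """Return True if x is flagged; only unflagged values update the statistics."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_outlier = std > 0 and abs(x - self.mean) > self.threshold * std
        if not is_outlier:
            # Welford's single-pass update of mean and squared deviations.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return is_outlier


values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 10.0, 25.0, 10.2]
detector = IncrementalOutlierDetector()
flags = [detector.update(v) for v in values]
flagged = [i for i, flag in enumerate(flags) if flag]
print(flagged)
```

Leaving flagged values out of the statistics stops a single spike from inflating the variance, at the cost of adapting more slowly if the data genuinely drift.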
Finally




Questions or
Comments?




