Developing Machine Learning Applications
Geoff Holmes, University of Waikato




Outline
• What application development have we done?
• What lessons have we learned?
• What is needed for the future of machine learning application
  development?




Applications – a taxonomy

• UCI data sets – very much like our early agricultural data
• Competition data – usually larger than above, often
  difficult
• Signal control applications (often involve reinforcement
  learning) – eg autonomous helicopters, vehicles, learning
  the signature of a great pianist, learning to sail, learning to
  drive racing cars faster, learning to play soccer (often
  linked to robotics)
• Key to success = objective measurement – eg Human
  Computer Interaction, Speech and Image Recognition,
  Computer Games, etc.

WEKA – Waikato Environment for Knowledge Analysis
• Machine Learning at Waikato started in 1993
• Build an interface to enable several ML
  methods to be compared on same data
• Explore datasets of importance to the
  agricultural sector in NZ
   • Apple bruising, Venison bruising, Bull behaviour, Grass
     grubs, Pasture production, Pea seed colour, Slugs,
     Squash harvest, Wasp nests, White clover persistence
   • Cow culling
   • Datasets very much of the “bring out your dead” variety

WEKA – unscientific study from Google Scholar
• For the query “WEKA applications”
   • Bioinformatics
   • Grid Computing
   • Medicine
   • Business and Finance
   • Computer Networks
   • Education



Early lessons learned

• Using WEKA is good but only static solutions are possible
• Datasets need to be large enough to yield significant and
  meaningful results
• Datasets involving human judgement tend to be unreliable




Scientific Equipment Application Methodology

• Obtain samples and reference data from existing
  technology (eg wet chemistry) – establish targets Y.
• Process same samples using a proxy (eg NIR) – new X
• Construct new dataset with new X and Y
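
The three steps above amount to joining two measurements of the same samples. A minimal sketch, assuming each sample carries an ID shared by the reference method and the proxy; the function name `build_dataset`, the IDs, and all values are made up for illustration:

```python
from typing import Dict, List, Tuple

def build_dataset(
    reference: Dict[str, float],
    spectra: Dict[str, List[float]],
) -> List[Tuple[str, List[float], float]]:
    """Pair each proxy measurement X with its reference target Y by sample ID."""
    rows = []
    for sample_id in sorted(reference):
        if sample_id in spectra:  # keep only samples measured by both methods
            rows.append((sample_id, spectra[sample_id], reference[sample_id]))
    return rows

# Hypothetical values: Y from the existing technology, X from the proxy.
reference = {"s1": 2.4, "s2": 3.1, "s3": 1.8}
spectra = {
    "s1": [0.12, 0.15, 0.11],
    "s2": [0.20, 0.22, 0.19],
    "s4": [0.30, 0.30, 0.30],
}
dataset = build_dataset(reference, spectra)
```

Samples seen by only one method (here `s3` and `s4`) are dropped, since a row needs both X and Y.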




Near Infrared Spectroscopy
• Once concept was proven we needed a system to support
  commercial use (ie alongside the LIMS)
• Developed S2 (with WEKA interface):
   • Used continuously at Hill Laboratories and BLGG
     (Holland) since around 2005 – never gone wrong!
• So far it is the best application of the technology that we
  have ever come across.
• Faster than wet chemistry
• Predictions can be more accurate
• Large cost savings – multiple analyses per sample

S2




NIR – lessons learned

• Very lightweight input/output solution using dropbox
  methodology was successful as it is transparent and
  seamless alongside a LIMS.
• Instrument data is extremely reliable
• In this industry, compliance is important, which implies
  that a single algorithm is better than choosing the best
  method per dataset.
• As data is abundant, models are rebuilt from time to time.
• No facility for users to develop new applications.
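
One way to picture the dropbox approach is a watched folder: the LIMS drops a file in, a prediction file appears in another folder. The sketch below assumes that reading of "dropbox"; it is not the S2 implementation, and the file layout, the name `process_drop_folder`, and the stand-in `predict` function are all hypothetical:

```python
import tempfile
from pathlib import Path
from statistics import mean
from typing import Callable, List

def process_drop_folder(
    inbox: Path,
    outbox: Path,
    predict: Callable[[List[float]], float],
) -> int:
    """Handle every *.csv in inbox once: write a prediction, then remove the input."""
    handled = 0
    for path in sorted(inbox.glob("*.csv")):
        values = [float(v) for v in path.read_text().strip().split(",")]
        (outbox / (path.stem + ".out")).write_text(f"{predict(values):.4f}\n")
        path.unlink()  # input consumed, so a later poll does not repeat it
        handled += 1
    return handled

# Demo in a temporary directory; the mean stands in for a trained model.
with tempfile.TemporaryDirectory() as tmp:
    inbox, outbox = Path(tmp) / "in", Path(tmp) / "out"
    inbox.mkdir()
    outbox.mkdir()
    (inbox / "sample1.csv").write_text("1.0,2.0,3.0\n")
    handled = process_drop_folder(inbox, outbox, predict=mean)
    result_text = (outbox / "sample1.out").read_text()
    remaining = list(inbox.iterdir())
```

The calling system only needs to write and read files, which is what keeps this style of integration lightweight.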

Gas Chromatography Mass Spectrometry

• Analytical instrument that combines the features of gas
  chromatography and mass spectrometry to identify
  different substances within a test sample
• Typical Applications
   • Environmental monitoring
   • Food and beverage analysis
   • Criminal forensics (CSI!)
   • Drugs/explosives detection


Example Chromatogram (PAH) – ion counts




MS fingerprints




Machine Learning Approach
• Chromatograms are pre-processed to extract features
• Dataset constructed combining pre-processed
  chromatograms with analyst checked compound
  concentrations
• Learn the relationship between pre-processed
  chromatograms and compound concentrations:
   • extensive pre-processing of data
   • parallel processing – 5000 * 300 values per instance
     (NIR = 1000)
   • pre-processing varies among compounds
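
To put the instance sizes in perspective, the arithmetic can be spelled out. The value counts come from the slide; the 8 bytes per value is an assumption (64-bit floats), so the memory figure is only indicative:

```python
GCMS_VALUES = 5000 * 300  # values per GC-MS instance, from the slide
NIR_VALUES = 1000         # values per NIR instance, from the slide
BYTES_PER_VALUE = 8       # assumption: each value stored as a 64-bit float

ratio = GCMS_VALUES // NIR_VALUES                      # how much larger one instance is
mb_per_instance = GCMS_VALUES * BYTES_PER_VALUE / 1e6  # megabytes per instance

print(f"{GCMS_VALUES:,} values per instance, {ratio}x NIR, "
      f"about {mb_per_instance:.0f} MB each")
```

Under that assumption a single GC-MS instance holds 1,500,000 values, 1500 times an NIR instance, and takes roughly 12 MB.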

Process Requirements




Solution = Advanced DAta Mining System
• get database IDs of chromatograms
• load chromatograms from DB
• identify and reject outliers
• obtain calibration set information, check correctness of set
• align with calibration chromatogram, check correlation
• compound-specific outlier detection
• generate artificial chromatogram with peaks of compound and spike
  compound
• generate output for WEKA
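
The workflow above is a chain of steps, each passing a filtered batch to the next. The sketch below illustrates only that chaining idea for two of the bullets (outlier rejection and the correlation check against a calibration chromatogram). It does not use the ADAMS API; the step names, the flat-signal outlier rule, and the 0.9 threshold are assumptions chosen for illustration:

```python
from statistics import mean
from typing import Callable, List

Chromatogram = List[float]
Step = Callable[[List[Chromatogram]], List[Chromatogram]]

def pearson(a: Chromatogram, b: Chromatogram) -> float:
    """Pearson correlation of two equal-length signals (0.0 if either is flat)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def reject_flat(batch: List[Chromatogram]) -> List[Chromatogram]:
    """Hypothetical outlier rule: drop chromatograms that carry no signal."""
    return [c for c in batch if max(c) != min(c)]

def keep_correlated(calibration: Chromatogram, threshold: float) -> Step:
    """Build a step that keeps chromatograms resembling the calibration one."""
    def step(batch: List[Chromatogram]) -> List[Chromatogram]:
        return [c for c in batch if pearson(c, calibration) >= threshold]
    return step

def run_pipeline(batch: List[Chromatogram], steps: List[Step]) -> List[Chromatogram]:
    """Apply each step in order to the surviving chromatograms."""
    for step in steps:
        batch = step(batch)
    return batch

# Made-up signals: one good match, one flat line, one inverted shape.
calibration = [0.0, 1.0, 5.0, 1.0, 0.0]
batch = [
    [0.0, 2.0, 10.0, 2.0, 0.0],
    [3.0, 3.0, 3.0, 3.0, 3.0],
    [5.0, 1.0, 0.0, 1.0, 5.0],
]
kept = run_pipeline(batch, [reject_flat, keep_correlated(calibration, 0.9)])
```

Keeping every step behind the same signature is what lets steps be reordered or swapped per compound.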

Limitations and future directions
• What we have seen so far works with data resident in memory (RAM)
  all the time
• This implies a limit can easily be reached, esp in applications like
  GCMS.
• We would like to be able to learn from potentially infinite data
  sources but with finite memory (RAM).
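
As a small illustration of learning from a stream with fixed memory, the sketch below fits a one-variable least-squares line from five running sums, so storage does not grow with the number of examples. It is not taken from MOA; the class name `StreamingLinearRegression` and the synthetic stream are assumptions:

```python
from typing import Iterator, Tuple

class StreamingLinearRegression:
    """One-variable least squares kept as five running sums (constant memory)."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def learn_one(self, x: float, y: float) -> None:
        """Fold a single example into the sums, then forget it."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self) -> Tuple[float, float]:
        """Return (slope, intercept) of the least-squares line seen so far."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept

def stream(n: int) -> Iterator[Tuple[float, float]]:
    """Synthetic stream following y = 2x + 1; examples are never stored."""
    for i in range(n):
        x = float(i)
        yield x, 2.0 * x + 1.0

model = StreamingLinearRegression()
for x, y in stream(1000):
    model.learn_one(x, y)
slope, intercept = model.coefficients()
```

Plain running sums can lose precision on very long streams or large values, so a production learner would need a numerically safer update.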




Solution = MOA




Future Directions

• Investigate how to get users to deploy their own DM
  solutions
• Implement incremental pre-processing techniques (Joao
  has already started!), eg incremental outlier detection.
• Implement incremental algorithms, especially for regression.
• Encourage work on abstention classifiers, uncertainty
  associated with point predictions etc.
• Meta-mine which units of a workflow are useful in tandem
• Investigate fusion: ADAMS with MOA, data
  (image+features), tasks (multiview, multitask, transfer)
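
Incremental outlier detection, mentioned above, can be sketched with running statistics: each value is compared against the mean and standard deviation of what has been accepted so far, updated with Welford's algorithm. This is one simple possibility, not the method referred to on the slide; the class name, the 3-sigma threshold, and the warm-up length are assumptions:

```python
from typing import List

class RunningOutlierDetector:
    """Flags values far from the running mean; state is three numbers."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # allowed distance in standard deviations
        self.warmup = warmup        # values accepted before flagging starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Return True if x is an outlier; otherwise fold it into the statistics."""
        is_outlier = False
        if self.n >= self.warmup:
            std = (self.m2 / (self.n - 1)) ** 0.5
            is_outlier = std > 0 and abs(x - self.mean) > self.threshold * std
        if not is_outlier:
            # Welford's update of mean and squared deviations
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return is_outlier

# Made-up readings with one obvious spike.
readings: List[float] = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 50.0, 10.1]
detector = RunningOutlierDetector()
flags = [detector.update(x) for x in readings]
```

Rejected values are left out of the statistics so that a single spike does not widen the accepted range.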

Finally




Questions or comments?



