In this talk I will review several real-world applications and tools developed at the University of Waikato over the past 15 years. The early applications focused on agricultural problems such as cow culling, venison bruising and grass grubs. Following this we looked at the use of near infrared spectroscopy coupled with data mining as an alternative laboratory technique for predicting compound concentrations in soil and plant samples. Our latest application is in the area of gas chromatography mass spectrometry (GCMS), a technique used in environmental applications to determine, for example, the petroleum content of soil and water samples.
2. Outline
• What application development have we done?
• What lessons have we learned?
• What is needed in terms of the future of machine learning
application development?
3. Applications – a taxonomy
• UCI data sets – very much like our early agricultural data
• Competition data – usually larger than above, often
difficult
• Signal control applications (often involve reinforcement
learning) – eg autonomous helicopters, vehicles, learning
the signature of a great pianist, learning to sail, learning to
drive racing cars faster, learning to play soccer (often
linked to robotics)
• Key to success = objective measurement – eg Human
Computer Interaction, Speech and Image Recognition,
Computer Games, etc.
4. WEKA – Waikato Environment for Knowledge Analysis
• Machine Learning at Waikato started in 1993
• Build an interface to enable several ML
methods to be compared on same data
• Explore datasets of importance to the
agricultural sector in NZ
• Apple bruising, Venison bruising, Bull behaviour, Grass
grubs, Pasture production, Pea seed colour, Slugs,
Squash harvest, Wasp nests, White clover persistence
• Cow culling
• Datasets very much of the “bring out your dead” variety
5. WEKA – unscientific study from Google Scholar
• For the query “WEKA applications”
• Bioinformatics
• Grid Computing
• Medicine
• Business and Finance
• Computer Networks
• Education
6. Early lessons learned
• Using WEKA is good but only static solutions are possible
• Datasets need to be large enough to yield significant and
meaningful results
• Datasets involving human judgement tend to be unreliable
7. Scientific Equipment Application Methodology
• Obtain samples and reference data from existing
technology (eg wet chemistry) – establish targets Y.
• Process same samples using a proxy (eg NIR) – new X
• Construct new dataset with new X and Y
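The three steps above amount to joining two measurement sources on a shared sample identifier. A minimal sketch, assuming each sample carries an ID that appears in both the reference results and the proxy readings; the function name, IDs and values are hypothetical and only for illustration:

```python
def build_dataset(spectra, reference):
    """Pair each proxy measurement (X) with its reference value (Y) by sample ID.

    Samples missing from either source are dropped, since a training
    row needs both an input and a target.
    """
    rows = []
    for sample_id in sorted(spectra.keys() & reference.keys()):
        rows.append((sample_id, spectra[sample_id], reference[sample_id]))
    return rows


# Hypothetical toy data: three-point "spectra" and one reference value per sample.
spectra = {
    "s1": [0.12, 0.40, 0.33],
    "s2": [0.10, 0.42, 0.31],
    "s3": [0.15, 0.39, 0.35],  # no reference value available
}
reference = {"s1": 2.1, "s2": 1.8, "s4": 2.5}  # s4 has no spectrum

dataset = build_dataset(spectra, reference)
X = [x for _, x, _ in dataset]  # proxy inputs (eg NIR)
Y = [y for _, _, y in dataset]  # targets from the existing technology
```

Only samples measured by both techniques survive the join, so the size of the usable dataset is bounded by the overlap between the two sources.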
8. Near Infrared Spectroscopy
• Once the concept was proven we needed a system to support
commercial use (ie alongside the LIMS)
• Developed S2 (with WEKA interface):
• Used continuously at Hill Laboratories and BLGG
(Holland) since around 2005 – never gone wrong!
• So far it is the best application of the technology that we
have ever come across.
• Faster than wet chemistry
• Predictions can be more accurate
• Large cost savings – multiple analyses per sample
10. NIR – lessons learned
• Very lightweight input/output solution using dropbox
methodology was successful as it is transparent and
seamless alongside a LIMS.
• Instrument data is extremely reliable
• In this industry, compliance is important, which implies
that a single algorithm is better than choosing the best
method per dataset.
• As data is abundant, models are rebuilt from time to time.
• No facility for users to develop new applications.
11. Gas Chromatography Mass Spectrometry
• Analytical instrument that combines the features of gas
chromatography and mass spectrometry to identify
different substances within a test sample
• Typical Applications
• Environmental monitoring
• Food and beverage analysis
• Criminal forensics (CSI!)
• Drugs/explosives detection
16. Solution = Advanced DAta Mining System (ADAMS)
• get database IDs of chromatograms
• load chromatograms from DB
• identify and reject outliers
• obtain calibration set information,
check correctness of set
• align with calibration chromatogram,
check correlation
• compound-specific outlier detection
• generate artificial chromatogram
with peaks of compound and spike
compound
• generate output for WEKA
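The "align with calibration chromatogram, check correlation" step can be illustrated with a simple shift search. The slides do not say how the alignment is actually done, so this is only a sketch of one plausible approach: try a range of integer shifts, keep the one with the highest Pearson correlation, and accept the result only if that correlation clears a threshold. The function names, `max_shift` and `min_corr` are assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def shift(signal, k):
    """Shift a signal right by k points (left if k is negative), padding with zeros."""
    n = len(signal)
    out = [0.0] * n
    for i, value in enumerate(signal):
        j = i + k
        if 0 <= j < n:
            out[j] = value
    return out


def align(sample, calibration, max_shift=5, min_corr=0.9):
    """Return (aligned sample, best shift, best correlation, accepted?)."""
    best_k, best_r = 0, pearson(sample, calibration)
    for k in range(-max_shift, max_shift + 1):
        r = pearson(shift(sample, k), calibration)
        if r > best_r:
            best_k, best_r = k, r
    return shift(sample, best_k), best_k, best_r, best_r >= min_corr


# Hypothetical intensities: the sample is the calibration trace delayed by two points.
calibration = [0, 0, 1, 5, 1, 0, 0, 2, 0, 0, 0, 0]
sample = [0, 0, 0, 0, 1, 5, 1, 0, 0, 2, 0, 0]

aligned, best_shift, corr, ok = align(sample, calibration)
```

A chromatogram whose best correlation stays below the threshold would be rejected rather than passed on to the later steps.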
17. Limitations and future directions
• What we have seen so far works with data resident in memory
(RAM) all the time
• This implies a limit can easily be reached, esp in applications
like GCMS.
• We would like to be able to learn from potentially infinite data
sources but with finite memory (RAM).
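To make the finite-memory idea concrete, here is a minimal sketch of a learner that sees each example once and never stores it: simple least-squares regression on one input, kept as five running sums. It is an illustration only, not a method taken from the talk, and the class and generator names are hypothetical. The raw-sum formulation is also numerically naive for very large values.

```python
class StreamingLinearRegression:
    """Least-squares fit of y = slope * x + intercept using constant memory."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, x, y):
        """Fold one example into the running sums; the example is then discarded."""
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def coefficients(self):
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept


def stream():
    """Hypothetical data source: yields (x, y) pairs one at a time."""
    for i in range(100_000):
        x = i % 50
        yield x, 2.0 * x + 1.0


model = StreamingLinearRegression()
for x, y in stream():
    model.update(x, y)

slope, intercept = model.coefficients()
```

Memory use is the same after one hundred examples or one hundred thousand, which is the property the slide asks for.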
19. Future Directions
• Investigate how to get users to deploy their own DM
solutions
• Implement incremental pre-processing techniques (Joao
has already started!), eg incremental outlier detection.
• Implement incremental algs esp. for regression.
• Encourage work on abstention classifiers, uncertainty
associated with point predictions etc.
• Meta-mine which units of a workflow are useful in tandem
• Investigate fusion: ADAMS with MOA, data
(image+features), tasks (multiview, multitask, transfer)
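As an illustration of what incremental outlier detection could look like, the sketch below keeps a running mean and variance (Welford's update) and flags a value that lies more than a set number of standard deviations from the mean seen so far. This is not the implementation referred to on the slide; the class name, `threshold` and `warmup` are assumptions.

```python
import math


class StreamingOutlierDetector:
    """Flags values far from the running mean, using constant memory."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # allowed distance in standard deviations
        self.warmup = warmup        # values to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def is_outlier(self, value):
        """Score a value against past statistics; only inliers update them."""
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = std > 0 and abs(value - self.mean) > self.threshold * std
        if not flagged:
            self._update(value)
        return flagged

    def _update(self, value):
        # Welford's one-pass update of the mean and squared deviations.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


# Hypothetical instrument readings with one obvious spike.
readings = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 10.0, 25.0, 10.2]
detector = StreamingOutlierDetector()
flags = [detector.is_outlier(r) for r in readings]
```

Leaving flagged values out of the running statistics is a design choice: it stops a single spike from inflating the variance and hiding later outliers, at the cost of adapting more slowly if the data genuinely drifts.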