Machine Learning Application Development

Developing Machine Learning Applications
Geoff Holmes, University of Waikato

Outline
• What application development have we done?
• What lessons have we learned?
• What does the future of machine learning application
development need?

Applications – a taxonomy

• UCI data sets – very much like our early agricultural data
• Competition data – usually larger than above, often
difficult
• Signal control applications (often involving reinforcement
learning) – e.g. autonomous helicopters, vehicles, learning
the signature of a great pianist, learning to sail, learning to
drive racing cars faster, learning to play soccer (often
linked to robotics)
• Key to success = objective measurement – e.g. Human
Computer Interaction, Speech and Image Recognition,
Computer Games, etc.

WEKA – Waikato Environment for Knowledge Analysis
• Machine Learning at Waikato started in 1993
• Build an interface to enable several ML
methods to be compared on the same data
• Explore datasets of importance to the
agricultural sector in NZ
• Apple bruising, Venison bruising, Bull behaviour, Grass
grubs, Pasture production, Pea seed colour, Slugs,
Squash harvest, Wasp nests, White clover persistence
• Cow culling
• Datasets were very much of the “bring out your dead” variety

WEKA – unscientific study from Google Scholar
• For the query “WEKA applications”
• Bioinformatics
• Grid Computing
• Medicine
• Business and Finance
• Computer Networks
• Education

Early lessons learned

• Using WEKA is good but only static solutions are possible
• Datasets need to be large enough to yield significant and
meaningful results
• Datasets involving human judgement tend to be unreliable

Scientific Equipment Application Methodology

• Obtain samples and reference data from existing
technology (e.g. wet chemistry) – establish targets Y.
• Process the same samples using a proxy (e.g. NIR) – new X
• Construct a new dataset from the new X and Y
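
The three steps above can be sketched as a join on sample ID. This is a minimal illustration only: the sample IDs, spectra and reference values are hypothetical, and the slides do not say how the datasets were actually assembled.

```python
# Hypothetical data for illustration only.
# Y: reference values from the existing technology (e.g. wet chemistry).
reference = {"s1": 2.1, "s2": 3.4, "s3": 1.8}
# X: proxy measurements (e.g. NIR spectra) for the same samples.
spectra = {"s1": [0.11, 0.25], "s2": [0.14, 0.31], "s4": [0.09, 0.20]}


def build_dataset(spectra, reference):
    """Pair each proxy measurement X with its reference target Y by sample ID.

    Samples missing from either source are dropped.
    """
    rows = []
    for sample_id in sorted(spectra.keys() & reference.keys()):
        rows.append((sample_id, spectra[sample_id], reference[sample_id]))
    return rows


dataset = build_dataset(spectra, reference)
for sample_id, x, y in dataset:
    print(sample_id, x, y)
```

Only samples measured by both methods end up in the training data, which is why the same physical samples must go through both the reference method and the proxy.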

Near Infrared Spectroscopy
• Once the concept was proven, we needed a system to support
commercial use (i.e. alongside the LIMS)
• Developed S2 (with WEKA interface):
• Used continuously at Hill Laboratories and BLGG
(Holland) since around 2005 – never gone wrong!
• So far it is the best application of the technology that we
have ever come across.
• Faster than wet chemistry
• Predictions can be more accurate
• Large cost savings – multiple analyses per sample

NIR – lessons learned

• A very lightweight input/output solution using a dropbox
methodology was successful because it is transparent and
works seamlessly alongside a LIMS.
• Instrument data is extremely reliable
• In this industry compliance is important, which implies
that a single algorithm is better than choosing the best
method per dataset.
• As data is abundant, models are rebuilt from time to time.
• No facility for users to develop new applications.
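
A drop-folder style of input/output could look roughly like the sketch below. It assumes "dropbox" means a watched input directory and a results directory, which the slides do not spell out. The file layout, the `.csv`/`.result` names and the placeholder `predict` function are all hypothetical and are not a description of S2.

```python
from pathlib import Path


def predict(values):
    """Placeholder model (hypothetical): returns the mean of the inputs."""
    return sum(values) / len(values)


def process_drop_folder(inbox: Path, outbox: Path) -> int:
    """Process every *.csv file in inbox once and write one result file each.

    Each input file is assumed to hold a single line of comma-separated
    numbers. Returns the number of files processed.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(inbox.glob("*.csv")):
        values = [float(v) for v in path.read_text().strip().split(",")]
        result_file = outbox / (path.stem + ".result")
        result_file.write_text(f"{predict(values):.4f}\n")
        path.unlink()  # remove the input so it is not processed twice
        processed += 1
    return processed
```

Calling `process_drop_folder` on a timer is enough to keep the two systems decoupled: the other side only needs to write files into one folder and read results from another.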

Gas Chromatography Mass Spectrometry

• Analytical instrument that combines the features of gas
chromatography and mass spectrometry to identify
different substances within a test sample
• Typical Applications
• Environmental monitoring
• Food and beverage analysis
• Criminal forensics (CSI!)
• Drugs/explosives detection

11

Example Chromatogram (PAH) – ion counts

Machine Learning Approach
• Chromatograms are pre-processed to extract features
• Dataset constructed by combining pre-processed
chromatograms with analyst-checked compound
concentrations
• Learn the relationship between pre-processed
chromatograms and compound concentrations:
• extensive pre-processing of data
• parallel processing – 5000 * 300 values per instance
(NIR = 1000)
• pre-processing varies among compounds
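
To put the instance size in perspective, the arithmetic can be worked through as below. The 5000 * 300 and 1000 figures come from the slide. The 8 bytes per value is an assumption (64-bit floats), and the slide does not say what the two factors denote.

```python
GCMS_VALUES = 5000 * 300   # values per GC-MS instance, as stated on the slide
NIR_VALUES = 1000          # values per NIR instance, as stated on the slide
BYTES_PER_VALUE = 8        # assumption: each value stored as a 64-bit float

ratio = GCMS_VALUES // NIR_VALUES
mb_per_instance = GCMS_VALUES * BYTES_PER_VALUE / 1e6

print(f"{GCMS_VALUES:,} values per GC-MS instance")
print(f"{ratio}x the size of an NIR instance")
print(f"about {mb_per_instance:.0f} MB per instance before pre-processing")
```

Under that assumption a single raw instance is 1.5 million values, or roughly 12 MB, which is why pre-processing and parallel processing matter here.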

Process Requirements

Solution = Advanced DAta Mining System (ADAMS)
• get database IDs of chromatograms
• load chromatograms from DB
• identify and reject outliers
• obtain calibration set information, check correctness of set
• align with calibration chromatogram, check correlation
• compound-specific outlier detection
• generate artificial chromatogram with peaks of compound and
spike compound
• generate output for WEKA
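
The "align with calibration chromatogram, check correlation" step can be illustrated with a generic shift search. This is a sketch of one common technique (pick the shift that maximises Pearson correlation), not the method ADAMS uses, which the slides do not describe. The function names and the toy signals are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    den = (var_a * var_b) ** 0.5
    return num / den if den else 0.0


def best_shift(signal, reference, max_shift):
    """Find the shift s such that signal[i + s] best matches reference[i].

    Tries every shift in [-max_shift, max_shift] and returns
    (shift, correlation) for the highest correlation over the overlap.
    max_shift is assumed to be smaller than the signal length.
    """
    best = (0, -1.0)
    for shift in range(-max_shift, max_shift + 1):
        pairs = [
            (signal[i + shift], reference[i])
            for i in range(len(reference))
            if 0 <= i + shift < len(signal)
        ]
        shifted, ref = zip(*pairs)
        corr = pearson(shifted, ref)
        if corr > best[1]:
            best = (shift, corr)
    return best


# Toy example: the same peak, delayed by two time steps.
calibration = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
measured = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
shift, corr = best_shift(measured, calibration, max_shift=3)
print(shift, round(corr, 3))
```

A low best correlation after alignment is a natural trigger for the "check correlation" part: the chromatogram can be rejected or flagged rather than passed on to the learner.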

Limitations and future directions
• What we have seen so far works with data resident in
memory (RAM) all the time
• This implies a limit can easily be reached, especially in
applications like GCMS.
• We would like to be able to learn from potentially infinite
data sources but with finite memory (RAM).
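
Learning from an unbounded source with fixed memory can be sketched with a model that is updated one example at a time and never stores the data. The example below uses plain stochastic gradient descent on a one-feature linear model. The class name, the learning rate and the synthetic stream are illustrative assumptions, not part of WEKA, ADAMS or MOA.

```python
import itertools


class OnlineLinearRegressor:
    """Fits y ≈ w * x + b one example at a time; memory use is constant."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # One gradient step on the squared error for this single example.
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error


def stream():
    """Hypothetical endless source: noise-free samples of y = 2x + 1."""
    for i in itertools.count():
        x = (i % 10) / 10
        yield x, 2 * x + 1


model = OnlineLinearRegressor()
# Only a finite prefix is consumed here so that the example terminates.
for x, y in itertools.islice(stream(), 5000):
    model.learn_one(x, y)

print(round(model.w, 3), round(model.b, 3))
```

The model holds two numbers regardless of how many examples pass through it, which is the property the last bullet asks for.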

Future Directions

• Investigate how to get users to deploy their own DM
solutions
• Implement incremental pre-processing techniques (Joao
has already started!), e.g. incremental outlier detection.
• Implement incremental algorithms, especially for regression.
• Encourage work on abstention classifiers, uncertainty
associated with point predictions etc.
• Meta-mine which units of a workflow are useful in tandem
• Investigate fusion: ADAMS with MOA, data
(image+features), tasks (multiview, multitask, transfer)
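
As an illustration of what incremental outlier detection might look like, the sketch below keeps a running mean and variance (Welford's update) and flags values far from the mean seen so far. The threshold, the warm-up length and the choice to exclude flagged values from the statistics are assumptions made for the example, not a description of existing work.

```python
class RunningOutlierDetector:
    """Flags values more than `threshold` standard deviations from the
    running mean, using constant memory (Welford's algorithm)."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup  # values to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Return True if x looks like an outlier; otherwise fold it in."""
        if self.n >= self.warmup:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                return True  # flagged values do not update the statistics
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return False


detector = RunningOutlierDetector()
values = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [50.0, 10.0]
flags = [detector.update(v) for v in values]
print([v for v, flagged in zip(values, flags) if flagged])
```

Leaving flagged values out of the running statistics keeps a single extreme reading from inflating the variance, at the cost of adapting more slowly if the data genuinely shifts.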

Finally

Questions or Comments?
