1. Processing Large ToF-SIMS Datasets
Wednesday, 20 September 2017 1
Gustavo Ferraz Trindade, Marie-Laure Abel and John F. Watts
The Surface Analysis Laboratory, University of Surrey, UK
3. ToF-SIMS data is "growing up"
Introduction
In most surface analysis laboratories, ToF-SIMS spectrometers operated in dual-beam
depth-profile mode typically generate hyperspectral image datasets distributed throughout
a 3D cube of more than 256 x 256 x 500 voxels, with each voxel containing from 20,000 to
2,000,000 spectral channels.
4. ToF-SIMS data analysis is "growing up"
Search for the keywords "SIMS" and "PCA" on Web of Science
Introduction
5. Standard approaches: binning voxels and channels, peak picking
Surrey Matlab GUI developed by Gustavo Ferraz Trindade
Introduction
simsMVA
www.mvatools.com
6. New trend in the surface analysis community: processing full datasets
- Random vectors algorithm + GPU
- Focus on PCA only
Introduction
Surf. Interface Anal. 2016, doi:10.1002/sia.6042
Surf. Interface Anal. 2015, doi:10.1002/sia.5800
7. My contribution/objective: perform non-negative matrix
factorisation (NMF, a.k.a. MCR) on unbinned datasets
Introduction
8. Example dataset
Surface segregation of polymer additives X
Large area scan of chemically contaminated fingerprint on silicon wafer
- Great interest from forensics
- Surrey has experience in it
Experimental
Analyst, 2015, 140, 6254
Analyst, 2013, 138, 6246
Surf. Interface Anal. 2010, 42, 826-829
9. Each patch has 100 x 100 pixels (500 x 500 µm²)
20 x 20 patches were acquired over a total area of 1 x 1 cm² (pixel density 0.06 px/µm²)
Each spectrum has 2,000,000 channels
The resulting dataset has 4M x 2M = 8 x 10^12 data points!
Extremely sparse (< 1% non-zero elements)
A great challenge for multivariate analysis
Iontof TOF.SIMS 5
High current bunched mode
25 keV Bi3+
0.3 pA, 10 kHz
Negative secondary ions
10 scans per patch
Experimental
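To put the 4M x 2M figure above in perspective, here is a back-of-the-envelope sketch in Python. The 8 bytes per value and 8 bytes per index are assumptions (double precision and 64-bit indices, ignoring the small column-pointer array), and the 1% density is only the upper bound quoted on the slide, so the sparse figure is a ceiling rather than a measurement.

```python
def storage_bytes(rows, cols, density, value_bytes=8, index_bytes=8):
    """Return (dense_bytes, sparse_bytes) for a rows x cols matrix.

    sparse_bytes assumes one stored value plus one stored index per
    non-zero element (assumed layout, for illustration only).
    """
    elements = rows * cols
    nonzeros = round(elements * density)
    dense = elements * value_bytes
    sparse = nonzeros * (value_bytes + index_bytes)
    return dense, sparse

# 4M pixels x 2M channels, at most 1% non-zero elements
dense, sparse = storage_bytes(4_000_000, 2_000_000, density=0.01)
print(f"dense storage:  {dense / 1e12:.0f} TB")
print(f"sparse storage: {sparse / 1e12:.2f} TB (upper bound at 1% non-zeros)")
```

A dense array is out of reach by several orders of magnitude, which is why sparse allocation is the starting point for everything that follows.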
10. General Raw Data (.GRD)
Columns: scan, x, y, tof
Loaded directly into pre-allocated sparse arrays in Matlab R2016a
The resulting data are arranged in a matrix A of size 4M x 2M
containing the spectra of all 4M pixels, with 2M
spectral channels each.
Processing
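The loading step described above can be sketched as follows. This is an illustrative Python sketch, not the Matlab code used for the talk: the event tuples `(scan, x, y, tof)`, the row-major pixel indexing and the dictionary-based sparse storage are all assumptions made for the example.

```python
from collections import defaultdict

def events_to_sparse(events, width):
    """Accumulate (scan, x, y, tof) events into a sparse count map.

    Returns a dict mapping (pixel_row, channel) -> counts, where
    pixel_row = y * width + x (row-major order, assumed here).
    Counts from different scans of the same pixel are summed.
    """
    counts = defaultdict(int)
    for _scan, x, y, tof in events:
        counts[(y * width + x, tof)] += 1
    return dict(counts)

# Toy example: a 2 x 2 pixel image with a handful of detected ions.
events = [
    (0, 0, 0, 120),  # scan 0, pixel (0, 0), channel 120
    (0, 1, 0, 64),
    (1, 0, 0, 120),  # same pixel and channel, later scan
    (1, 1, 1, 300),
]
A = events_to_sparse(events, width=2)
print(A)
```

Only the non-zero (pixel, channel) pairs are ever stored, so memory use scales with the number of detected ions rather than with the nominal 4M x 2M size.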
11. The method of choice was
non-negative matrix factorisation (NMF), a.k.a. MCR
Multiplicative update algorithms (Lee & Seung, 2001)
Processing
Nature, vol. 401, 1999
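A minimal sketch of multiplicative-update NMF in plain Python is given here for orientation. It uses the Frobenius-norm update rules of Lee & Seung; the slides do not say which cost function or initialisation was used, so those choices, the helper names and the toy data are all assumptions.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Frobenius norm of A - W*H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2
               for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def nmf(A, k, iterations=200, seed=0, eps=1e-12):
    """Factorise non-negative A (n x m) as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
             for rh, rp, rq in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(rw, rp, rq)]
             for rw, rp, rq in zip(W, num, den)]
    return W, H

# Toy data: 4 pixels mixing 2 hypothetical "pure spectra" of 4 channels.
pure = [[5.0, 0.0, 1.0, 0.0], [0.0, 4.0, 0.0, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 1.0]]
A = matmul(weights, pure)
W, H = nmf(A, k=2)
print(frobenius_error(A, W, H))
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative without any explicit constraint handling, which is the main appeal of this family of algorithms.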
12. A (N x M) = W*H (+ error)
    = Weights x Pure spectrum + Weights x Pure spectrum (2 "pure components")
(3 "pure components"), (4 "pure components"), (5 "pure components") and so on...
Processing
13. To overcome time and memory limitations:
Sub-sampling using Sobol sequences
Processing
Surf. Interface Anal. 2016
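The slides do not give the sampling details, so the following is only a sketch of the idea. It relies on the fact that the first dimension of a Sobol sequence is the base-2 van der Corput sequence, which is enough to pick pixel rows that are spread evenly over the dataset. The function names are hypothetical, and a production version would more likely use a library Sobol generator, possibly in 2D over image coordinates.

```python
def van_der_corput(i):
    """Return the i-th point of the base-2 van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while i:
        denom *= 2.0
        value += (i & 1) / denom
        i >>= 1
    return value

def subsample_rows(n_rows, n_samples):
    """Pick n_samples distinct row indices spread evenly over range(n_rows)."""
    if n_samples > n_rows:
        raise ValueError("cannot draw more samples than rows")
    chosen, seen, i = [], set(), 0
    while len(chosen) < n_samples:
        row = int(van_der_corput(i) * n_rows)
        if row not in seen:  # skip duplicates caused by rounding down
            seen.add(row)
            chosen.append(row)
        i += 1
    return chosen

# Sizes taken from the talk: 15,000 spectra drawn from 4M pixels.
rows = subsample_rows(n_rows=4_000_000, n_samples=15_000)
print(rows[:8])
```

Unlike purely random sampling, a low-discrepancy sequence avoids clusters and gaps, so even a small subsample covers the whole field of view.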
14. Results
[Figure: NMF images and spectra for Component 1, Component 2 and Component 3]
Labelled peaks: 265 u (sodium dodecyl sulphate), SO2-, SO3-, SO4H-, C29H28O4-, NaS2O7-, OH-, SiO2-
Data size: 4M x 1.3M
Subsample size: 15,000 x 1.3M
Iterations: 500
Time/iteration: 36 s
FOV: 1 cm x 1 cm
15. Spectrum of a single pixel
Although the dataset has very few counts per pixel,
NMF was successful.
This shows the advantage of performing multivariate
analysis on noisy, very large datasets:
a pixel-by-pixel view does not contain
relevant information, but the dataset as a whole
still has latent structure and can be
factorised without binning.
Results
16. Since the secondary ions analysed were
negatively charged, the Si- and SiO- peaks have
very low intensity.
Even so, NMF managed to separate them
perfectly from the fingerprint signal.
This reinforces the advantage of using unbinned
datasets when it comes to finding hidden
features.
[Figure: spectrum with the Si- and SiO- peaks labelled]
Results
17. Systematic misalignment of ALL peaks for
components 1 and 2
- Topography of the deposited fingerprint?
- Imperfect primary-ion TOF correction?
[Figure: image zoomed in on 9 patches]
Results
SO3-
18. To overcome the misalignment problem
- Better sample preparation
- Review the primary-ion TOF correction
- Data-based-only methods:
  - Align channel by channel to a reference pixel (warping):
    time consuming; the quickest method found takes minutes per spectrum
  - Apply a fixed shift (misalignment due to height differences):
    only a few counts per pixel, so peak positions are impossible to identify
Results
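To illustrate the fixed-shift option listed above, here is a small Python sketch that estimates an integer channel shift by cross-correlation against a reference spectrum and then undoes it. The function names and the toy spectra are assumptions, and the slides do not say how a shift would actually be estimated. The toy spectra are deliberately clean: with only a few counts per pixel the correlation score is dominated by noise, which is the limitation noted above.

```python
def best_shift(spectrum, reference, max_shift):
    """Return the integer shift s (|s| <= max_shift) that maximises
    sum_i spectrum[i + s] * reference[i].

    A positive s means the spectrum sits s channels to the right
    of the reference.
    """
    n = len(reference)

    def score(s):
        return sum(spectrum[i + s] * reference[i]
                   for i in range(n) if 0 <= i + s < n)

    return max(range(-max_shift, max_shift + 1), key=score)

def apply_shift(spectrum, s):
    """Move the spectrum s channels to the left, zero-padding the ends."""
    n = len(spectrum)
    return [spectrum[i + s] if 0 <= i + s < n else 0 for i in range(n)]

reference = [0, 0, 1, 8, 1, 0, 0, 0, 0, 3, 0, 0]
shifted = [0, 0, 0, 0, 1, 8, 1, 0, 0, 0, 0, 3]  # reference moved 2 channels right
s = best_shift(shifted, reference, max_shift=3)
aligned = apply_shift(shifted, s)
print(s, aligned == reference)
```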
19. Third approach for alignment (that would not need good statistics per pixel)
- Perform alignment on NMF components (matrix H) and reconstruct back
NMF: $A = WH + \mathrm{error}$
Alignment: $H_A = H + S$
Reconstruction: $A_A = W H_A + \mathrm{error} = WH + WS + \mathrm{error} = A + WS$
$A_A = A + WS$
The correction matrix for A would be the shift matrix S (obtained from matrix H) weighted by
the relative concentrations of the pure components (matrix W)
- It seems to work with "simulated data", but we are still not sure whether it is mathematically
correct
- Small problem: this would require processing the entire matrix A (no subsampling)
Results
[Diagram: A -> NMF -> W, H -> Alignment -> HA -> Reconstruction -> AA (aligned) -> NMF (again)]
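The reconstruction identity on this slide can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the authors' procedure: the matrices are invented, the error term is taken as zero (A = WH exactly), and "alignment" is modelled as moving the peak of component 2 one channel to the left so that it lines up with component 1.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# 3 pixels x 2 components (weights) and 2 components x 4 channels (spectra)
W = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
H = [[0.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0]]    # same peak, one channel late in component 2
H_A = [[0.0, 4.0, 0.0, 0.0],
       [0.0, 3.0, 0.0, 0.0]]  # component 2 aligned to component 1

S = sub(H_A, H)                # shift matrix, H_A = H + S
A = matmul(W, H)               # original data (error term assumed zero)
A_A = matmul(W, H_A)           # aligned data, reconstructed directly
print(A_A == add(A, matmul(W, S)))
```

Note that S necessarily contains negative entries. When the error term is not zero, A + WS may therefore contain negative values, which is one reason to treat the reconstructed matrix with some care before running NMF on it again.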
20. Results
- It seems to work with "simulated data", but we are still not sure whether it is appropriate
- Small problem: this would require processing the entire matrix A (no subsampling)
[Figure: H matrix before and after alignment; W matrix (overlay of 3 components) before alignment
and after alignment + reconstruction + NMF]
21. Results
[Figure: H matrix and W matrix (overlay of 3 components), before and after alignment]
22. Analyse the full dataset
Data won't fit in a PC's memory: requires a different method
Good approach for NMF of giant sparse
matrices: Map/Reduce
- Introduced by Google in 2004
- Added to Matlab in version R2014b
- Still used in several Big Data
applications
Map/Reduce
OSDI 2004
24. Map/Reduce NMF
- Multiplicative update algorithm
in the map/reduce framework
- Implementation in Matlab R2016a:
a challenge due to the lack of documentation
Map/Reduce
Proceedings of the 19th International Conference on
World Wide Web (WWW '10)
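Why the multiplicative update fits the map/reduce pattern can be seen from the H update, H <- H * (W^T A) / (W^T W H): both W^T A and W^T W are sums of contributions from independent blocks of pixel rows. The sketch below mimics this in memory with Python's built-in `map` and `functools.reduce`. It is neither Matlab's `mapreduce` API nor necessarily the partitioning scheme of the cited paper, and the chunking and toy matrices are assumptions.

```python
from functools import reduce

def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def map_chunk(chunk):
    """Map step: partial products for one block of pixel rows."""
    A_b, W_b = chunk
    Wt = transpose(W_b)
    return matmul(Wt, A_b), matmul(Wt, W_b)  # sizes (k x m) and (k x k)

def reduce_partials(p, q):
    """Reduce step: partial products from different blocks add up."""
    return add(p[0], q[0]), add(p[1], q[1])

def update_H(chunks, H, eps=1e-12):
    """One multiplicative update of H from row blocks of A and W."""
    WtA, WtW = reduce(reduce_partials, map(map_chunk, chunks))
    den = matmul(WtW, H)
    return [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
            for rh, rp, rq in zip(H, WtA, den)]

A = [[5.0, 0.0, 1.0], [0.0, 4.0, 2.0], [2.5, 2.0, 1.5], [10.0, 4.0, 4.0]]
W = [[1.0, 0.2], [0.3, 1.0], [0.6, 0.5], [1.5, 0.9]]
H = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

# Two blocks of two pixel rows each, as if read from disk one at a time.
chunks = [(A[:2], W[:2]), (A[2:], W[2:])]
H_chunked = update_H(chunks, H)
H_full = update_H([(A, W)], H)  # a single block is the ordinary update
print(H_chunked)
```

Only the small k x m and k x k partial results need to be held together in memory. The W update is simpler still, since each block of rows of W depends only on its own block of A and on H.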
25. History of implementations in Matlab
[Figure: time per iteration (4 workers) vs. number of elements and sparsity]
Same dataset: ~10x faster
There is room for
improvement!!
Map/Reduce
26. Comparison between map/reduce and standard NMF
Adhesive sample
Data 32 x 32 x 20,000, 150 iterations, same initial conditions (IC)
[Figure: Map/Reduce vs. Standard]
Map/Reduce
27. Conclusions!?
- ToF-SIMS data will not stop growing, and we have to
consider ways of processing it
- NMF of a large ToF-SIMS dataset has been achieved with
sparse allocation and subsampling
- Hidden features and weak signals can be identified
when unbinned datasets are processed
- For even larger datasets, or to align peaks via reconstruction,
MapReduce may be the way to go:
  - Deals with data in chunks
  - Well-defined framework
  - Easily scalable up to large computer clusters
NPL 3D Nano SIMS