Processing Large ToF-SIMS Datasets

Wednesday, 20 September 2017
Gustavo Ferraz Trindade, Marie-Laure Abel and John F. Watts
The Surface Analysis Laboratory, University of Surrey, UK

Outline
Introduction, Experimental, Processing, Results, Alternative, Conclusions!?

ToF-SIMS data is “growing up”
Introduction
In most surface analysis laboratories, ToF-SIMS spectrometers operated in dual-beam
depth-profile mode will typically generate hyperspectral image datasets distributed
throughout a 3D cube of more than 256 x 256 x 500 voxels, with each voxel containing
from 20,000 to 2,000,000 spectral channels.

ToF-SIMS data analysis is “growing up”
[Figure: search for the keywords “SIMS” and “PCA” on Web of Science]

Standard approaches: binning voxels and channels, peak picking
simsMVA: Surrey Matlab GUI developed by Gustavo Ferraz Trindade
www.mvatools.com

New trend in the surface analysis community: processing full datasets
- Random vectors algorithm + GPU
- Focus on PCA only
Surf. Interface Anal. 2016
DOI: 10.1002/sia.6042
DOI: 10.1002/sia.5800

My contribution/objective: perform Non-negative matrix
factorisation (NMF aka MCR) on unbinned datasets

Example dataset
Surface segregation of polymer additives X
Large area scan of chemically contaminated fingerprint on silicon wafer
- Great interest from forensics
- Surrey has experience in it
Experimental
Analyst, 2015, 140, 6254
Analyst, 2013, 138, 6246
Surf. Interface Anal. 2010, 42, 826–829

Each patch has 100 x 100 pixels (500 x 500 µm²)
20 x 20 patches were acquired over a total area of 1 x 1 cm² (pixel density 0.06 px/µm²)
Each spectrum has 2,000,000 channels
The resulting dataset has 4M x 2M = 8 x 10^12 data points!
Extremely sparse (< 1% non-zero elements)
A great challenge for multivariate analysis
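The scale quoted above can be checked with a few lines of arithmetic. The sketch below compares dense and sparse storage for a 4M x 2M matrix. The byte sizes and the 0.001% fill fraction are assumptions chosen for illustration; the slides only state that fewer than 1% of the elements are non-zero.

```python
# Back-of-envelope storage estimate for the pixel-by-channel data matrix.
# Assumptions (not from the slides): 8 bytes per stored value and 8 bytes per
# stored row or column index in a coordinate-style sparse layout.
n_pixels = 4_000_000      # rows: one spectrum per pixel
n_channels = 2_000_000    # columns: spectral channels per spectrum
n_elements = n_pixels * n_channels

BYTES_PER_VALUE = 8
BYTES_PER_INDEX = 8

dense_bytes = n_elements * BYTES_PER_VALUE


def sparse_bytes(fill_fraction: float) -> float:
    """Bytes needed to store only the non-zeros: value + row index + column index."""
    nnz = n_elements * fill_fraction
    return nnz * (BYTES_PER_VALUE + 2 * BYTES_PER_INDEX)


print(f"elements: {n_elements:.1e}")
print(f"dense storage: {dense_bytes / 1e12:.0f} TB")
# Hypothetical fill fraction of 0.001 %, used only to show the order of magnitude.
print(f"sparse storage at 0.001% fill: {sparse_bytes(1e-5) / 1e9:.1f} GB")
```

Under these assumptions a dense array is far beyond the memory of a workstation, while a sparse layout can fit if the fill fraction is low enough.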
Iontof TOF.SIMS 5
High current bunched mode
25 keV Bi3+
0.3 pA, 10 kHz
Negative secondary ions
10 scans per patch

General Raw Data (.GRD): one record per detected ion, with fields scan, x, y, tof
Records are loaded directly into pre-allocated sparse arrays in Matlab 2016a
The resulting data are arranged in a matrix A of size 4M x 2M, containing the
spectrum of each of the 4M pixels, with 2M spectral channels each.
Processing
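As a minimal sketch of the loading step, the function below accumulates (scan, x, y, tof) records into a sparse pixel-by-channel count table. It assumes the records have already been decoded into integer tuples, that tof is a channel index, that scans are summed, and that pixels are numbered row by row. The binary layout of the .GRD file is not covered, and the function name is hypothetical. The original work used Matlab sparse arrays; plain Python dictionaries stand in for them here.

```python
from collections import defaultdict


def events_to_sparse(events, width, height, n_channels):
    """Accumulate (scan, x, y, tof) events into {(pixel_row, channel): counts}.

    Assumptions for this sketch: counts from all scans are summed, and the
    pixel at (x, y) maps to matrix row y * width + x.
    """
    counts = defaultdict(int)
    for scan, x, y, tof in events:
        if not (0 <= x < width and 0 <= y < height and 0 <= tof < n_channels):
            raise ValueError(f"event out of range: {(scan, x, y, tof)}")
        counts[(y * width + x, tof)] += 1
    return dict(counts)


# Hypothetical events on a 2 x 2 pixel image with 10 channels.
events = [(0, 0, 0, 5), (1, 0, 0, 5), (0, 1, 0, 7), (0, 0, 1, 5)]
A_sparse = events_to_sparse(events, width=2, height=2, n_channels=10)
print(A_sparse)
```

Only channels that actually received counts are stored, which is what makes a matrix of this size tractable.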

The method of choice was non-negative matrix factorisation (NMF), a.k.a. MCR
Multiplicative update algorithms (Lee & Seung, 2001)
Nature, vol. 401, 1999
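For readers unfamiliar with the multiplicative update rules, a small sketch follows. It uses the Frobenius-norm form of the updates, H <- H * (W^T A) / (W^T W H) and W <- W * (A H^T) / (W H H^T), on plain Python lists. The random initialisation, the small eps added to the denominators, and the toy matrix are assumptions for illustration and are not taken from the slides, whose implementation was in Matlab.

```python
import random


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def frobenius_error(A, W, H):
    """Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf(A, k, n_iter=200, eps=1e-12, seed=0):
    """Factorise non-negative A (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(hr, pr, qr)]
             for hr, pr, qr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(wr, pr, qr)]
             for wr, pr, qr in zip(W, num, den)]
    return W, H


# Hypothetical rank-2 toy matrix (row 3 = row 1 + row 2, row 4 = 2 x row 1).
A = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [1.0, 3.0, 3.0], [2.0, 0.0, 4.0]]
W, H = nmf(A, k=2)
print(f"reconstruction error: {frobenius_error(A, W, H):.4f}")
```

Because every factor in the update is non-negative, W and H stay non-negative without any explicit constraint handling.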

A (N x M) = W*H (+ error)
W: weights of each “pure component”
H: pure spectrum of each “pure component”
The factorisation can use 2, 3, 4, 5 “pure components”, and so on…

To overcome time and memory limitations:
Sub-sampling using Sobol sequences
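The slides do not say how Sobol points were mapped to pixels, so the following is only one possible sketch. It uses the first dimension of the Sobol sequence, which coincides with the base-2 van der Corput (radical inverse) sequence, to pick evenly spread, non-repeating row indices from the data matrix. The function names are hypothetical.

```python
def radical_inverse_base2(i: int) -> float:
    """Return the i-th point of the base-2 van der Corput sequence in [0, 1)."""
    result, frac = 0.0, 0.5
    while i > 0:
        if i & 1:
            result += frac
        i >>= 1
        frac *= 0.5
    return result


def quasi_random_rows(n_rows: int, n_samples: int) -> list:
    """Pick n_samples distinct row indices spread quasi-uniformly over n_rows."""
    if not 0 <= n_samples <= n_rows:
        raise ValueError("n_samples must be between 0 and n_rows")
    chosen, seen, i = [], set(), 0
    while len(chosen) < n_samples:
        row = int(radical_inverse_base2(i) * n_rows)
        if row not in seen:          # skip rows that were already selected
            seen.add(row)
            chosen.append(row)
        i += 1
    return chosen


print(quasi_random_rows(n_rows=16, n_samples=4))  # [0, 8, 4, 12]
```

Compared with pseudo-random sampling, a low-discrepancy sequence avoids clusters and gaps, so a small subsample still covers the whole analysed area.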

Results
[Figure: images and spectra of NMF components 1, 2 and 3. Labelled peaks: SO2-, SO3-, SO4H-, C29H28O4-, NaS2O7-, OH-, SiO2-. 265 u: sodium dodecyl sulphate]
Data size: 4M x 1.3M
Subsample size: 15,000 x 1.3M
Iterations: 500
Time/iteration: 36 s
FOV: 1 cm x 1 cm
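The run parameters listed above imply a total processing time and a subsampling fraction. This short calculation uses only the numbers on the slide; the variable names are illustrative.

```python
# Values taken from the slide.
iterations = 500
seconds_per_iteration = 36
subsample_rows = 15_000
total_rows = 4_000_000

total_hours = iterations * seconds_per_iteration / 3600
fraction = subsample_rows / total_rows

print(f"total NMF time: {total_hours:.1f} h")
print(f"fraction of pixels used: {fraction:.3%}")
```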

[Figure: spectrum of a single pixel]
Although the dataset has very few counts per pixel, NMF was successful.
This shows the advantage of performing multivariate analysis on noisy, very
large datasets: a pixel-by-pixel view does not contain relevant information,
but the dataset as a whole still has latent structure and can be factorised
without binning.

Since the secondary ions analysed were negatively charged, the Si- and SiO-
peaks have very low intensity.
Even so, NMF separated them cleanly from the fingerprint signal.
This reinforces the advantage of using unbinned datasets when it comes to
finding hidden features.
[Figure labels: Si-, SiO-]

Systematic misalignment of ALL peaks for components 1 and 2
- Topography of the deposited fingerprint?
- Imperfect primary-ion ToF correction?
[Figure: image zoomed in on 9 patches, SO3- peak]

To overcome the misalignment problem:
- Better sample preparation
- Review the primary-ion ToF correction
- Data-only methods:
  - Align channel by channel to a reference pixel (warping).
    Time consuming: the quickest method found takes minutes per spectrum.
  - Apply a fixed shift (misalignment due to height differences).
    With only a few counts per pixel, it is impossible to identify peak positions.

Third approach for alignment (does not need good statistics per pixel)
- Perform the alignment on the NMF components (matrix H) and reconstruct the data
NMF: $A = WH + \mathrm{error}$
Alignment: $H_A = H + S$
Reconstruction: $A_A = W H_A + \mathrm{error} = WH + WS + \mathrm{error} = A + WS$
so that $A_A = A + WS$
The correction matrix for A is the shift matrix S (obtained from matrix H) weighted by
the relative concentrations of the pure components (matrix W)
- It seems to work with “simulated data”, but we are still not sure whether it is
mathematically correct
- Small problem: this requires processing the entire matrix A (no subsampling)
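The reconstruction step can be illustrated on a tiny noise-free example. The sketch assumes integer channel shifts, zero error and made-up values for W and H. It only checks the algebraic identity W(H + S) = A + WS and does not address whether the approach is valid for real, noisy data.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def shift_row(row, k):
    """Shift a spectrum by k channels (negative k shifts left), padding with zeros."""
    out = [0] * len(row)
    for i, v in enumerate(row):
        if 0 <= i + k < len(row):
            out[i + k] = v
    return out


# Hypothetical factors: 3 pixels, 2 components, 5 channels.
W = [[2, 0], [0, 3], [1, 1]]
H = [[0, 4, 0, 1, 0],       # component 1: peaks at channels 1 and 3
     [0, 0, 6, 0, 2]]       # component 2: same peaks, displaced by one channel
A = matmul(W, H)            # noise-free data, so A = WH exactly

# Alignment: shift component 2 back by one channel.
H_aligned = [H[0], shift_row(H[1], -1)]
S = [[ha - h for ha, h in zip(ra, r)] for ra, r in zip(H_aligned, H)]

# Reconstruction: A_A = A + WS.
WS = matmul(W, S)
A_aligned = [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(A, WS)]

print(A_aligned == matmul(W, H_aligned))  # True
```

Note that S contains negative entries, so the correction removes counts from the old peak positions as well as adding them at the new ones.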
[Figure: workflow A → NMF → W, H → alignment → HA → reconstruction → AA (aligned) → NMF (again)]

[Figure: H matrix before and after alignment; W matrix (overlay of 3 components) before alignment and after alignment + reconstruction + NMF]

[Figure: H matrix and W matrix (overlay of 3 components), before and after alignment]

Map/Reduce
Analyse the full dataset
The data won’t fit in a PC’s memory: a different method is required
A good approach for NMF of giant sparse matrices: Map/Reduce
- Introduced by Google in 2004 (OSDI 2004)
- Added to Matlab in version R2014b
- Still used in several Big Data applications


- Map/Reduce NMF
- Multiplicative update algorithm in a map/reduce framework
- Implementation in Matlab R2016a: challenging due to lack of documentation
Proceedings of the 19th International Conference on World Wide Web (WWW '10)
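To show why the multiplicative update fits a map/reduce pattern, here is a sketch of the H update computed over blocks of rows. Each block contributes partial products W^T A and W^T W (map), and the partial products are summed (reduce). This illustrates the general idea only. It is not the author's Matlab implementation and is not claimed to reproduce the scheme of the cited paper; the function names and toy values are hypothetical.

```python
from functools import reduce


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def map_block(block):
    """Map: for one block of rows, emit the partial products W^T A and W^T W."""
    A_block, W_block = block
    Wt = transpose(W_block)
    return matmul(Wt, A_block), matmul(Wt, W_block)


def reduce_partials(left, right):
    """Reduce: sum the partial products coming from two blocks."""
    return add(left[0], right[0]), add(left[1], right[1])


def update_H(blocks, H, eps=1e-12):
    """One multiplicative update H <- H * (W^T A) / (W^T W H) from row blocks."""
    WtA, WtW = reduce(reduce_partials, map(map_block, blocks))
    den = matmul(WtW, H)
    return [[h * p / (q + eps) for h, p, q in zip(hr, pr, qr)]
            for hr, pr, qr in zip(H, WtA, den)]


# Hypothetical toy data: 4 pixels, 3 channels, 2 components, split into 2 blocks.
A = [[1, 0, 2], [0, 3, 1], [1, 3, 3], [2, 0, 4]]
W = [[1, 1], [2, 1], [1, 3], [1, 2]]
H = [[1, 2, 1], [2, 1, 1]]
blocks = [(A[:2], W[:2]), (A[2:], W[2:])]

H_chunked = update_H(blocks, H)
H_direct = update_H([(A, W)], H)   # same update with everything in one block
print(H_chunked)
```

Only the small k x M and k x k partial products need to be held together in memory, so the row blocks of A can be read from disk or spread across workers. The W update is simpler still, because each row of W depends only on the matching row of A.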

History of implementations in Matlab
[Figure: time per iteration (4 workers) x number of elements x sparsity]
Same dataset: ~10x faster
There is room for improvement!

Comparison between map/reduce and standard NMF
Adhesive sample
Data: 32 x 32 x 20,000; 150 iterations; same initial conditions (IC)
[Figure: Map/Reduce result next to the standard NMF result]

Conclusions!?
- ToF-SIMS data will not stop growing, and we have to consider ways of processing it
- NMF of a large ToF-SIMS dataset has been achieved with sparse allocation and subsampling
- Hidden features and weak signals can be identified when unbinned datasets are processed
- For even larger datasets, or to align peaks via reconstruction, MapReduce may be a way to go:
  - Deals with data in chunks
  - Well-defined framework
  - Easily scalable up to large computer clusters
NPL 3D Nano SIMS
