This document discusses processing large time-of-flight secondary ion mass spectrometry (ToF-SIMS) datasets. ToF-SIMS spectrometers can generate hyperspectral image datasets containing millions of voxels and spectral channels. Non-negative matrix factorization (NMF) was performed on an unbinned ToF-SIMS dataset containing over 8 trillion data points from a fingerprint contamination sample. Subsampling was used to address memory limitations. NMF separated fingerprint components from silicon peaks and identified systematic peak misalignments. MapReduce processing is proposed for even larger datasets to distribute computations across nodes.
1. Processing Large ToF-SIMS Datasets
Wednesday, 20 September 2017
Gustavo Ferraz Trindade, Marie-Laure Abel and John F. Watts
The Surface Analysis Laboratory, University of Surrey, UK
3. ToF-SIMS data is “growing up”
Introduction
In most surface analysis laboratories, ToF-SIMS spectrometers operated in dual-beam depth-profile mode typically generate hyperspectral image datasets distributed throughout a 3D cube of more than 256 x 256 x 500 voxels, with each voxel containing from 20,000 to 2,000,000 spectral channels.
4. ToF-SIMS data analysis is “growing up”
Search for the keywords “SIMS” and “PCA” on Web of Science
Introduction
5. Standard approaches: binning voxels and channels, peak picking
Surrey Matlab GUI developed by Gustavo Ferraz Trindade: simsMVA (www.mvatools.com)
Introduction
6. New trend in the surface analysis community: processing full datasets
- Random vectors algorithm + GPU
- Focus on PCA only
Introduction
Surf. Interface Anal. 2016, 10.1002/sia.6042
Surf. Interface Anal. 2015, 10.1002/sia.5800
7. My contribution/objective: perform non-negative matrix factorisation (NMF, a.k.a. MCR) on unbinned datasets
Introduction
8. Example dataset
Surface segregation of polymer additives X
Large area scan of chemically contaminated fingerprint on silicon wafer
- Great interest from forensics
- Surrey has experience in it
Experimental
Analyst, 2015, 140, 6254
Analyst, 2013, 138, 6246
Surf. Interface Anal. 2010, 42, 826–829
9. Experimental
Each patch has 100 x 100 pixels (500 x 500 µm²)
20 patches were acquired over a total area of 1 x 1 cm² (pixel density 0.06 px/µm²)
Each spectrum has 2,000,000 channels
The resulting dataset has 4M x 2M = 8 x 10^12 data points!
Extremely sparse (< 1% non-zero elements)
A great challenge for multivariate analysis
Instrument: Iontof TOF.SIMS 5
High-current bunched mode
25 keV Bi3+
0.3 pA, 10 kHz
Negative secondary ions
10 scans per patch
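To see why a dense representation is out of reach, the figures on this slide can be turned into a rough memory estimate. The sketch below assumes 8-byte double-precision values and, for the sparse case, 16 bytes per stored non-zero (an 8-byte value plus one 8-byte index). Both are assumptions about the storage format, not figures from the slides, and the 1% fill is only the upper bound quoted above.

```python
# Rough memory estimate for the 4M x 2M dataset described on this slide.
# Assumptions (not from the slides): 8-byte doubles, and 16 bytes per
# stored non-zero in a sparse format (8-byte value + 8-byte index).
N_PIXELS = 4_000_000
N_CHANNELS = 2_000_000
MAX_FILL_PERCENT = 1  # "< 1% non-zero elements", used as an upper bound

n_elements = N_PIXELS * N_CHANNELS
dense_bytes = n_elements * 8
max_nonzeros = n_elements * MAX_FILL_PERCENT // 100
sparse_bytes_upper = max_nonzeros * 16

TB = 10**12  # decimal terabyte
print(f"elements:           {n_elements:.1e}")
print(f"dense storage:      {dense_bytes / TB:.0f} TB")
print(f"sparse upper bound: {sparse_bytes_upper / TB:.2f} TB")
```

Under these assumptions a dense array would need about 64 TB, against at most about 1.3 TB in sparse form, which is why the data are loaded into sparse arrays and then sub-sampled.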
10. General Raw Data (.GRD)
Fields: Scan, x, y, tof
Loaded directly into pre-allocated sparse arrays in Matlab R2016a
The resulting data is arranged in a matrix A of size 4M x 2M, containing the spectra of all 4M pixels with 2M spectral channels each.
Processing
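The loading step can be illustrated with a small Python sketch. It is not the Matlab code used for the talk: the binary layout of a .GRD file is not described on the slide, so events are assumed to arrive as `(scan, x, y, tof)` integer tuples. The row ordering `y * width + x` and the function name `events_to_sparse` are hypothetical, and a `Counter` keyed by (row, channel) stands in for a pre-allocated sparse array. Scans are summed, which is consistent with A having one row per pixel.

```python
from collections import Counter


def events_to_sparse(events, width, height, n_channels):
    """Sum ion events into sparse counts keyed by (pixel_row, channel).

    events: iterable of (scan, x, y, tof) integer tuples (assumed layout).
    The scan index is read but not stored, so all scans are summed.
    """
    counts = Counter()
    for _scan, x, y, tof in events:
        if not (0 <= x < width and 0 <= y < height and 0 <= tof < n_channels):
            raise ValueError(f"event out of range: x={x}, y={y}, tof={tof}")
        counts[(y * width + x, tof)] += 1  # one row of A per pixel
    return counts


# Hypothetical events for a 3 x 2 pixel image with 10 channels.
events = [(0, 0, 0, 5), (0, 1, 0, 5), (1, 0, 0, 5), (1, 2, 1, 7)]
A = events_to_sparse(events, width=3, height=2, n_channels=10)
fill = len(A) / (3 * 2 * 10)
print(dict(A), f"fill = {fill:.1%}")
```

Only the non-zero entries are ever stored, so memory grows with the number of detected ions rather than with the 4M x 2M shape of A.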
11. The method of choice was non-negative matrix factorisation (NMF), a.k.a. MCR
Multiplicative update algorithms (Lee & Seung, 2001)
Processing
Nature, vol. 401, 1999
12. $A_{N \times M} = WH + \text{error}$
For 2 “pure components”: A = (weights x pure spectrum) + (weights x pure spectrum) (+ error). The same holds for 3, 4, 5 “pure components”, and so on…
Processing
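A minimal sketch of the factorisation A ≈ WH follows, using the Lee & Seung multiplicative updates for the Frobenius-norm objective. The slides do not say which objective or initialisation was used, so the random initialisation, the small constant `eps` and the function name `nmf_multiplicative` are assumptions. The data are synthetic and dense purely for illustration.

```python
import numpy as np


def nmf_multiplicative(A, k, n_iter=500, eps=1e-9, seed=0):
    """Factorise non-negative A (n x m) into W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps  # weights of each component per pixel
    H = rng.random((k, m)) + eps  # "pure spectrum" of each component
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic, noise-free data built from 2 known components.
rng = np.random.default_rng(1)
A = rng.random((50, 2)) @ rng.random((2, 30))

W, H = nmf_multiplicative(A, k=2)
rel_error = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_error:.3e}")
```

Each row of H plays the role of a pure spectrum and each column of W holds the corresponding weights, so choosing `k` corresponds to choosing the number of pure components.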
13. To overcome time and memory limitations: sub-sampling using Sobol sequences
Processing
Surf. Interface Anal. 2016
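The slide does not spell out how the Sobol points were mapped to pixels, so the sketch below is a simplified stand-in. It uses the base-2 van der Corput sequence, which coincides with the first dimension of a Sobol sequence, to pick evenly spread row indices of A. The function names and the mapping from points to rows are assumptions.

```python
def van_der_corput(i):
    """i-th point of the base-2 van der Corput sequence, in [0, 1)."""
    value, denom = 0.0, 1.0
    while i:
        denom *= 2.0
        value += (i & 1) / denom  # mirror the bits of i about the binary point
        i >>= 1
    return value


def subsample_rows(n_rows, n_samples):
    """Pick n_samples distinct row indices spread evenly over range(n_rows)."""
    if n_samples > n_rows:
        raise ValueError("cannot draw more samples than rows")
    chosen, seen, i = [], set(), 0
    while len(chosen) < n_samples:
        row = int(van_der_corput(i) * n_rows)
        if row not in seen:  # skip the rare repeated index
            seen.add(row)
            chosen.append(row)
        i += 1
    return chosen


# Sizes quoted in the talk: 15,000 spectra out of 4M pixels.
rows = subsample_rows(n_rows=4_000_000, n_samples=15_000)
print(rows[:4])
```

Unlike uniform random sampling, a low-discrepancy sequence fills the index range progressively without clustering, so any prefix of `rows` covers the whole image roughly evenly.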
14. Results
[Figure: Components 1, 2 and 3. Labelled peaks: 265 u (sodium dodecyl sulphate), SO2-, SO3-, SO4H-, C29H28O4-, NaS2O7-, OH-, SiO2-]
Data size: 4M x 1.3M
Subsample size: 15,000 x 1.3M
Iterations: 500
Time/iteration: 36 s
FOV: 1 cm x 1 cm
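As a quick worked calculation, the figures quoted above give the total run time and the fraction of pixels actually used. The snippet only restates the slide's numbers; nothing is measured here.

```python
# Figures quoted on the slide.
N_PIXELS = 4_000_000
SUBSAMPLE = 15_000
ITERATIONS = 500
SECONDS_PER_ITERATION = 36

total_hours = ITERATIONS * SECONDS_PER_ITERATION / 3600
sampled_percent = 100 * SUBSAMPLE / N_PIXELS
print(f"total run time: {total_hours:.1f} h")
print(f"pixels sampled: {sampled_percent:.3f} %")
```

The factorisation therefore took about 5 hours while using 0.375% of the pixels.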
15. Spectrum of a single pixel
Although the dataset has very few counts per pixel, NMF was achieved successfully.
This is an advantage of performing multivariate analysis on noisy, very large datasets: a pixel-by-pixel view does not contain relevant information, but the data as a whole still has latent structure and can be factorised without binning.
Results
16. Since the secondary ions analysed were negatively charged, the Si- and SiO- peaks have very low intensity.
Even so, NMF managed to separate them perfectly from the fingerprint signal.
This reinforces the advantage of using unbinned datasets when it comes to finding hidden features.
[Figure: Si- and SiO- peaks]
Results
17. Systematic misalignment of ALL peaks for components 1 and 2
- Topography of the deposited fingerprint?
- Imperfect primary-ion ToF correction?
[Figure: image zoomed in on 9 patches; SO3- peak]
Results
18. To overcome the misalignment problem
- Better sample preparation
- Review the primary-ion ToF correction
- Data-only methods:
  - Align channel by channel to a reference pixel (warping). Time consuming: the quickest method found takes minutes per spectrum.
  - Apply a fixed shift (misalignment due to height differences). With only a few counts per pixel, it is impossible to identify peak positions.
Results
19. Third approach for alignment (which would not need good statistics per pixel)
- Perform the alignment on the NMF components (matrix H) and reconstruct the data
NMF: $A = WH + \text{error}$
Alignment: $H_A = H + S$
Reconstruction: $A_A = W H_A + \text{error} = WH + WS + \text{error} = A + WS$
The correction matrix for A is therefore the shift matrix S (obtained from matrix H) weighted by the relative concentrations of the pure components (matrix W).
- It seems to work with “simulated data”, but we are still not sure whether it is mathematically correct
- Small problem: this would require processing the entire matrix A (no subsampling)
[Flow diagram: A → NMF → W, H → alignment → H_A → reconstruction → A_A (aligned) → NMF (again)]
Results
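The reconstruction idea can be tried on simulated, noise-free data. The slides do not say how the shift matrix S is obtained, so the sketch below assumes each component spectrum is shifted circularly by the lag that maximises its cross-correlation with a reference component. That choice, and the helper names `align_to_reference` and `peak`, are hypothetical. The final check only confirms the algebraic identity A_A = A + WS = W H_A in the error-free case. It does not settle whether the approach is sound for real data.

```python
import numpy as np


def align_to_reference(spectrum, reference):
    """Circularly shift `spectrum` so that it best overlaps `reference`."""
    corr = np.correlate(spectrum, reference, mode="full")
    lag = int(corr.argmax()) - (len(reference) - 1)  # > 0: spectrum is late
    return np.roll(spectrum, -lag), lag


channels = np.arange(200)


def peak(centre, width=2.0):
    """Gaussian peak on the channel axis (simulated data)."""
    return np.exp(-0.5 * ((channels - centre) / width) ** 2)


# Two components with the same peak; component 2 arrives 3 channels late.
H = np.vstack([peak(100), peak(103)])
W = np.random.default_rng(0).random((50, 2))
A = W @ H  # noise-free simulated dataset

H_A = H.copy()
H_A[1], lag = align_to_reference(H[1], H[0])

S = H_A - H      # shift matrix: H_A = H + S
A_A = A + W @ S  # aligned data: A_A = A + WS
print(lag, np.allclose(A_A, W @ H_A))
```

Because the correction is applied through W, every pixel receives a shift weighted by how much of each component it contains, which is why no per-pixel peak picking is needed.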
20. Results
- It seems to work with “simulated data”, but we are still not sure whether it is appropriate
- Small problem: this would require processing the entire matrix A (no subsampling)
[Figure: H matrix before and after alignment; W matrix (overlay of 3 components) before alignment and after alignment + reconstruction + NMF]
21. Results
[Figure: H matrix and W matrix (overlay of 3 components), before and after alignment]
22. Analyse the full dataset
The data won’t fit in a PC’s memory, so a different method is required
A good approach for NMF of giant sparse matrices: Map/Reduce
- Introduced by Google in 2004 (OSDI 2004)
- Added to Matlab in version R2014b
- Still used in several Big Data applications
Map/Reduce
24. Map/Reduce NMF
- Multiplicative update algorithm in the map/reduce framework
- Implementation in Matlab R2016a: a challenge due to the lack of documentation
Map/Reduce
Proceedings of the 19th International Conference on World Wide Web (WWW ’10)
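One way to cast a multiplicative-update iteration in map/reduce form is to split A (and W) into blocks of rows. The products WᵀA and WᵀW needed for the H update are sums over rows, so each block can contribute a partial result (map) and the partials can simply be added (reduce). The W update is then local to each block. The sketch below simulates this in a single process with Python's built-in `map` and `functools.reduce`. It is not the Matlab implementation from the talk, and the row-block partitioning and function names are assumptions.

```python
from functools import reduce

import numpy as np


def map_chunk(A_chunk, W_chunk):
    """Map: partial sums for the H update from one block of rows."""
    return W_chunk.T @ A_chunk, W_chunk.T @ W_chunk


def add_partials(left, right):
    """Reduce: partial results combine by plain addition."""
    return left[0] + right[0], left[1] + right[1]


def nmf_iteration_chunked(A_chunks, W_chunks, H, eps=1e-9):
    """One multiplicative-update iteration over row blocks of A."""
    WtA, WtW = reduce(add_partials, map(map_chunk, A_chunks, W_chunks))
    H_new = H * WtA / (WtW @ H + eps)
    HHt = H_new @ H_new.T  # small k x k matrix shared by all blocks
    W_new = [Wc * (Ac @ H_new.T) / (Wc @ HHt + eps)
             for Ac, Wc in zip(A_chunks, W_chunks)]
    return W_new, H_new


rng = np.random.default_rng(0)
A = rng.random((12, 8))
W = rng.random((12, 2))
H = rng.random((2, 8))

W_chunks_new, H_new = nmf_iteration_chunked(
    np.array_split(A, 4), np.array_split(W, 4), H)

# Reference: the same iteration with everything held in memory at once.
H_ref = H * (W.T @ A) / (W.T @ W @ H + 1e-9)
W_ref = W * (A @ H_ref.T) / (W @ H_ref @ H_ref.T + 1e-9)
print(np.allclose(H_new, H_ref),
      np.allclose(np.vstack(W_chunks_new), W_ref))
```

Only k x M and k x k partial results travel between workers, so no single node ever needs to hold the whole of A.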
25. History of implementations in Matlab
[Figure: time per iteration (4 workers) x number of elements x sparsity, same dataset]
~10x faster
There is room for improvement!
Map/Reduce
26. Comparison between map/reduce and standard NMF
Adhesive sample
Data: 32 x 32 x 20,000; 150 iterations; same IC
[Figure: Map/Reduce vs. standard NMF results]
Map/Reduce
27. Conclusions!?
ToF-SIMS data will not stop growing, and we have to consider ways of processing it
NMF of a large ToF-SIMS dataset has been achieved with sparse allocation and subsampling
Hidden features and weak signals can be identified when unbinned datasets are processed
For even larger datasets, or to align peaks via reconstruction, MapReduce may be the way to go:
- Deals with data in chunks
- Well-defined framework
- Easily scalable up to large computer clusters
NPL 3D Nano SIMS