Gooey data sets: high throughput structural data from complex amphiphiles
Charlotte Broadbent
Civil Engineering, Columbia University, New York, NY 10027
Elaine DiMasi
Photon Sciences Department, Brookhaven National Lab, Upton, NY 11973
Abstract
Environmental awareness among consumers has prompted many industries, including
cosmetic companies, to turn to “green” alternatives for their products. Complex amphiphiles,
self-assembled structures of surfactant molecules, give products such as shampoo and liquid
hand soap their physical properties; new formulations can be both physically benign and
environmentally safe to synthesize. In July 2012, in order to determine the molecular properties
of certain complex amphiphiles under varying conditions, small angle x-ray scattering (SAXS)
patterns of over 150 samples were obtained at the National Synchrotron Light Source (NSLS)
X6B beamline. Current techniques for analyzing such data, which treat one data frame at a time
by hand, are too slow for the volume of data that was acquired, so a much faster method is
necessary. This project created a method for consistently treating, visualizing, and statistically
analyzing these large data sets. Using tools such as Python, we developed large-scale processes
in Datasqueeze, a program designed specifically for SAXS images, and in MATLAB, whose
graphing and imaging capabilities we used. This will enable us to analyze the July 2012 data in
an automatic, time-efficient manner. Such techniques can then be applied not only to studies of
complex amphiphiles, but also to any other SAXS study at NSLS-II or a similar facility.
I. Introduction
Over the past decade, many industries have turned to “green” alternatives for their
products in response to increased environmental awareness among consumers. Lubrizol Corp. is
no exception. It produces chemicals that are sold to companies in the personal care industry for
use in shampoos, liquid hand soaps, and similar products. Complex amphiphiles, self-assembled
structures of surfactant molecules, give these products several physicochemical properties, such
as rheology and clarity, and could reduce skin and eye irritation. To test new formulations of
complex amphiphiles that are both physically benign and environmentally safe to synthesize,
Lubrizol was interested in using small angle x-ray scattering (SAXS) imaging, a fast method of
determining the structure, and by extension the physical properties, of the new formulations. In
July 2012, representatives from Lubrizol took SAXS images of over 150 samples of complex
amphiphiles at the NSLS X6B beamline; this high-throughput data collection was enabled by
well-plates (see Figure 1), which could hold many samples at once during automated scans.
Figure 1. Example of a well-plate used in SAXS scans. Holes in the Teflon block are covered by mylar film to
contain liquid and gel samples, then plates clamp the assembly together.
However, the July 2012 experiments produced too much data for current techniques to
handle. These techniques involve manually unwarping each individual frame (a process that
corrects for distortion from the detector), extracting the relevant information, and then using
that information. Not only were there over 150 samples¹, but the experiments produced over
300 data frames, in part due to the beamline technology: automated scans would scan every
well in a well-plate (or row of a well-plate), regardless of content (see Figure 1). As a result, the
data from this experiment has been left largely untouched because of the huge amount of time
that would have been necessary to analyze it thoroughly.
The main focus of this project was to develop a method to handle the large number of
SAXS images from the July 2012 Lubrizol experiments, a process that we hope can be applied
to other SAXS experiments involving high-throughput data.
II. Methods
Thorough analysis of the data first required thorough knowledge of the data. Before any
automated programs were created, I made a spreadsheet so that most of the information was in
the same place. The first column contained the file name; the second, a 1 if the file corresponded
to a sample, a 0 if the file was a background; and the third, the sample name if the previous
column contained a 1. Several other columns were also added, but were to be filled later with the
help of Python.
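The three-column layout described above can be sketched in Python as follows. This is only an illustration, assuming the spreadsheet is exported as CSV; the column headers (`file`, `is_sample`, `sample`), the file names, and the sample labels are hypothetical and do not come from the original data.

```python
import csv
import io

def read_master_sheet(handle):
    """Parse rows of (file name, sample flag, sample name) into dictionaries."""
    records = []
    for row in csv.DictReader(handle):
        is_sample = row["is_sample"].strip() == "1"
        records.append({
            "file": row["file"].strip(),
            "is_sample": is_sample,
            # A sample name is only meaningful when the flag is 1.
            "sample": row["sample"].strip() if is_sample else None,
        })
    return records

# Hypothetical file names and sample labels, for illustration only.
example = io.StringIO(
    "file,is_sample,sample\n"
    "frame_001.tif,0,\n"
    "frame_002.tif,1,formulation A\n"
)
records = read_master_sheet(example)
```

Keeping the flag and the name in separate columns means background frames can be filtered with a single test on `is_sample`.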
Secondly, since the detector used to take this data did not unwarp the data automatically,
we had to decide whether unwarping was necessary. To do this, plots of intensity versus q were
made for two frames of silver behenate (used to calibrate detector position in the Datasqueeze²
software; see Figure 2): one of raw data, and one that had been unwarped. Significant
differences between the two graphs at higher q told us that unwarping was necessary (see
Figure 3). We then used Python to automatically unwarp all of the files, by way of a special
script created for the beamline.
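As a rough illustration of why silver behenate is useful for this check, its Bragg rings should fall at evenly spaced positions q_n = 2πn/d. The sketch below assumes the commonly quoted d-spacing of 58.38 Å; the "measured" peak positions are hypothetical numbers chosen only to show a deviation that grows with q, and the function names are not from the beamline script.

```python
import math

# Commonly quoted (001) d-spacing of silver behenate, in angstroms (assumed here).
AGBE_D_SPACING = 58.38

def expected_ring_positions(d_spacing, orders):
    """Return q_n = 2*pi*n/d for rings n = 1..orders, in inverse angstroms."""
    return [2.0 * math.pi * n / d_spacing for n in range(1, orders + 1)]

def relative_deviations(measured, expected):
    """Fractional difference between measured and expected ring positions."""
    return [(m - e) / e for m, e in zip(measured, expected)]

expected = expected_ring_positions(AGBE_D_SPACING, orders=5)

# Hypothetical peak positions read from a raw (warped) frame; illustrative only.
raw_peaks = [0.1076, 0.2150, 0.3218, 0.4280, 0.5330]
deviations = relative_deviations(raw_peaks, expected)
```

If the deviations grow with ring order, as in these made-up numbers, the distortion matters most at higher q, which is the kind of discrepancy the comparison of raw and unwarped frames is meant to reveal.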
The next step in the progression was to create a MATLAB program that could plot SAXS
images. We decided to create a program that specifically plotted them based on their well-plate
position; this decision was made in part because we didn’t know relevant information about the
samples (such as surfactant and salt concentration) until late in the experiment. The well-plate
position, however, was contained in the file header. Python code was created to extract this
information and import it into the master spreadsheet. From the spreadsheet, it was imported
into MATLAB, where the program I created plotted each image at its well-plate position. This
was done for all eighteen well-plates.
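The mapping from a well-plate position to a place in a grid of plots can be sketched in Python (the original plotting program was written in MATLAB). The label format ("C5" for row C, column 5) and the 6 × 12 layout are assumptions made for illustration; the actual header format is not given here, and 6 × 12 was chosen only because it yields seventy-two positions.

```python
def well_to_grid(label, n_rows=6, n_cols=12):
    """Convert a label such as 'C5' to (row, column, subplot index).

    Row and column are zero-based; the subplot index is one-based and
    counts across rows, matching the usual subplot numbering convention.
    """
    row = ord(label[0].upper()) - ord("A")
    col = int(label[1:]) - 1
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError(f"well {label!r} is outside a {n_rows}x{n_cols} plate")
    return row, col, row * n_cols + col + 1

# Hypothetical frames and well labels, for illustration only.
wells = {"frame_001": "A1", "frame_017": "C5"}
positions = {name: well_to_grid(label) for name, label in wells.items()}
```

Because the index depends only on the header value, frames can be placed correctly even before sample details such as concentration are known.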
Figure 2. Bragg rings of silver behenate (AgBe). Known d-spacings are used to calibrate detector
position in the Datasqueeze software.
Figure 3. AgBe raw data (blue) versus unwarped (red). The differences in the peaks at higher q
indicate that all the images must be unwarped in order to maintain accuracy.
After that, the main Python code was written to create a “data dictionary”. This code first
used a directory search to find the file names of all the raw data and, for these files, extracted
the relevant information from the file header: the well-plate position and the x-ray monitor
counts (this information was also imported to the master spreadsheet). Then, the code read the
master spreadsheet to assign appropriate values to the variables “background” (true or false) and
“sample” (the name of the actual sample). Next, all the monitor count values were averaged, and
a new variable, “normalized”, was created by dividing the average value by the monitor count
of that specific data frame. Lastly, for each data frame that corresponded to a sample, the
“associated background” was set to a background frame on the same well-plate. For each data
frame, all of these variables were saved in a dictionary.
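A minimal sketch of the data dictionary described above might look like the following. The key names, file names, and monitor counts are hypothetical, and choosing the first background frame found on a plate is an assumption; the text says only that the associated background is a background frame on the same well-plate.

```python
from statistics import mean

def build_data_dictionary(frames):
    """Return one entry per frame, keyed by file name."""
    average_monitor = mean(frame["monitor"] for frame in frames)

    # Remember the first background frame seen on each well-plate (assumed rule).
    background_for_plate = {}
    for frame in frames:
        if frame["background"]:
            background_for_plate.setdefault(frame["plate"], frame["file"])

    dictionary = {}
    for frame in frames:
        entry = dict(frame)
        # Scale factor: average monitor count divided by this frame's count.
        entry["normalized"] = average_monitor / frame["monitor"]
        entry["associated_background"] = (
            None if frame["background"] else background_for_plate.get(frame["plate"])
        )
        dictionary[frame["file"]] = entry
    return dictionary

# Hypothetical frames, for illustration only.
frames = [
    {"file": "frame_001", "plate": 1, "well": "A1", "monitor": 100.0,
     "background": True, "sample": None},
    {"file": "frame_002", "plate": 1, "well": "A2", "monitor": 200.0,
     "background": False, "sample": "formulation A"},
    {"file": "frame_003", "plate": 1, "well": "A3", "monitor": 300.0,
     "background": False, "sample": "formulation B"},
]
data = build_data_dictionary(frames)
```

A frame recorded with fewer monitor counts than average receives a factor above 1, so multiplying by “normalized” puts all frames on a common intensity scale.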
This data dictionary could then be used to obtain the desired results in Datasqueeze.
Python was used to create a batch file for Datasqueeze which read in each data frame that
corresponded to a sample, normalized it, and subtracted its normalized associated background.
We were thus left with diffraction data from only the sample itself. Using this data, plots of
intensity versus q were made, for the whole data frame as well as “slices” of the pattern, in
addition to plots of intensity versus chi (azimuthal angle) to check for anisotropic samples. Fits
of the peaks in these plots are still in progress.
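The arithmetic behind the normalization and background subtraction can be written out as a short Python sketch. This is not Datasqueeze batch syntax, which is not reproduced here; it only illustrates the operation on lists of intensities, and the function name and numbers are hypothetical.

```python
def subtract_background(sample, background, sample_norm, background_norm):
    """Scale both frames by their normalization factors, then subtract."""
    if len(sample) != len(background):
        raise ValueError("sample and background must have the same length")
    return [
        s * sample_norm - b * background_norm
        for s, b in zip(sample, background)
    ]

# Hypothetical intensities and normalization factors, for illustration only.
sample_intensity = [10.0, 20.0, 30.0]
background_intensity = [1.0, 2.0, 3.0]
corrected = subtract_background(
    sample_intensity, background_intensity,
    sample_norm=2.0, background_norm=1.0,
)
```

Normalizing each frame before subtracting matters because the sample and its background were generally recorded with different monitor counts.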
III. Results
Figure 4. Example of one of the 18 different well-plate scans.
Figure 4 shows an example of one of the outputs of the MATLAB program. Figure 5 shows
the more than 350 SAXS images distributed over eighteen well-plates. This program allows
large-scale visualization of the data so that immediate conclusions can be drawn.
Figure 5. Five different sets of surfactant solutions distributed across 18 well-plates.
Figure 6 shows an example of one of the “data dictionaries”. The dictionaries can be
stored in a file which can be read in to any other Python program, thereby allowing the user to
utilize any aspect of the data.
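One way to store such a dictionary so that another Python program can read it is sketched below. JSON is an assumption here, since the file format actually used is not stated, and the entry shown is hypothetical.

```python
import json
import os
import tempfile

def save_dictionary(dictionary, path):
    """Write the data dictionary to a JSON file."""
    with open(path, "w") as handle:
        json.dump(dictionary, handle, indent=2, sort_keys=True)

def load_dictionary(path):
    """Read a data dictionary back from a JSON file."""
    with open(path) as handle:
        return json.load(handle)

# Hypothetical entry, for illustration only.
original = {
    "frame_002": {
        "well": "A2",
        "background": False,
        "sample": "formulation A",
        "normalized": 1.0,
        "associated_background": "frame_001",
    }
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data_dictionary.json")
    save_dictionary(original, path)
    restored = load_dictionary(path)
```

A plain-text format such as JSON also keeps the stored dictionary readable by programs written in other languages.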
Figure 6. Example of a dictionary for one SAXS image.
Figure 7 shows an example of a normalized, background-subtracted isotropic sample.
Figure 8 shows the same image for an anisotropic sample. This particular case is one of the few
of all samples that showed significant anisotropy (non-uniform scattering patterns). Figure 9 is
an example of an intensity versus q graph for an isotropic sample; Figure 10 is the same for an
anisotropic sample. Based on these two graphs, one cannot distinguish between isotropic and
anisotropic samples. So, although plots of intensity versus q are particularly important in the
analysis of the data, as they reveal information about the structure of the amphiphiles, another
method is needed to identify the anisotropic samples. Figure 11 shows the plot of intensity
versus chi, averaged over q, for the isotropic sample; Figure 12 shows the same for the
anisotropic sample. Note the peaks in Figure 12, where the sample scatters more strongly at
particular angles, versus the relatively flat line in Figure 11.
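The contrast between a flat and a peaked intensity versus chi curve could be turned into a number, for example as sketched below. The index (maximum minus minimum, divided by the mean) is an illustrative choice rather than the method used in this project, and both curves are synthetic.

```python
import math
from statistics import mean

def anisotropy_index(intensity_vs_chi):
    """Return (max - min) / mean of an intensity-versus-chi curve."""
    average = mean(intensity_vs_chi)
    if average <= 0:
        raise ValueError("mean intensity must be positive")
    return (max(intensity_vs_chi) - min(intensity_vs_chi)) / average

# Synthetic curves sampled every 10 degrees, for illustration only.
chi_degrees = range(0, 360, 10)
flat_curve = [100.0 for _ in chi_degrees]
peaked_curve = [
    100.0 + 50.0 * math.cos(math.radians(chi)) ** 2 for chi in chi_degrees
]

flat_index = anisotropy_index(flat_curve)
peaked_index = anisotropy_index(peaked_curve)
```

An index near zero corresponds to the flat line expected for an isotropic sample; real data would also need a noise-dependent threshold, which is not attempted here.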
Figure 7. Example of an isotropic sample that has
been normalized and had the background
subtracted.
Figure 8. Example of an anisotropic sample that
has been normalized and had the background
subtracted.
Figure 9. Example of an intensity versus q plot for an isotropic sample.
Figure 10. Example of an intensity versus q plot for an anisotropic sample.
IV. Discussion
These programs make high-throughput analysis of SAXS data possible. The MATLAB
program allows visualization of many data frames at once, in our case as many as seventy-two,
so that immediate conclusions can be drawn. This program can also be slightly modified so that,
instead of being plotted by well-plate position, the samples are plotted by surfactant
concentration versus salt concentration, or by other relevant variables, so that the effect of those
variables on the diffraction patterns is easy to see.
Figure 11. Example of a plot of intensity versus chi for an isotropic sample. The absence of a
significant peak shows that the sample is isotropic.
Figure 12. Example of a plot of intensity versus chi for an anisotropic sample. Significant peaks
show that the sample is anisotropic.
While this project has come far, several steps remain before the data can be analyzed
thoroughly. The fit parameters of the plots need to be determined and then added to the
dictionary. Most importantly, the data itself still needs to be analyzed, a task that is made
considerably easier by these programs. The SAXS data can be used to study the morphology of
these surfactant micelles and their phase behavior in aqueous solutions. The contribution to I(q)
(see Figures 9 and 10) arising from the micellar electron density (Figure 13 (b)) is termed the
form factor; the contribution to I(q) from the variation in electron density in ordered domains
(Figure 13 (c)) is referred to as the structure factor. The two are combined in the “interaction
peak” (Figure 14³). While the interaction peak will always contain a form factor peak for any
sample of amphiphiles, the prominence of the structure factor peak is what will vary
significantly between samples and what can provide the most insight into the properties of the
material for application to industry.
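For reference, the way the two contributions combine is often written in the following textbook form. It is an approximation that holds for identical, roughly spherical particles, and the symbols n (number density of micelles), P(q) (form factor), and S(q) (structure factor) are notation introduced here rather than taken from this report.

```latex
I(q) \;\propto\; n \, P(q) \, S(q),
\qquad S(q) \to 1 \ \text{when the micelle positions are uncorrelated}
```

In this picture, a sample with no ordering shows only the form factor, while increasing order among the micelles adds structure to S(q) and sharpens the peak.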
Figure 13. (a) Self-assembled amphiphiles in water. (b) Across a micelle, the electron density
ρ(r) varies in regions dominated by water, denser headgroups, and slightly less dense alkyl
chains. (c) Across a domain of ordered micelles, the electron density is periodic on the larger
length scale of the micelle spacing.
Figure 14. Scattering curve that illustrates typical contributions of form factor (dotted line) and
structure factor (dashed line)³.
V. References
1. All samples were provided by Lubrizol Corporation (Ohio).
2. Heiney, Paul A. Datasqueeze. Computer software. Datasqueeze Software. Vers. 3.0.4.
N.p., 7 Feb. 2015. Web. 27 July 2015.
3. Itri, R., and L. Q. Amaral. "Micellar-shape Anisometry near Isotropic–liquid-crystal
Phase Transitions." Physical Review E 47.4 (1993): 2551-2557. Print.
VI. Acknowledgements
This project was supported in part by the Brookhaven National Laboratory (BNL) Photon
Sciences Department under the BNL Supplemental Undergraduate Research Program (SURP)
(U. S. Department of Energy contract numbers DE-AC02-98CH10886 and DE-SC0012704). I
would also like to thank Vesna Stanic of LNLS and Ramiro Galleguilos of Lubrizol Corp. for
taking the original data and providing me with some essential background information about the
project. Lastly, I would like to thank my mentor, Elaine DiMasi, for providing me with the
opportunity to work on this project and giving me guidance.

More Related Content

What's hot

HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkEamonn Maguire
 
Exploiting Hierarchical Context on a Large Database of Object Categories
Exploiting Hierarchical Context on a Large Database of Object Categories Exploiting Hierarchical Context on a Large Database of Object Categories
Exploiting Hierarchical Context on a Large Database of Object Categories Debaleena Chattopadhyay
 
2007-10-16 HTAP Juelich
2007-10-16 HTAP Juelich2007-10-16 HTAP Juelich
2007-10-16 HTAP JuelichRudolf Husar
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceRobert Grossman
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Robert Grossman
 
Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemChris Southan
 
Development Infographic
Development InfographicDevelopment Infographic
Development InfographicRealMassive
 
Machine learning astronomical structure
Machine learning astronomical structureMachine learning astronomical structure
Machine learning astronomical structurePanditNitesh
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Robert Grossman
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Robert Grossman
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefRobert Grossman
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?Robert Grossman
 

What's hot (20)

HEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 TalkHEPData Open Repositories 2016 Talk
HEPData Open Repositories 2016 Talk
 
Exploiting Hierarchical Context on a Large Database of Object Categories
Exploiting Hierarchical Context on a Large Database of Object Categories Exploiting Hierarchical Context on a Large Database of Object Categories
Exploiting Hierarchical Context on a Large Database of Object Categories
 
HEPData
HEPDataHEPData
HEPData
 
2007-10-16 HTAP Juelich
2007-10-16 HTAP Juelich2007-10-16 HTAP Juelich
2007-10-16 HTAP Juelich
 
Slide 1
Slide 1Slide 1
Slide 1
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
Pros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChemPros and cons of patent-extracted structures in PubChem
Pros and cons of patent-extracted structures in PubChem
 
Development Infographic
Development InfographicDevelopment Infographic
Development Infographic
 
Machine learning astronomical structure
Machine learning astronomical structureMachine learning astronomical structure
Machine learning astronomical structure
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Large Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster ReliefLarge Scale On-Demand Image Processing For Disaster Relief
Large Scale On-Demand Image Processing For Disaster Relief
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 

Viewers also liked

Caminhoverdadevidal20
Caminhoverdadevidal20Caminhoverdadevidal20
Caminhoverdadevidal20Manoel Gamas
 
Michael Cortez Resume_2016
Michael Cortez Resume_2016Michael Cortez Resume_2016
Michael Cortez Resume_2016Michael Cortez
 
Title sequence formal conventions
Title sequence formal conventionsTitle sequence formal conventions
Title sequence formal conventionsPablo Guembe-Young
 
Verifying the role of AID in Chronic Lymphocytic Leukemia
Verifying the role of AID in Chronic Lymphocytic LeukemiaVerifying the role of AID in Chronic Lymphocytic Leukemia
Verifying the role of AID in Chronic Lymphocytic LeukemiaCharlotte Broadbent
 
Principios de contabilidad generalmente aceptados
Principios de contabilidad generalmente aceptadosPrincipios de contabilidad generalmente aceptados
Principios de contabilidad generalmente aceptadosmanue0606
 
Expresarme es mi derecho
Expresarme es mi derecho Expresarme es mi derecho
Expresarme es mi derecho JuanPabloZ20
 
Rizzo's Resume 2016-April
Rizzo's Resume 2016-AprilRizzo's Resume 2016-April
Rizzo's Resume 2016-AprilMike Rizzo
 
LiTHIUM X - Corporate Presentation
LiTHIUM X - Corporate PresentationLiTHIUM X - Corporate Presentation
LiTHIUM X - Corporate PresentationLiTHIUM X Corp
 
Presentation oppermann agriculture
Presentation oppermann agriculturePresentation oppermann agriculture
Presentation oppermann agricultureMEHEDI HASAN
 

Viewers also liked (17)

fat_muscle_NIMA_2015
fat_muscle_NIMA_2015fat_muscle_NIMA_2015
fat_muscle_NIMA_2015
 
Caminhoverdadevidal20
Caminhoverdadevidal20Caminhoverdadevidal20
Caminhoverdadevidal20
 
Lefentse Sennelo - CV
Lefentse Sennelo - CVLefentse Sennelo - CV
Lefentse Sennelo - CV
 
Michael Cortez Resume_2016
Michael Cortez Resume_2016Michael Cortez Resume_2016
Michael Cortez Resume_2016
 
Matematica
Matematica Matematica
Matematica
 
Brochura_MBhotels_VersaoInglesa
Brochura_MBhotels_VersaoInglesaBrochura_MBhotels_VersaoInglesa
Brochura_MBhotels_VersaoInglesa
 
O espirito
O espiritoO espirito
O espirito
 
GrantProposal
GrantProposalGrantProposal
GrantProposal
 
Genre research
Genre researchGenre research
Genre research
 
Title sequence formal conventions
Title sequence formal conventionsTitle sequence formal conventions
Title sequence formal conventions
 
Questionnaire for Teacher
Questionnaire for TeacherQuestionnaire for Teacher
Questionnaire for Teacher
 
Verifying the role of AID in Chronic Lymphocytic Leukemia
Verifying the role of AID in Chronic Lymphocytic LeukemiaVerifying the role of AID in Chronic Lymphocytic Leukemia
Verifying the role of AID in Chronic Lymphocytic Leukemia
 
Principios de contabilidad generalmente aceptados
Principios de contabilidad generalmente aceptadosPrincipios de contabilidad generalmente aceptados
Principios de contabilidad generalmente aceptados
 
Expresarme es mi derecho
Expresarme es mi derecho Expresarme es mi derecho
Expresarme es mi derecho
 
Rizzo's Resume 2016-April
Rizzo's Resume 2016-AprilRizzo's Resume 2016-April
Rizzo's Resume 2016-April
 
LiTHIUM X - Corporate Presentation
LiTHIUM X - Corporate PresentationLiTHIUM X - Corporate Presentation
LiTHIUM X - Corporate Presentation
 
Presentation oppermann agriculture
Presentation oppermann agriculturePresentation oppermann agriculture
Presentation oppermann agriculture
 

Similar to Gooey data sets

IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstracttsysglobalsolutions
 
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...AIRCC Publishing Corporation
 
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...AIRCC Publishing Corporation
 
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...ijcsit
 
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...ijaia
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
IEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsIEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsVijay Karan
 
IEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsIEEE 2014 Matlab Projects
IEEE 2014 Matlab ProjectsVijay Karan
 
(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...(2016)application of parallel glowworm swarm optimization algorithm for data ...
(2016)application of parallel glowworm swarm optimization algorithm for data ...Akram Pasha
 
Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams
Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data StreamsNovel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams
Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streamsirjes
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetIJCERT
 
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based ModelProbabilistic Programming for Dynamic Data Assimilation on an Agent-Based Model
Probabilistic Programming for Dynamic Data Assimilation on an Agent-Based ModelNick Malleson
 
2016 Summer Fellowship Report R
2016 Summer Fellowship Report R2016 Summer Fellowship Report R
2016 Summer Fellowship Report RMegan R. Murphy
 
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACHCOLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACHIJCI JOURNAL
 
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...
Achieving Algorithmic Transparency with Shapley Additive Explanations (H2O Lo...Sri Ambati
 

Similar to Gooey data sets (20)

IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
 
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...
Variations in Outcome for the Same Map Reduce Transitive Closure Algorithm Im...
 
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM...
 
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Data dissemination and materials informatics at LBNL

Gooey data sets

  • 1. Gooey data sets: high throughput structural data from complex amphiphiles
    Charlotte Broadbent, Civil Engineering, Columbia University, New York, NY 10027
    Elaine DiMasi, Photon Sciences Department, Brookhaven National Lab, Upton, NY 11973
  • 2. Abstract
    Environmental awareness among consumers has prompted many industries, including cosmetic companies, to turn to “green” alternatives for their products. Complex amphiphiles, self-assembled structures of surfactant molecules, give products such as shampoo and liquid hand soap their physical properties; new formulations can be both physically benign and environmentally safe to synthesize. In July 2012, in order to determine the molecular properties of certain complex amphiphiles under varying conditions, small angle x-ray scattering (SAXS) patterns of over 150 samples were obtained at the National Synchrotron Light Source (NSLS) X6B beamline. Current techniques for analyzing such data (manually treating one data frame at a time) are insufficient for the large amount of data that has been acquired, so a much faster method is necessary. This project created a method for consistently treating, visualizing, and statistically analyzing these large data sets. Using tools such as Python, we developed large-scale processes in Datasqueeze, a program specifically designed for SAXS images, and in MATLAB, whose graphing and imaging capabilities we used. This will enable us to finally analyze the July 2012 data in an automatic, time-efficient manner. Such techniques can then be applied not only to studies on complex amphiphiles, but also to any other SAXS study at NSLS-II or a similar facility.
  • 3. I. Introduction
    Over the past decade, many industries have turned to “green” alternatives for their products in response to increased environmental awareness among consumers. Lubrizol Corp. is no exception. It produces chemicals that are sold to companies in the personal care industry for products such as shampoos and liquid hand soaps. Complex amphiphiles, self-assembled structures of surfactant molecules, give these products several physical-chemical properties, such as rheology and clarity, and could reduce skin and eye irritation. In order to test new formulations of complex amphiphiles that are both physically benign and environmentally safe to synthesize, Lubrizol was interested in using SAXS imaging, a fast method of determining the structure, and by extension the physical properties, of the new formulations. In July 2012, representatives from Lubrizol took SAXS images of over 150 samples of complex amphiphiles at the NSLS X6B beamline; this high-throughput data collection was enabled by well-plates (see Figure 1), which could hold many samples at once during automated scans.
    Figure 1. Example of a well-plate used in SAXS scans. Holes in the Teflon block are covered by mylar film to contain liquid and gel samples, then plates clamp the assembly together.
  • 4. However, the July 2012 experiments produced too much data for current techniques to handle. These techniques involve manually unwarping each individual frame (a process that corrects for distortion from the detector), extracting the relevant information, and then using that information. Not only were there over 150 samples [1], but the experiments produced over 300 data frames, in part due to the beamline technology: automated scans would scan every well in a well-plate (or row of a well-plate), regardless of content (see Figure 1). Therefore, the data from this experiment has been left largely untouched because of the huge amount of time that would have been necessary to analyze it thoroughly. The main focus of this project was to develop a method to handle the large number of SAXS images from the July 2012 Lubrizol experiments, a process which can hopefully be applied to other SAXS experiments involving high throughput data.
    II. Methods
    Thorough analysis of the data first required thorough knowledge of the data. Before any automated programs were created, I made a spreadsheet so that most of the information was in the same place. The first column contained the file name; the second, a 1 if the file corresponded to a sample and a 0 if the file was a background; and the third, the sample name if the previous column contained a 1. Several other columns were also added, to be filled in later with the help of Python. Secondly, since the detector that was used to take this data did not unwarp the data automatically, we had to decide whether or not unwarping was necessary. To do this, plots of intensity versus q were made for two different frames of Silver Behenate (used to calibrate detector position in the Datasqueeze [2] software; see Figure 2): one that was raw data, and one that had
  • 5. been unwarped. Significant differences between the two graphs at higher q ranges told us that unwarping was necessary (see Figure 3). We then used Python to automatically unwarp all of the files, by way of a special script created for the beamline.
    Figure 2. Bragg rings of Silver Behenate (AgBe). Known d-spacings are used to calibrate detector position in the Datasqueeze software.
    Figure 3. AgBe raw data (blue) versus unwarped (red). The differences in the peaks at higher q indicate that all the images must be unwarped in order to maintain accuracy.
    The next step was to create a MATLAB program that could plot SAXS images. We decided to create a program that plotted them by well-plate position; this decision was made in part because we did not know relevant information about the samples (such as surfactant and salt concentration) until late in the experiment. The well-plate position, however, was contained in the file header. Python code was created to extract this information and import it into the master spreadsheet. From the spreadsheet, it was imported into MATLAB, where the program I created plotted the images. This was done for all eighteen well-plates.
    After that, the main Python code was written to create a “data dictionary”. This code first used a directory search to find all the file names of the raw data, and for these files extracted the relevant information from the file header: the well-plate position and the x-ray monitor counts (this information was also imported to the master spreadsheet). Then, the code read the master
  • 6. spreadsheet to assign appropriate values to the variables “background” (true or false) and “sample” (the name of the actual sample). Then, all the monitor count values were averaged, and a new variable, “normalized”, was created by dividing the average value by the monitor count of that specific data frame. Lastly, for those data frames that corresponded to samples, the “associated background” was a background frame on that same well-plate. For each data frame, all of these variables were saved in a dictionary.
    This data dictionary could then be used to obtain the desired results in Datasqueeze. Python was used to create a batch file for Datasqueeze which read in each data frame that corresponded to a sample, normalized it, and subtracted its normalized associated background. We were thus left with diffraction data from only the sample itself. Using this data, plots of intensity versus q were made for the whole data frame as well as for “slices” of the pattern, in addition to plots of intensity versus chi (angle) to check for anisotropic samples. Fits of these peaks are currently in progress.
    III. Results
    Figure 4. Example of one of the 18 different well-plate scans.
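The normalization and background subtraction described in the Methods can be illustrated outside Datasqueeze with a few lines of NumPy. This is only a minimal sketch under stated assumptions: the frame names, monitor counts, and pixel values are invented toy data, the function names are hypothetical, and the actual processing in this project was done through a Datasqueeze batch file whose syntax is not reproduced here.

```python
import numpy as np


def normalization_factor(monitor_counts, frame_name):
    """Average monitor count over all frames divided by this frame's count."""
    mean_counts = np.mean(list(monitor_counts.values()))
    return mean_counts / monitor_counts[frame_name]


def subtract_background(sample_img, background_img, sample_factor, background_factor):
    """Scale both frames to the same incident flux, then subtract the background."""
    return sample_factor * sample_img - background_factor * background_img


# Toy data (assumed, not from the experiment): monitor counts per frame.
monitor_counts = {"sample_A": 900.0, "background_A": 1100.0}

# At the average flux the background scatters 11 counts per pixel and the
# sample adds 5 more; each recorded frame is scaled by its own flux.
background_img = np.full((4, 4), 11.0 * 1100.0 / 1000.0)
sample_img = np.full((4, 4), (11.0 + 5.0) * 900.0 / 1000.0)

corrected = subtract_background(
    sample_img,
    background_img,
    normalization_factor(monitor_counts, "sample_A"),
    normalization_factor(monitor_counts, "background_A"),
)
print(corrected[0, 0])  # close to 5.0, the sample-only signal in this toy case
```

Scaling both frames before subtracting matters because the incident flux differs from frame to frame; subtracting raw frames would leave a residual background proportional to that difference.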
  • 7. Figure 4 shows an example of one of the outputs of the MATLAB program. Figure 5 shows the more than 350 SAXS images distributed over eighteen well-plates. This program allows large-scale visualization of the data so that immediate conclusions can be drawn.
    Figure 5. Five different sets of surfactant solutions distributed across 18 well-plates.
    Figure 6 shows an example of one of the “data dictionaries”. The dictionaries can be stored in a file which can be read in to any other Python program, thereby allowing the user to utilize any aspect of the data.
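As a rough illustration of how such a data dictionary might be assembled and stored, the sketch below builds one entry per data frame and round-trips the result through JSON. The field names follow the variables named in the Methods, but everything else is an assumption made for illustration: the input structures `headers` and `spreadsheet`, the file names, the choice of the first background on the plate as the associated background, and JSON as the storage format are not taken from the original code.

```python
import json
from statistics import mean


def build_data_dictionary(headers, spreadsheet):
    """Combine header values and spreadsheet rows into one entry per frame.

    headers:     file name -> {"plate", "well", "monitor"} (from the file header)
    spreadsheet: file name -> {"is_sample": 1 or 0, "sample_name": str or None}
    """
    average_monitor = mean(h["monitor"] for h in headers.values())
    frames = {}
    for name, header in headers.items():
        row = spreadsheet[name]
        frames[name] = {
            "plate": header["plate"],
            "well": header["well"],
            "monitor": header["monitor"],
            "background": not row["is_sample"],
            "sample": row["sample_name"],
            # Average monitor count divided by this frame's monitor count.
            "normalized": average_monitor / header["monitor"],
            "associated_background": None,
        }
    # Pair each sample with a background frame from the same well-plate.
    # Taking the first one alphabetically is an assumption of this sketch.
    for name, entry in frames.items():
        if entry["background"]:
            continue
        candidates = sorted(
            other for other, e in frames.items()
            if e["background"] and e["plate"] == entry["plate"]
        )
        entry["associated_background"] = candidates[0] if candidates else None
    return frames


# Toy inputs (hypothetical file names and values).
headers = {
    "f001": {"plate": 1, "well": "A1", "monitor": 800.0},
    "f002": {"plate": 1, "well": "A2", "monitor": 1200.0},
}
spreadsheet = {
    "f001": {"is_sample": 0, "sample_name": None},
    "f002": {"is_sample": 1, "sample_name": "surfactant_X"},
}

data_dictionary = build_data_dictionary(headers, spreadsheet)

# Store as text and read back, as any other Python program could do.
stored = json.dumps(data_dictionary, indent=2)
restored = json.loads(stored)
print(restored["f002"]["associated_background"])
```

A plain-text format keeps the dictionary readable by other tools as well; Python's `pickle` module would be an alternative if only Python programs need to read it.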
  • 8. Figure 6. Example of a dictionary for one SAXS image.
    Figure 7 shows an example of a normalized, background-subtracted isotropic sample. Figure 8 shows the same image for an anisotropic sample. This particular case is one of the few samples that showed significant anisotropy (non-uniform scattering patterns). Figure 9 is an example of an intensity versus q graph for an isotropic sample; Figure 10 is the same for an anisotropic sample. Based on these two graphs alone, one cannot distinguish an isotropic sample from an anisotropic one. So, although these plots are particularly important in the analysis of the data, as they reveal information about the structure of the amphiphiles, another method is needed to identify the anisotropic samples. Figure 11 shows the plot of intensity versus chi, averaged over q, for the isotropic sample; Figure 12 shows the same for the anisotropic sample. Notice the peaks in Figure 12, where the sample has higher intensity at particular angles, versus the relatively flat line in Figure 11.
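The intensity versus chi check can be sketched in NumPy as follows. This is an illustrative sketch, not the Datasqueeze procedure used in the project: it works in pixel radius rather than calibrated q, the beam center and ring limits are assumed values, the test images are synthetic, and the `anisotropy_ratio` measure is a hypothetical convenience, not a quantity defined in this report.

```python
import numpy as np


def azimuthal_profile(image, center, r_min, r_max, n_bins=36):
    """Mean intensity versus chi (degrees) over the ring r_min <= r < r_max.

    Radii are in pixels; converting to q would need the detector calibration.
    """
    rows, cols = np.indices(image.shape)
    dy = rows - center[0]
    dx = cols - center[1]
    radius = np.hypot(dx, dy)
    chi = np.degrees(np.arctan2(dy, dx)) % 360.0
    ring = (radius >= r_min) & (radius < r_max)
    # Clamp guards against a rounding case where chi evaluates to exactly 360.
    bins = np.minimum((chi[ring] / 360.0 * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(bins, weights=image[ring], minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    # Empty bins (none occur for this ring) would be reported as zero.
    profile = sums / np.maximum(counts, 1)
    bin_centers = (np.arange(n_bins) + 0.5) * 360.0 / n_bins
    return bin_centers, profile


def anisotropy_ratio(profile):
    """Peak-to-trough variation of I(chi) relative to its mean (0 if flat)."""
    return (profile.max() - profile.min()) / profile.mean()


# Synthetic test images: a uniform pattern and one modulated as cos(2*chi).
size, center = 101, (50, 50)
rows, cols = np.indices((size, size))
chi_rad = np.arctan2(rows - center[0], cols - center[1])
isotropic = np.full((size, size), 100.0)
anisotropic = 100.0 * (1.0 + 0.5 * np.cos(2.0 * chi_rad))

_, iso_profile = azimuthal_profile(isotropic, center, r_min=20, r_max=40)
_, aniso_profile = azimuthal_profile(anisotropic, center, r_min=20, r_max=40)

print(anisotropy_ratio(iso_profile), anisotropy_ratio(aniso_profile))
```

A flat profile gives a ratio near zero, while the modulated pattern gives a ratio of order one; where to place a cutoff between the two for real data would have to be decided by inspecting the measured frames.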
  • 9. Figure 7. Example of an isotropic sample that has been normalized and had the background subtracted.
    Figure 8. Example of an anisotropic sample that has been normalized and had the background subtracted.
    Figure 9 (above). Example of an intensity versus q plot for an isotropic sample.
    Figure 10 (below). Example of an intensity versus q plot for an anisotropic sample.
  • 10. IV. Discussion
    The development of these programs makes high-throughput SAXS data analysis possible. The MATLAB program allows visualization of many data frames at once, in our case as many as seventy-two, so that immediate conclusions can be drawn. This program can also be slightly modified so that, instead of being plotted by well-plate position, the samples are plotted by surfactant concentration versus salt concentration, or by other relevant variables, so that the impact of those variables on the diffraction patterns is obvious.
    Figure 11 (above). Example of a plot of intensity versus chi for an isotropic sample. The absence of a significant peak shows that the sample is isotropic.
    Figure 12 (below). Example of a plot of intensity versus chi for an anisotropic sample. Significant peaks show that the sample is anisotropic.
  • 11. While this project has come far, there are still several steps that need to be taken to ensure thorough analysis of the data. The fit parameters of the plots need to be ascertained and then added to the dictionary. Most importantly, the data needs to actually be analyzed, a task that is made considerably easier by these programs.
    The SAXS data can be used to study the morphology of these surfactant micelles and their phase behavior in aqueous solutions. The contribution to I(q) (see Figures 9 and 10) arising from the micellar electron density (Figure 13 (b)) is termed the form factor; the contribution to I(q) from the variation in electron density in ordered domains (Figure 13 (c)) is referred to as the structure factor. The two are combined in the “interaction peak” (Figure 14 [3]). While the interaction peak will always contain a form factor peak for any sample of amphiphiles, the prominence of the structure factor peak is what will vary significantly between samples and what can provide the most insight into the properties of the material for application to industry.
    Figure 13. (a) Self-assembled amphiphiles in water. (b) Across a micelle, the electron density ρ(r) varies in regions dominated by water, denser headgroups, and slightly less dense alkyl chains. (c) Across a domain of ordered micelles, the electron density is periodic on the larger length scale of the micelle spacing.
    Figure 14. Scattering curve that illustrates typical contributions of form factor (dotted line) and structure factor (dashed line) [3].
    V. References
    [1] All samples were provided by Lubrizol Corporation (Ohio).
    [2] Heiney, Paul A. Datasqueeze. Computer software. Datasqueeze Software. Vers. 3.0.4. N.p., 7 Feb. 2015. Web. 27 July 2015.
  • 12. [3] Itri, R., and L. Q. Amaral. "Micellar-shape Anisometry near Isotropic–liquid-crystal Phase Transitions." Physical Review E 47.4 (1993): 2551-557. Print.
    VI. Acknowledgements
    This project was supported in part by the Brookhaven National Laboratory (BNL) Photon Sciences Department under the BNL Supplemental Undergraduate Research Program (SURP) (U.S. Department of Energy contract numbers DE-AC02-98CH10886 and DE-SC0012704). I would also like to thank Vesna Stanic of LNLS and Ramiro Galleguilos of Lubrizol Corp. for taking the original data and providing me with some essential background information about the project. Lastly, I would like to thank my mentor, Elaine DiMasi, for providing me with the opportunity to work on this project and giving me guidance.