Presented at the Computing for Light and Neutron Sources Technical Forum. Discusses Globus Online transfer, sharing and metadata management in the context of collaboration with Advanced Photon Source.
1. Globus Online for Managing Tomography Data at APS
Rachana Ananthakrishnan
Francesco De Carlo
Argonne National Lab
2. We started with reliable, secure, high-performance file transfer …
1. User initiates transfer request (data source → data destination)
2. Globus Online moves and syncs files
3. Globus Online notifies user
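The three-step flow above can be sketched as a small state model. This is an illustrative sketch only; the class and field names (`TransferRequest`, endpoint labels, etc.) are assumptions, not the actual Globus Online API.

```python
# Minimal sketch of the transfer flow: initiate -> move/sync -> notify.
# All names here are illustrative, not the real Globus Online interface.
from dataclasses import dataclass, field

@dataclass
class TransferRequest:
    source_endpoint: str          # where the data lives (e.g. a facility server)
    destination_endpoint: str     # where the user wants it
    paths: list                   # files or directories to move
    sync: bool = True             # step 2: move *and* keep in sync
    notifications: list = field(default_factory=list)

    def submit(self):
        # Step 1: the user initiates the request; the service takes over.
        return {"status": "ACCEPTED", "task": self}

    def complete(self):
        # Step 3: the service notifies the user when the transfer finishes.
        self.notifications.append("transfer complete")
        return self.notifications[-1]

req = TransferRequest("aps#data", "home#laptop", ["/tomo/scan_001/"])
task = req.submit()
```

The point of the model is that the user only performs step 1; steps 2 and 3 happen without further user involvement.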
3. … and then made it simple to share big data off existing storage systems
1. User A selects file(s) to share, selects a user or group, and sets permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses the shared file
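The key design point in the sharing flow is that only access rules move, not the data. A minimal sketch of that idea, with illustrative names (the real Globus Online sharing API is not shown here):

```python
# Sketch of in-place sharing: an access rule is recorded against the existing
# storage location; the file itself is never copied to cloud storage.
# Endpoint and principal names are placeholders.

shared_acl = {}  # (endpoint, path) -> {user_or_group: permission}

def share(endpoint, path, principal, permission="r"):
    # Step 1: the owner selects file(s), a user or group, and permissions.
    shared_acl.setdefault((endpoint, path), {})[principal] = permission

def can_access(endpoint, path, principal):
    # Step 3: another user logs in and reads the shared file if permitted.
    return principal in shared_acl.get((endpoint, path), {})

share("aps#data", "/tomo/scan_001/", "user_b@example.org")
```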
4. Transforming data acquisition
Current
• Experimental parameters optimized manually
• Collected data combined with visual inspection to confirm optimal conditions
• Data reconstructed and sent to users via external drive
• User team starts data reduction at home institution
5. Transforming data acquisition
Envisaged
• Experimental parameters optimized automatically
• Collected data available to optimization programs
• Data are automatically reconstructed, reduced, and shared with local and remote participants
• User team leaves the APS with reduced data
Current
• Experimental parameters optimized manually
• Collected data combined with visual inspection to confirm optimal conditions
• Data reconstructed and sent to users via external drive
• User team starts data reduction at home institution
6. Globus Online as enabler
Facility data acquisition → Globus Online transfer service → reduced data → analysis/sharing, via the Globus Online sharing service and Globus Online dataset service*
* In development
7. Erin Miller (PNNL) collects data at the Advanced Photon Source, renders at PNNL, and views at ANL
Credit: Kerstin Kleese-van Dam
8. Looking at how researchers use data
• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• Best grouping can vary during investigation
– Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
9. How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks
– Even PowerPoint!
• Time-consuming, complex, error-prone
Why can’t we manage our data like we manage our pictures and music?
10. Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content …
– Capture as much existing information as we can
• … or to reflect current status in investigation
– Stage of processing, provenance, validation, …
• Share datasets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
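The dataset idea above — a logical grouping, independent of where the files live, tagged and operated on as a unit — can be sketched in a few lines. Class and field names are illustrative assumptions, not the dataset service's actual interface.

```python
# Sketch of a "dataset": a logical grouping of files by use, not location.
# Members stay in place; only the grouping and its tags are managed here.

class Dataset:
    def __init__(self, name):
        self.name = name
        self.members = []   # (location, path) pairs -- data is never moved
        self.tags = {}      # name -> value: content, status, provenance, ...

    def add(self, location, path):
        self.members.append((location, path))

    def tag(self, name, value):
        self.tags[name] = value

    def export_manifest(self):
        # Operating on the dataset as a unit: copy, export, archive, etc.
        # can all work from this single manifest.
        return {"dataset": self.name, "tags": dict(self.tags),
                "members": list(self.members)}

ds = Dataset("scan_001")
ds.add("aps#data", "/tomo/scan_001/raw.hdf")
ds.add("pnnl#archive", "/renders/scan_001.vtk")
ds.tag("stage", "reconstructed")
```

Note that regrouping for a longitudinal or cross-cutting view is just building another `Dataset` over the same members — nothing on disk changes.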
11. Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, and converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Integration with computation
– Associate computational procedures, orchestrate applications, catalog results, record provenance
12. Builds on catalog as a service
Approach
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs
• /query/ — retrieve subjects
• /tags/ — create, delete, retrieve tags
• /tagdef/ — create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
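To make the tag model and the three API paths concrete, here is a sketch of how a client might form requests against them. The path shapes (/query/, /tags/, /tagdef/) follow the slide; the host name and the exact URL grammar for queries and subjects are assumptions.

```python
# Sketch of the <subject, name, value> tag model behind the three REST APIs.
# BASE is a placeholder host; only the path prefixes come from the slide.

BASE = "https://catalog.example.org"   # assumed host, for illustration

def tagdef_url():
    # /tagdef/ -- create, delete, retrieve tag definitions
    return f"{BASE}/tagdef/"

def tags_url(subject):
    # /tags/ -- create, delete, retrieve tags on a given subject
    return f"{BASE}/tags/{subject}"

def query_url(name, value):
    # /query/ -- retrieve subjects whose tag <name> has value <value>
    # (query syntax is an assumption)
    return f"{BASE}/query/{name}={value}"

# A tag is simply a (subject, name, value) triple:
tag = ("dataset:scan_001", "stage", "reconstructed")
```

Optional schema constraints would be expressed as tag definitions registered via /tagdef/ before the corresponding tags are written.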
13. Exemplar: APS Beamlines 32-ID & 2-BM
X-ray imaging, tomography, ~few µm to 30 nm resolution
Currently can generate up to 100 TB per day
< 1 GB/s data rate today; ~3–5 GB/s in 5–10 years
14. Multi-scale 3D imaging data fusion at APS
Beamline 2-BM (~1.5 µm resolution): up to 100 fps, 2K × 2K, 16 bits; 11 GB raw data
→ image processing (normalization, etc.) → tomographic reconstruction → visual inspection → selection
Beamline 32-ID-C (20–50 nm resolution): 1,500 fps, 2K × 2K, 16 bits, 1 min readout; 11 GB raw data
→ image processing (alignment, etc.) → tomographic reconstruction → visual inspection → selection
Selections from both pipelines feed multi-scale image fusion and a final visual inspection; all stages read and write shared storage.
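A quick back-of-envelope check ties the detector figures on this slide to the "< 1 GB/s" rate quoted on the previous one: a 2K × 2K, 16-bit frame at the two frame rates given.

```python
# Back-of-envelope data rates for the two beamline cameras described above.
FRAME_BYTES = 2048 * 2048 * 2          # 2K x 2K pixels, 16 bits = 2 bytes each

rate_2bm = 100 * FRAME_BYTES           # 2-BM: up to 100 fps, sustained
rate_32id = 1500 * FRAME_BYTES         # 32-ID-C: 1,500 fps burst (1 min readout)

print(f"frame size:  {FRAME_BYTES / 1e6:.1f} MB")      # ~8.4 MB per frame
print(f"2-BM rate:   {rate_2bm / 1e9:.2f} GB/s")       # ~0.84 GB/s -> '< 1 GB/s'
print(f"32-ID burst: {rate_32id / 1e9:.1f} GB/s")      # burst only; readout-limited
```

The 32-ID-C burst rate exceeds 10 GB/s, which is why the camera needs a ~1 min readout rather than streaming continuously.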
15. Argonne collaborations
Groups: APS Imaging Group; APS Software Service Group; Mathematics & Computer Science / Computation Institute
Efforts: instrument & data collection; system integration; data management services; multi-scale image fusion (Tao of Fusion LDRD); infrastructure LDRD
Results: Google Earth-style zoom-in data navigation
16.–21. (image-only slides)
22. Timelines
• July: alpha service available
• August: pilot with two groups at APS
• Fall of this year: pilot with a few other groups at APS; early beta
23. Thank You
• Interested in working with us on the dataset service? Email: ranantha@mcs.anl.gov
• Contact: support@globusonline.org
• Website: www.globusonline.org
Editor's Notes
This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF). The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.