Harnessing cloud computing
for high capacity analysis of
neuroimaging data
Cameron Craddock, PhD
CPAC Connectome Analysis in the Cloud

Harnessing cloud computing to achieve high-throughput connectomes analysis.

  • The goal of large-scale analyses of connectomes data is to map inter-individual variation in phenotype, such as sex, age, and IQ, to variation in the connectome. For clinical populations, we are particularly concerned with identifying connectome-based biomarkers of disease presence, severity, and prognosis, specifically treatment outcomes. Recently, there has been consternation about the ecological validity of psychiatric disease classifications that are based on syndromes described by the presence of symptoms. This provides the opportunity to redefine psychiatric populations based on the similarity of connectomes, i.e. clustering individuals based on their connectivity profiles.
  • Green boxes indicate initiatives in which data is aggregated after it is acquired, rather than centralized initiatives in which data acquisition was coordinated between sites. Since the data collection is not coordinated for these sites, the data is more heterogeneous, being collected with different parameters.
  • The ultimate goal of CPAC is to make high-throughput state-of-the-art connectomes analyses accessible to researchers who lack programming and/or neuroimaging expertise. It is currently still in alpha, with the expectation of being beta by mid 2015. It is currently limited to functional connectivity analyses, but we plan to add DTI by the end of 2015.
  • Several different methods have been proposed for preprocessing connectomes data to remove nuisance variation that obscures biological variation. Some of these methods have been shown to introduce artifacts that bias results. Rather than a single best practice, a pluralistic approach is needed, in which several different procedures are performed and the results are compared to identify those that are robust across strategies.
  • In addition to a large number of preprocessing strategies, several different methods of analysis have been proposed, such as: (left to right) (top row) eigenvector and degree centrality, voxel-mirrored homotopic connectivity, fractional amplitude of low frequency fluctuations, bootstrap analysis of stable clusters, (bottom row) regional homogeneity, amplitude of low frequency fluctuations, and multi-dimensional matrix regression.
  • Preprocessing large datasets using current tools requires a substantial amount of time, even if automated using scripts. Performing multiple preprocessing strategies multiplies execution time, and each additional analysis adds to computation time. What is needed are tools that not only automate the preprocessing, but also take advantage of the parallelism inherent in the data and algorithms to achieve high-throughput processing on high performance computing architectures. We are developing CPAC to this aim.
  • Most expensive, on the left, is us-west; least expensive is us-east; second blue is sa-east-1c.
    Longest is EU-east; shortest is SA-east.
  • Instead of sharing just raw data, why not share the preprocessed data as well?
  • CPAC Connectome Analysis in the Cloud

    1. Harnessing cloud computing for high capacity analysis of neuroimaging data
        Cameron Craddock, PhD
        Computational Neuroimaging Lab, Center for Biomedical Imaging and Neuromodulation, Nathan S. Kline Institute for Psychiatric Research
        Center for the Developing Brain, Child Mind Institute
    2. Discovery science in Psychiatric Neuroimaging
        1. Characterizing inter-individual variation in connectomes (Kelly et al. 2012)
        2. Identifying biomarkers of disease state, severity, and prognosis (Craddock 2009)
        3. Re-defining mental health in terms of neurophenotypes, e.g. RDOC (Castellanos 2013)
        Data is often shared only in its raw form. It must be preprocessed to remove nuisance variation and to be made comparable across individuals and sites.
    3. Connectomics is Big Data
    4. Configurable Pipeline for the Analysis of Connectomes (CPAC)
        • Pipeline to automate preprocessing and analysis of large-scale datasets
        • Most cutting edge functional connectivity preprocessing and analysis algorithms
        • Configurable to enable “plurality” – evaluate different processing parameters and strategies
        • Automatically identifies and takes advantage of parallelism on multi-threaded, multi-core, and cluster architectures
        • “Warm restarts” – only re-compute what has changed
        • Open science – open source
        • http://fcp-indi.github.io
        Nipype
    5. Computing in the Amazon Cloud
        • No hardware capital cost
        • No hardware maintenance
        • No software installation or configuration*
        • Resources scale to meet need for no overhead
        • Available everywhere and to everybody
        • Allows access to exotic architectures, such as GPUs
        *If appropriate AMI is available
    6. Amazon EC2 - Instance
        • The hardware on which your processing will run:
    7. Instance Pricing
        • On-demand Pricing
            – Always available, fixed price, non-interruptible, most stable
        • Spot instances
            – Market to sell otherwise unused time, variable price, interruptible
    8. Spot Instances
        • Prices fluctuate over time
        • If price exceeds the max you are willing to pay, your instances are terminated
    9. Storage
        • S3 – Simple Storage Service
            – Secure and stable storage with a web service interface, pay for what you use
            – Big and slow, $0.03 per GB/Month
            – Can be accessed from anywhere
        • EBS – Elastic Block Storage
            – Provisioned storage (SSD HD) directly connected to instance, pay for what you provision
            – Fast and expensive, $0.10 per GB/Month
            – Persistent and transferable
        • Instance Storage
            – SSD storage provided with some instances, included in instance price
            – Fast and free
            – Non-persistent and non-transferable – good for cache
    10. Amazon EC2 - Instance
        • The hardware on which your processing will run:
    11. Data Transfer
        • In general, free in - pay out
            – Out to other Amazon services such as S3, EBS, etc. is free
            – Out to Internet is $0.09 per GB (becomes slightly cheaper after 10TB or so)
    12. Amazon Machine Images
        • Virtual machines that provide the software environment for your processing
        • You can build your own, or use one maintained by others
    13. StarCluster
        • StarCluster simplifies the process of building a Sun Grid Engine based cluster in EC2
            – Dynamically add and remove compute nodes
            – Uses spot instances
            – Provides scripts for performing many administrative tasks
    14. C-PAC Amazon Machine Image
        Nipype
    15. Proof of concept
        • Preprocessed 1,112 datasets from ABIDE with C-PAC
            – 4 different preprocessing strategies (+/- temporal filter, +/- global signal regression)
            – 24 derivatives:
                • ReHo, ALFF, fALFF, 10 RSNs, VMHC, binary degree centrality, weighted degree centrality, lFCD, time courses for 5 atlases (AAL, TT, EZ, HO, CC200, CC400)
        http://preprocessed-connectomes-project.github.io/abide
    16. Model Parameters
        • Requires 45 minutes to process 1 dataset
        • 3 datasets can be processed in parallel
        • Processing results in 0.5 GB of data
    17. Cloud vs. Traditional Computing
        [Figure: two plots against Number of Datasets (100 to 50,000). One shows Cost ($), broken down into Instance Cost, Storage Cost, and Transfer Cost. The other shows Time (hours), with series "No Download" and "Total Processing Time".]
    18. Impact of Spot Instances
        Simulations using past 90 days of spot price history
    19. Other Pipelines
    20. What about HIPAA?
        • Amazon AWS meets FedRAMP and NIST 800-53 standards, which are more rigorous than HIPAA
            – Access to instances controlled using 256-bit AES
            – Default firewalls deny all outside access
            – EC2, EBS, and S3 storage are compatible with encryption
        • AWS HIPAA whitepaper
            – http://d0.awsstatic.com/whitepapers/compliance/AWS_HIPAA_Compliance_Whitepaper.pdf
    21. C-PAC Amazon Machine Image
        Nipype
    22. Preprocessed INDI Data in the Cloud
        http://preprocessed-connectomes-project.github.io/
        • Available through S3 Bucket generously provided by AWS
        • Raw INDI will be available soon
    23. HCP Data available in the cloud:
        - https://wiki.humanconnectome.org/display/PublicData/Home
        - Receive $100 AWS Credits at the HCP workshop in Hawaii
        - http://humanconnectome.org/course-registration/2015/exploring-the-human-connectome.php
    24. Acknowledgements
        CPAC Team: Daniel Clark, Steven Giavasis and Michael Milham.
        NDAR “Cloud Team”: Christian Haselgrove, Dave Kennedy, and Jack van Horn.
        NDAR Team: Dan Hall, Brian Koser, David Obenshain, Svetlana Novikova, and Malcom Jackson.
        CPAC-NDAR integration was funded by a contract from NDAR.
        ABIDE Preprocessed data is hosted in a Public S3 Bucket provided by AWS.
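
The model parameters on slide 16, together with the S3 and data transfer prices quoted on slides 9 and 11 and the spot termination rule on slide 8, are enough to sketch a rough cost estimate. The Python below is an illustrative sketch only, not the model used to produce the talk's figures. The hourly instance price, the number of instances, and the one-month storage period are placeholder assumptions that do not come from the slides, as is the reading that "3 datasets in parallel" applies per instance. The names `ModelParams`, `estimate_cost`, and `first_interruption` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    # Values taken from the slides.
    minutes_per_dataset: float = 45.0    # slide 16
    datasets_in_parallel: int = 3        # slide 16 (assumed to mean per instance)
    output_gb_per_dataset: float = 0.5   # slide 16
    s3_usd_per_gb_month: float = 0.03    # slide 9
    transfer_usd_per_gb: float = 0.09    # slide 11, first pricing tier only
    # ASSUMPTIONS: placeholders, not from the slides.
    instance_usd_per_hour: float = 1.00
    n_instances: int = 10
    storage_months: float = 1.0


def estimate_cost(n_datasets, params=None):
    """Return wall-clock hours and the three cost components in USD."""
    p = params or ModelParams()
    slots = p.datasets_in_parallel * p.n_instances
    # Datasets are processed in rounds; each round fills every slot once.
    rounds = math.ceil(n_datasets / slots)
    wall_hours = rounds * p.minutes_per_dataset / 60.0
    output_gb = n_datasets * p.output_gb_per_dataset
    return {
        "wall_hours": wall_hours,
        "instance_cost": wall_hours * p.n_instances * p.instance_usd_per_hour,
        "storage_cost": output_gb * p.s3_usd_per_gb_month * p.storage_months,
        "transfer_cost": output_gb * p.transfer_usd_per_gb,
    }


def first_interruption(spot_prices, max_price):
    """Index of the first price above max_price, or None if never exceeded.

    Mirrors the rule on slide 8: instances are terminated once the spot
    price exceeds the maximum you are willing to pay.
    """
    for i, price in enumerate(spot_prices):
        if price > max_price:
            return i
    return None


if __name__ == "__main__":
    # 1,112 is the number of ABIDE datasets from slide 15.
    for name, value in estimate_cost(1112).items():
        print(f"{name}: {value:.2f}")
    # Made-up hourly price series, for illustration only.
    print(first_interruption([0.10, 0.12, 0.30, 0.11], max_price=0.25))
```

Replaying `first_interruption` over a real spot price history, rather than the made-up series here, would be one simple way to approximate the kind of simulation mentioned on slide 18.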
