Data analysis with
Galaxy on the Cloud
Enis Afgan
100sGB
100+
How to use the cloud?
1. Get an account on the supported cloud
2. Start a master instance via a launcher app
3. Use CloudM...
Agenda details
• Launch an instance
• Demonstrate the following CloudMan
features and prepare for the data
analysis part:
...
YOUR TURN
http://bit.ly/goc-ws
Launch an instance
1. Slides @ bit.ly/goc-ws
2. Load biocloudcentral.org
3. Enter the access key and secret key
provided a...
Scaling
computation
YES YES
NO
Agenda details
• Launch an instance ✓
• Demonstrate the following
CloudMan features and prepare for
the data analysis part...
Manual scaling
• Explicitly add 1 worker node to your cluster
• Node type
corresponds to
node
processing
capacity
• Resear...
Auto-scaling
Agenda details
• Launch an instance ✓
• Demonstrate the following CloudMan
features and prepare for the data
analysis part...
On human chromosome 22,
which coding exons have
the most SNPs in them?
A Rough Plan
•Get some data
•Coding exons on chromosome 22
•SNPs on chromosome 22
•Mess with it
•Identify which exons have...
Exons, from UCSC SNPs, from UCSC
Exons, from UCSC SNPs, from UCSC
Exons, from UCSC
SNPs, from UCSC
Overlap pairings
Exons, from UCSC SNPs, from UCSC
1
1
2
Exons, from UCSC
SNPs, from UCSC
Overlap pairings
Exon overlap counts
Exons, from UCSC
1
1
2
Exon overlap counts
Exons, from UCSC
1
1
2
Exon overlap counts
1
1
2
Join on exon name
0
0
0
Exons, from UCSC
1
1
2
Exon overlap counts
1
1
2
Join on exon name
0
0
0
1
1
2
Rearrange columns w/ cut
Your turn
http://usegalaxy.org/galaxy101
Slides @ http://bit.ly/gxy-ws
Agenda details
• Launch an instance ✓
• Demonstrate the following CloudMan
features and prepare for the data
analysis part...
ource
e over
Use the terminal (or install Secure Shell for Chrome)
SSH using user ubuntu and the password you chose
when launchi...
Once logged in
• You have full system access to your instance,
including sudo; use it as any other system
• galaxy user ex...
• Edit Galaxy’s configuration
$ sudo su galaxy
$ cd /mnt/galaxy/galaxy-app
$ vi universe_wsgi.ini
allow_library_path_paste...
Controlling Galaxy
• Start/stop Galaxy application
• Add an admin user
• Use the email you registered with
S3 bucket as a data
library
• Within Galaxy, create a Data Library, using S3 bucket
path as the data source (/mnt/workshop...
Sharing-an-Instance
• Share the entire CloudMan platform
• Includes all of user data and even the customizations
• Publish...
Management
Console
Application(s)
(eg, Galaxy)
1°
2°
3°
6°, 8°
9°
Persistent
data
repository
Start CloudMan
Setup services...
CLUSTER ON THE COMMAND
LINE
Distributed Job Manager
(DRM)
• Controls job execution on a (set of)
resource(s)
• A job: an invocation of a program
• Man...
Customize your instance -
install a new tool
$ cd /mnt/galaxy/export
$ wget
http://heanet.dl.sourceforge.net/project/dnacl...
Use the new tool in the cluster
mode
1. Create a new sample shell file to run the tool; call it job_script.sh
with the fol...
Upcoming SlideShare
Loading in …5
×

Data analysis with Galaxy on the Cloud

468 views

Published on

Workshop slides demonstrating how to setup a CloudMan cluster and use Galaxy to perform the Galaxy 101 exercise.

Published in: Science, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
468
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
4
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • This is based on the Galaxy101 example for human.
  • Note to self: workshop-data bucket is owned by the Galaxy outreach acct on AWS.
  • Data analysis with Galaxy on the Cloud

    1. 1. Data analysis with Galaxy on the Cloud Enis Afgan
    2. 2. 100sGB 100+
    3. 3. How to use the cloud? 1. Get an account on the supported cloud 2. Start a master instance via a launcher app 3. Use CloudMan’s web interface on the master instance to manage the platform
    4. 4. Agenda details • Launch an instance • Demonstrate the following CloudMan features and prepare for the data analysis part: • Manual & Auto-scaling • Using an S3 bucket as a data source • Accessing an instance over ssh • Customizing an instance • Controlling Galaxy • Sharing-an-instance • Perform data analysis in Galaxy Interactionflow
    5. 5. YOUR TURN http://bit.ly/goc-ws
    6. 6. Launch an instance 1. Slides @ bit.ly/goc-ws 2. Load biocloudcentral.org 3. Enter the access key and secret key provided at http://bit.ly/ws-creds 4. Provide your email address 5. Use your initials as the cluster name 6. Set any password (and remember it) 7. Keep Large instance type 8. Start your instance Wait for the instance to start (~2-3 minutes) For more details, see http://cloudman.irb.hr
    7. 7. Scaling computation YES YES NO
    8. 8. Agenda details • Launch an instance ✓ • Demonstrate the following CloudMan features and prepare for the data analysis part: • Manual & Auto-scaling • Using an S3 bucket as a data source • Accessing an instance over ssh • Customizing an instance • Controlling Galaxy • Sharing-an-instance • Perform data analysis in Galaxy Interactionflow
    9. 9. Manual scaling • Explicitly add 1 worker node to your cluster • Node type corresponds to node processing capacity • Research use of Spot instances
    10. 10. Auto-scaling
    11. 11. Agenda details • Launch an instance ✓ • Demonstrate the following CloudMan features and prepare for the data analysis part: • Manual & Auto-scaling ✓ • Using an S3 bucket as a data source • Accessing an instance over ssh • Customizing an instance • Controlling Galaxy • Sharing-an-instance • Perform data analysis in Galaxy Interactionflow
    12. 12. On human chromosome 22, which coding exons have the most SNPs in them?
    13. 13. A Rough Plan •Get some data •Coding exons on chromosome 22 •SNPs on chromosome 22 •Mess with it •Identify which exons have SNPs •Count SNPs per exon •Visualize our results
    14. 14. Exons, from UCSC SNPs, from UCSC
    15. 15. Exons, from UCSC SNPs, from UCSC Exons, from UCSC SNPs, from UCSC Overlap pairings
    16. 16. Exons, from UCSC SNPs, from UCSC 1 1 2 Exons, from UCSC SNPs, from UCSC Overlap pairings Exon overlap counts
    17. 17. Exons, from UCSC 1 1 2 Exon overlap counts
    18. 18. Exons, from UCSC 1 1 2 Exon overlap counts 1 1 2 Join on exon name 0 0 0
    19. 19. Exons, from UCSC 1 1 2 Exon overlap counts 1 1 2 Join on exon name 0 0 0 1 1 2 Rearrange columns w/ cut
    20. 20. Your turn http://usegalaxy.org/galaxy101 Slides @ http://bit.ly/gxy-ws
    21. 21. Agenda details • Launch an instance ✓ • Demonstrate the following CloudMan features and prepare for the data analysis part: • Manual & Auto-scaling ✓ • Using an S3 bucket as a data source • Accessing an instance over ssh • Customizing an instance • Controlling Galaxy • Sharing-an-instance • Perform data analysis in Galaxy Interactionflow
    22. 22. ource
    23. 23. e over Use the terminal (or install Secure Shell for Chrome) SSH using user ubuntu and the password you chose when launching an instance: [local machine]$ ssh ubuntu@<instance IP address>
    24. 24. Once logged in • You have full system access to your instance, including sudo; use it as any other system • galaxy user exists on the system and should be used when manipulating Galaxy (sudo su galaxy) • Can submit any jobs via the standard qsub command
    25. 25. • Edit Galaxy’s configuration $ sudo su galaxy $ cd /mnt/galaxy/galaxy-app $ vi universe_wsgi.ini allow_library_path_paste = True
    26. 26. Controlling Galaxy • Start/stop Galaxy application • Add an admin user • Use the email you registered with
    27. 27. S3 bucket as a data library • Within Galaxy, create a Data Library, using S3 bucket path as the data source (/mnt/workshop-data) • This will import all the datasets into the Data Library • Import that dataset into a history
    28. 28. Sharing-an-Instance • Share the entire CloudMan platform • Includes all of user data and even the customizations • Publish a self-contained analysis • Make a note of the share-string and send it to your neighbor
    29. 29. Management Console Application(s) (eg, Galaxy) 1° 2° 3° 6°, 8° 9° Persistent data repository Start CloudMan Setup services 5° 4° 7° 10° CM-w CM-w CM-w ... FS 2 FS ... Instance block storage Contextualize image CloudMan MI CloudMan MI ... CloudMan MI CloudMan machine image 11° S3/Swift CloudMan instance Snap2 Snap.. /mnt/galaxy[Indices] /mnt/cm/paster.log cm-<hash> /usr/bin/ec2autorun.log /tmp/cm/cm_boot.log Troubleshooting 1 ° 2 ° 3 °
    30. 30. CLUSTER ON THE COMMAND LINE
    31. 31. Distributed Job Manager (DRM) • Controls job execution on a (set of) resource(s) • A job: an invocation of a program • Manages job load • Provides ability to monitor system and job status • Popular DRMs: Sun Grid Engine (SGE), Portable Batch Scheduler (PBS), TORQUE, Condor, Load Sharing Facility (LSF)
    32. 32. Customize your instance - install a new tool $ cd /mnt/galaxy/export $ wget http://heanet.dl.sourceforge.net/project/dnaclust/parallel_relea se_3/dnaclust_linux_release3.zip $ unzip dnaclust_linux_release3.zip $ cd dnaclust_linux_release3 $ chmod +x * $ cp /mnt/workshop-data/mtDNA.fasta . Get a copy of a sample dataset Don’t forget
    33. 33. Use the new tool in the cluster mode 1. Create a new sample shell file to run the tool; call it job_script.sh with the following content: #$ -cwd ./dnaclust -l -s 0.9 /mnt/workshop-data/mtDNA.fasta 2. Submit single job to SGE queue qsub job_script.sh 3. Check the queue: qstat -f 4. Job output will be in the local directory in file job_script.sh.o# 5. Start a number of instances of the script: qsub job_script.sh (*10) watch qstat –f 1. See all jobs lined up 6. See auto-scaling in (using /cloud) [1.5-2 mins] 7. Go back to command prmopt, see jobs being distributed

    ×