Writing data analysis pipeline as ruby gem

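Many tools are already available for analyzing DNA/RNA data, and combining them can reveal variants that may cause disease or cancer. But the tools are numerous, each has its own parameters and usage, and controlling every step is tedious: you have to master every analysis tool involved. This talk shows how, when chaining analysis tools written in languages such as C and Java, Ruby gems let you handle a complex analysis workflow with simple code. We wrap each tool and its parameters in a Ruby module and use a template system to assemble a robust data analysis pipeline (Cagnut). Running an analysis pipeline usually takes considerable resources and time, so with Cagnut as the core we also discuss how to write a helper gem (cagnut-cluster) that lets Cagnut use the job schedulers of different clusters, such as LSF, SGE, and Torque (PBS), to cut the time needed to analyze very large data sets.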


  1. 1. Writing Data Analysis Pipeline As Ruby Gem Shi-Gang Wang
  2. 2. About me { name: 'Shi-Gang Wang (Sean)', email: 'seanwang@goldenio.com', working_at: , role: ['software engineer'], language: 'ruby', github: 'https://github.com/seansg' }
  3. 3. Outline ❖ What is pipeline ❖ Disassemble pipeline ❖ Queue a pipeline
  4. 4. ?
  5. 5. pineapple.txt
  6. 6. pineapple.txt cat pineapple.txt
  7. 7. pineapple.txt cat pineapple.txt cat pineapple.txt | grep apple
  8. 8. pineapple.txt cat pineapple.txt cat pineapple.txt | grep apple cat pineapple.txt | grep apple | wc -l
  9. 9. Write scripts to do one thing Make scripts to work together => Pipeline
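The `cat | grep | wc -l` chain from the previous slides can be mirrored in Ruby to make the "one thing per stage" idea concrete. This is only an illustrative sketch: the lambda names and the sample text are made up, and an in-memory string stands in for pineapple.txt.

```ruby
# Each stage does exactly one thing; the stages are then chained together.
read_lines = ->(text)  { text.lines(chomp: true) }  # plays the role of `cat`
grep_apple = ->(lines) { lines.grep(/apple/) }      # plays the role of `grep apple`
count      = ->(lines) { lines.size }               # plays the role of `wc -l`

# Proc#>> composes left to right, like the shell's `|`.
pipeline = read_lines >> grep_apple >> count

sample = "pineapple\nbanana\napple pie\n"           # stand-in for pineapple.txt
puts pipeline.call(sample)
```

Each stage can be tested or replaced on its own, which is the property the talk relies on when it later splits the pipeline into gems.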
  10. 10. Take CAGNUT as an example
  11. 11. CAGNUT ❖ Computational and Analytical Gear for Nucleic acid Utilitarian Techniques ❖ DNA analysis pipeline ❖ Burrows-Wheeler Aligner (BWA) — in C ❖ Sequence Alignment/Map tools (SAMtools) — in C ❖ Genome Analysis Toolkit (GATK) — in Java ❖ Picard — in Java ❖ Generate bash scripts
  12. 12. A Genome Analysis Flowchart https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png
  13. 13. Demo
  14. 14. How to write the pipeline?
  15. 15. How to write the pipeline? How to disassemble the pipeline?
  16. 16. Think about the pipeline structure Pipeline Tools CAGNUT Core
  17. 17. Think about the pipeline structure Pipeline CAGNUT Core Tools
  18. 18. Write all parts as ruby gems
  19. 19. Benefits of ruby gems ❖ Reuse ❖ Debug ❖ Maintain ❖ Share
  20. 20. Difficulties ❖ Usage ❖ Integration of tools ❖ Execution order ❖ Automation
  21. 21. Prepare work — Define help ❖ “Help” can help you understand how to use the commands of pipeline
  22. 22. Prepare work — Namespace
  23. 23. Skills of writing the gems Part 1 — Tool gems Part 2 — Pipeline gem Part 3 — Cagnut core gem
  24. 24. Part 1 — Tool gems ❖ Tool written in Singleton ❖ Tool methods written in class ❖ Job scripts generation (diagram: Pipeline, Tools, CAGNUT Core)
  25. 25. Tool written in Singleton
  26. 26. Tool method written in class
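The code shown on these two slides is not part of the transcript, so the following is only a guess at the shape being described: one Singleton object per tool that holds shared configuration, with each tool operation living in its own small class. The `CagnutDemo` namespace, the class names, and the executable path are all hypothetical.

```ruby
require 'singleton'

module CagnutDemo # hypothetical namespace, not the real gem's
  # One shared instance per tool, holding its configuration.
  class Samtools
    include Singleton
    attr_reader :config

    def load_config(config)
      @config = config
      self
    end

    # The tool delegates each operation to a dedicated class.
    def sort(input, output)
      Sort.new(config).command(input, output)
    end
  end

  # One class per tool method; it only knows how to build its command line.
  class Sort
    def initialize(config)
      @exe = config.fetch(:samtools_path)
    end

    def command(input, output)
      "#{@exe} sort -o #{output} #{input}"
    end
  end
end

tool = CagnutDemo::Samtools.instance.load_config(samtools_path: '/usr/bin/samtools')
puts tool.sort('sample.bam', 'sample.sorted.bam')
```

Because `Samtools.instance` always returns the same object, every part of a pipeline sees the same configuration without passing it around.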
  27. 27. Get specific variables in other class ❖ Use Forwardable
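A minimal sketch of how Forwardable can expose selected values from another object: the `ToolConfig` and `Aligner` classes and their attributes are invented for illustration and are not taken from the Cagnut source.

```ruby
require 'forwardable'

# Hypothetical configuration object shared by several tools.
class ToolConfig
  attr_reader :ref_fasta, :threads

  def initialize(ref_fasta:, threads:)
    @ref_fasta = ref_fasta
    @threads = threads
  end
end

class Aligner
  extend Forwardable

  # Only these two readers are forwarded to the config object.
  def_delegators :@config, :ref_fasta, :threads

  def initialize(config)
    @config = config
  end

  def command(fastq)
    "bwa mem -t #{threads} #{ref_fasta} #{fastq}"
  end
end

aligner = Aligner.new(ToolConfig.new(ref_fasta: 'ref.fa', threads: 4))
puts aligner.command('reads.fq')
```

Delegating named readers keeps the tool class from reaching into the whole configuration object.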
  28. 28. Job script generation ❖ Use Tilt ❖ Generic interface to multiple Ruby template engines
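To keep the example free of external gems, the sketch below renders a job script with ERB from the standard library rather than with Tilt. Tilt would put the same kind of template behind its engine-independent interface. The template text, job name, and command are illustrative assumptions.

```ruby
require 'erb'

# A tiny bash job template; <%= ... %> placeholders are filled in at render time.
JOB_TEMPLATE = <<~'ERB_SOURCE'
  #!/bin/bash
  # Job: <%= job_name %>
  <%= command %> > <%= job_name %>.log 2>&1
ERB_SOURCE

def render_job(job_name:, command:)
  ERB.new(JOB_TEMPLATE).result_with_hash(job_name: job_name, command: command)
end

script = render_job(job_name: 'bwa_mem', command: 'bwa mem ref.fa reads.fq')
puts script
```

Keeping the script text in a template means the Ruby side only supplies values, so the generated bash stays easy to read and review.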
  29. 29. Part 2 — Pipeline gem ❖ Require tool gems ❖ Create workflow with tool gems ❖ Generate the job list (diagram: Pipeline, Tools, CAGNUT Core)
  30. 30. Require tool gems ❖ Loading bundle env
  31. 31. Create workflow with tool gems ❖ Composed by tool gems ❖ Order ❖ Dependency
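One way to turn "order" and "dependency" into a job list is a topological sort. The sketch below uses Ruby's standard TSort module. It is not necessarily how Cagnut orders its jobs, and the `Workflow` class and job names are hypothetical.

```ruby
require 'tsort'

# Hypothetical workflow: each job lists the jobs it must run after.
class Workflow
  include TSort

  def initialize
    @deps = {}
  end

  def add(job, after: [])
    @deps[job] = after
    self
  end

  # TSort needs to know how to walk nodes and their dependencies.
  def tsort_each_node(&block)
    @deps.each_key(&block)
  end

  def tsort_each_child(job, &block)
    @deps.fetch(job).each(&block)
  end

  # Dependencies always come before the jobs that need them.
  def job_list
    tsort
  end
end

workflow = Workflow.new
workflow.add(:gatk_haplotypecaller, after: [:picard_markdup])
        .add(:picard_markdup, after: [:samtools_sort])
        .add(:samtools_sort, after: [:bwa_mem])
        .add(:bwa_mem)

puts workflow.job_list.inspect
```

The jobs are registered in reverse on purpose: the resulting list depends only on the declared dependencies, not on the order in which jobs were added.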
  32. 32. Generate the job list
  33. 33. Part 3 — CAGNUT core gem ❖ Project template prepare ❖ Parameters handling ❖ Tool-specific methods overwrite ❖ Jobs control (diagram: Pipeline, Tools, CAGNUT Core)
  34. 34. Project template prepare ❖ Define bundle as Thor command
  35. 35. Parameter handling ❖ Use OptionParser
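A small OptionParser sketch follows. The option names (`--threads`, `--queue`) and the banner are invented for illustration and are not Cagnut's actual command-line interface. OptionParser also generates the `--help` text from these definitions, which fits the earlier "Define help" preparation step.

```ruby
require 'optparse'

# Parse pipeline options and return [options, remaining positional arguments].
def parse_args(argv)
  options = { threads: 1, queue: nil }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: demo-pipeline [options] SAMPLE'
    opts.on('-t', '--threads N', Integer, 'Threads per job') { |n| options[:threads] = n }
    opts.on('-q', '--queue NAME', 'Cluster queue name') { |q| options[:queue] = q }
  end

  rest = parser.parse(argv) # non-destructive; returns what is left over
  [options, rest]
end

options, samples = parse_args(%w[-t 8 --queue normal sample1])
puts options.inspect
puts samples.inspect
```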
  36. 36. Tool-specific method overwrite ❖ One tool, One configuration ❖ Using “Prepend” to overwrite dev.af83.com/2012/10/19/ruby-2-0-module-prepend.html
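The following sketch shows how `Module#prepend` can override one method of a tool while still reaching the original through `super`. The `Gatk` class, the `ClusterOverrides` module, and the memory values are assumptions made for the example.

```ruby
# Hypothetical tool class with a default configuration.
class Gatk
  def java_opts
    '-Xmx4g'
  end

  def command
    "java #{java_opts} -jar gatk.jar"
  end
end

# A prepended module sits in front of the class in the method lookup chain,
# so its java_opts runs first and can call the original via super.
module ClusterOverrides
  def java_opts
    super.sub('4g', '16g')
  end
end

Gatk.prepend(ClusterOverrides)

puts Gatk.new.command
puts Gatk.ancestors.first(2).inspect
```

Unlike reopening the class, the override lives in its own module, so one tool can get one configuration without editing the tool gem itself.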
  37. 37. Jobs Control — desktop run ❖ wait $! ❖ detach
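In bash, `wait $!` blocks until the most recent background job finishes. The Ruby sketch below shows the analogous calls on the `Process` module, with blocking waits for steps that must finish in order and `Process.detach` for fire-and-forget jobs. The commands are placeholders, not real pipeline steps.

```ruby
# Blocking run: start a child process and wait for it, like `cmd & wait $!`.
pid = Process.spawn('echo', 'step-1', out: File::NULL)
_, status = Process.wait2(pid)
abort 'step-1 failed' unless status.success?

# Detached run: a watcher thread reaps the child so it never becomes a zombie,
# and the main program is free to continue.
bg_pid = Process.spawn('sleep', '0.1')
watcher = Process.detach(bg_pid)
puts "background job #{bg_pid} detached"

# The watcher thread's value is the child's exit status, if it is needed later.
puts "background job succeeded: #{watcher.value.success?}"
```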
  38. 38. If the data is large or much larger, like the human genome
  39. 39. The size of the human genome is 3 x 10^9 base pairs (bps)
  40. 40. Each base pair takes 2 bits (you can use 00, 01, 10, and 11 for T, G, C and A) 2 x 3 x 10^9 bits = 6 x 10^9 bits = 7.5 x 10^8 bytes = ~700 MB
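The arithmetic on this slide can be checked in a few lines. The variable names are illustrative, and the last step converts to binary megabytes (MiB), which is where the "about 700 MB" figure comes from.

```ruby
base_pairs       = 3 * 10**9  # size of the human genome in base pairs
bits_per_base    = 2          # 00, 01, 10, 11 encode T, G, C, A

bits  = base_pairs * bits_per_base  # 6 x 10^9 bits
bytes = bits / 8                    # 7.5 x 10^8 bytes
mib   = bytes / (1024.0**2)         # roughly 715 MiB

puts "bits:  #{bits}"
puts "bytes: #{bytes}"
puts "MiB:   #{mib.round(1)}"
```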
  41. 41. In a perfect world: ~700 MB (just 3 billion letters) In the real world: ~200 GB (right off the genome sequencer)
  42. 42. Crash your desktop/laptop!
  43. 43. Long wait …
  44. 44. Resource allocation
  45. 45. Resource allocation ❖ Specifying the memory used by the program ❖ Using Queueing System
  46. 46. What is Queueing System?
  47. 47. Queueing System ❖ Queue: the list of waiting jobs ❖ Queueing System: waiting jobs + servers (diagram: waiting jobs A, B, C, D are dispatched to Server 1, Server 2, ... Server n, and finished jobs leave the system)
  48. 48. In a desktop computer
  49. 49. Cluster Queues
  50. 50. Queueing System ❖ Pros ❖ Jobs scheduling ❖ Load balancing ❖ Batch jobs execution
  51. 51. Queueing System ❖ Portable Batch System (PBS) ❖ Sun Grid Engine (SGE) ❖ Load Sharing Facility (LSF)
  52. 52. Submit jobs to Queueing System ❖ Take LSF as an example ❖ Creating a job script ❖ Submitting the job
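The job script itself is not in the transcript, so the following is a hedged sketch of what "creating a job script" for LSF could look like when generated from Ruby. The `lsf_script` helper, job names, and queue name are hypothetical. The `#BSUB` directives used (`-J` job name, `-q` queue, `-n` slots, `-o` output file, `-w` dependency) are standard `bsub` options.

```ruby
# Build the text of an LSF job script; `after` names a job that must finish first.
def lsf_script(name:, command:, queue:, cores: 1, after: nil)
  lines = [
    '#!/bin/bash',
    "#BSUB -J #{name}",          # job name
    "#BSUB -q #{queue}",         # target queue
    "#BSUB -n #{cores}",         # number of slots
    "#BSUB -o #{name}.%J.out"    # output file; %J expands to the job ID
  ]
  lines << %(#BSUB -w "done(#{after})") if after  # run only after that job succeeds
  lines << command
  lines.join("\n") + "\n"
end

script = lsf_script(
  name: 'samtools_sort',
  command: 'samtools sort -o sample.sorted.bam sample.bam',
  queue: 'normal',
  cores: 4,
  after: 'bwa_mem'
)
puts script
```

Submitting is then a matter of feeding the saved file to the scheduler, for example `bsub < samtools_sort.sh`. The dependency line lets the queueing system enforce the execution order instead of the desktop-style `wait`.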
  53. 53. Demo ❖ Submit jobs to cluster
  54. 54. Acknowledgement https://cagnut.golden.io https://goldenio.com
  55. 55. Thanks
  56. 56. Backup
