Writing data analysis pipeline as ruby gem

Writing Data Analysis Pipeline As Ruby Gem
Shi-Gang Wang
About me
{
name: 'Shi-Gang Wang (Sean)',
email: 'seanwang@goldenio.com',
working_at: ,
role: ['software engineer'],
language: 'ruby',
github: 'https://github.com/seansg'
}
Outline
❖ What is a pipeline
❖ Disassemble a pipeline
❖ Queue a pipeline
What is a pipeline?
pineapple.txt
cat pineapple.txt
cat pineapple.txt | grep apple
cat pineapple.txt | grep apple | wc -l
Write scripts that each do one thing
Make the scripts work together
=> Pipeline
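The same idea can be sketched in Ruby: each stage does one thing, and the pipeline is just the stages applied in order. The stage names (`CAT`, `GREP`, `COUNT`), the `run_pipeline` helper, and the sample text are illustrative assumptions, not part of the talk.

```ruby
# Each stage does one thing; a pipeline is the stages applied in order.
# Stage names and the sample text are illustrative assumptions.
CAT   = ->(text)  { text.lines.map(&:chomp) } # like `cat`: produce the lines
GREP  = ->(lines) { lines.grep(/apple/) }     # like `grep apple`: keep matching lines
COUNT = ->(lines) { lines.size }              # like `wc -l`: count the lines

# Feed the output of each stage into the next one.
def run_pipeline(input, *stages)
  stages.reduce(input) { |data, stage| stage.call(data) }
end

SAMPLE  = "pen\npineapple\napple\npen\n"
MATCHES = run_pipeline(SAMPLE, CAT, GREP, COUNT)
puts MATCHES
```

Because every stage has the same shape (one input, one output), stages can be reordered or swapped without touching the others.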
Take CAGNUT as an example
CAGNUT
❖ Computational and Analytical Gear for Nucleic acid Utilitarian Techniques
❖ DNA analysis pipeline
❖ Burrows-Wheeler Aligner (BWA) — in C
❖ Sequence Alignment/Map tools (SAMtools) — in C
❖ Genome Analysis Toolkit (GATK) — in Java
❖ Picard — in Java
❖ Generate bash scripts
A Genome Analysis Flowchart
https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png
Demo
How to write the pipeline?
How to disassemble the pipeline?
Think about the pipeline structure
[Diagram: Pipeline, Tools, and CAGNUT Core as parts that fit together]
Write all parts as ruby gems
Benefits of ruby gems
❖ Reuse
❖ Debug
❖ Maintain
❖ Share
Difficulties
❖ Usage
❖ Integration of tools
❖ Execution order
❖ Automation
Prepare work — Define help
❖ "Help" shows you how to use the pipeline's commands
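As a rough illustration of defining help up front, the sketch below builds a help screen with Ruby's standard `OptionParser`. The command name `mypipe` and its subcommands are hypothetical; the real CAGNUT commands are not shown in the slides.

```ruby
require 'optparse'

# Build the help text for a hypothetical `mypipe` command.
# The command and subcommand names are assumptions for illustration.
def build_parser
  OptionParser.new do |opts|
    opts.banner = 'Usage: mypipe COMMAND [options]'
    opts.separator ''
    opts.separator 'Commands:'
    opts.separator '  new NAME    create an analysis folder'
    opts.separator '  run         generate and run the job list'
    opts.separator ''
    opts.separator 'Options:'
    opts.on('-h', '--help', 'Show this help') { puts opts }
  end
end

HELP_TEXT = build_parser.help
puts HELP_TEXT
```

Writing the help text first forces the command layout to be decided before any tool code exists.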
Prepare work — Namespace
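A minimal sketch of why the namespace matters: two tool gems can each define a class called `Config` without clashing, as long as each lives in its own module. The module and class names here are hypothetical.

```ruby
# Two classes share the short name Config; the enclosing modules keep them apart.
# MyPipeline, Bwa, and Samtools are hypothetical namespace names.
module MyPipeline
  module Bwa
    class Config
      def tool_name
        'bwa'
      end
    end
  end

  module Samtools
    class Config
      def tool_name
        'samtools'
      end
    end
  end
end

puts MyPipeline::Bwa::Config.new.tool_name
puts MyPipeline::Samtools::Config.new.tool_name
```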
Techniques for writing the gems
Part 1 — Tool gems
Part 2 — Pipeline gem
Part 3 — Cagnut core gem
Part 1 — Tool gems
❖ Write each tool as a singleton
❖ Write each tool method as a class
❖ Generate job scripts
Write the tool as a singleton
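One way this could look, using Ruby's standard `Singleton` module: the tool has exactly one instance, and that instance carries the tool's configuration. The class name `BwaTool` and the configuration values are assumptions for illustration.

```ruby
require 'singleton'

# Hypothetical tool wrapper: one shared instance holds the tool's configuration.
class BwaTool
  include Singleton

  attr_reader :config

  def initialize
    # Assumed values; a real tool would read these from its configuration file.
    @config = { path: '/usr/local/bin/bwa', threads: 4 }
  end
end

# Every caller gets the same object, so the configuration cannot diverge.
puts BwaTool.instance.equal?(BwaTool.instance)
```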
Write each tool method as a class
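A sketch of the idea, assuming a hypothetical `Align` class for one tool method. The class only builds the command text that will later be written into a job script; it does not run anything. The file names are made up.

```ruby
# Hypothetical class for one tool method ("align"); it only builds the command text.
class Align
  def initialize(reference:, reads:, output:, threads: 1)
    @reference = reference
    @reads     = reads
    @output    = output
    @threads   = threads
  end

  # The shell command this method contributes to a job script.
  def command
    "bwa mem -t #{@threads} #{@reference} #{@reads} > #{@output}"
  end
end

ALIGN_CMD = Align.new(reference: 'ref.fa', reads: 'sample.fq',
                      output: 'sample.sam', threads: 4).command
puts ALIGN_CMD
```

Keeping each method in its own class gives one place to change when that method's command line needs adjusting.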
Access specific variables from another class
❖ Use Forwardable
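A small sketch of `Forwardable` in this role: the method class delegates a few readers to a configuration object instead of copying them or subclassing. `ToolConfig` and `SortJob` are hypothetical names.

```ruby
require 'forwardable'

# Hypothetical configuration object shared by a tool's method classes.
ToolConfig = Struct.new(:tool_path, :threads)

class SortJob
  extend Forwardable

  # tool_path and threads are answered by @config, not redefined here.
  def_delegators :@config, :tool_path, :threads

  def initialize(config)
    @config = config
  end

  def command(input)
    "#{tool_path} sort -@ #{threads} #{input}"
  end
end

SORT_CMD = SortJob.new(ToolConfig.new('samtools', 2)).command('in.bam')
puts SORT_CMD
```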
Generate job scripts
❖ Use Tilt
❖ Generic interface to multiple Ruby template engines
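The slide names Tilt; to stay within the standard library, the sketch below uses ERB directly, which is one of the engines Tilt can wrap. The template text and the `render_job` helper are assumptions, not the templates CAGNUT ships.

```ruby
require 'erb'

# Assumed job script template; ERB stands in for Tilt here.
TEMPLATE = <<~'SCRIPT'
  #!/bin/bash
  # job: <%= name %>
  <%= command %>
SCRIPT

# Fill the template with one job's name and command and return the script text.
def render_job(name:, command:)
  ERB.new(TEMPLATE).result_with_hash(name: name, command: command)
end

JOB_SCRIPT = render_job(name: 'align', command: 'bwa mem ref.fa sample.fq > sample.sam')
puts JOB_SCRIPT
```

The rendered text can then be written to a file and executed like any other bash script.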
Part 2 — Pipeline gem
❖ Require tool gems
❖ Create workflow with tool gems
❖ Generate the job list
Require tool gems
❖ Load the Bundler environment
Create workflow with tool gems
❖ Composed of tool gems
❖ Order
❖ Dependency
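Order and dependency can be illustrated with Ruby's standard `TSort` module, which sorts steps so that every dependency comes first. This is a sketch of the general technique, not necessarily how CAGNUT orders its jobs; the `Workflow` class and step names are hypothetical.

```ruby
require 'tsort'

# Hypothetical workflow: each step lists the steps it depends on.
class Workflow
  include TSort

  def initialize(steps)
    @steps = steps
  end

  # TSort hook: enumerate every step.
  def tsort_each_node(&block)
    @steps.each_key(&block)
  end

  # TSort hook: enumerate the dependencies of one step.
  def tsort_each_child(step, &block)
    @steps.fetch(step).each(&block)
  end

  # Steps in an order where every dependency comes before its dependents.
  def order
    tsort
  end
end

STEPS = {
  'call_variants' => ['sort'],
  'sort'          => ['align'],
  'align'         => []
}
ORDER = Workflow.new(STEPS).order
puts ORDER.inspect
```

A circular dependency makes `tsort` raise `TSort::Cyclic`, which is a cheap way to catch a badly wired workflow before any job is generated.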
Generate the job list
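A minimal sketch of a job list, assuming each job is already a bash script on disk: the list is itself a bash script that runs the job scripts one after another. The `job_list` helper and the file names are hypothetical.

```ruby
# Hypothetical job list writer: one line per job script, in execution order.
def job_list(scripts)
  lines = ['#!/bin/bash', 'set -e'] # set -e stops the list at the first failing job
  scripts.each { |script| lines << "bash #{script}" }
  lines.join("\n") + "\n"
end

JOB_LIST = job_list(%w[align.sh sort.sh call_variants.sh])
puts JOB_LIST
```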
Part 3 — CAGNUT core gem
❖ Prepare the project template
❖ Handle parameters
❖ Override tool-specific methods
❖ Control jobs
Prepare the project template
❖ Define bundle as a Thor command
Parameter handling
❖ Use OptionParser
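A sketch of handling the parameters in one place with `OptionParser`. The option names `--config` and `--pipeline` and the default file name are assumptions; the slides do not list the real flags.

```ruby
require 'optparse'

# Parse command-line arguments into one options hash.
# The flag names and defaults are assumptions for illustration.
def parse_args(argv)
  options = { config: 'config.yml', pipeline: nil }
  parser = OptionParser.new do |opts|
    opts.on('-c', '--config FILE', 'Configuration file') { |v| options[:config] = v }
    opts.on('-p', '--pipeline NAME', 'Pipeline gem to use') { |v| options[:pipeline] = v }
  end
  parser.parse!(argv.dup) # work on a copy so the caller's array is untouched
  options
end

puts parse_args(%w[--pipeline dna -c my.yml]).inspect
```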
Override tool-specific methods
❖ One tool, one configuration
❖ Use "prepend" to override
dev.af83.com/2012/10/19/ruby-2-0-module-prepend.html
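A short sketch of overriding with `Module#prepend`: the prepended module sits in front of the class in the method lookup chain, so its method runs first and can still call the original through `super`. The `Picard` class and `PicardOverrides` module are hypothetical stand-ins for a tool and its tool-specific configuration.

```ruby
# Hypothetical tool class with a default setting.
class Picard
  def java_options
    ['-Xmx4g']
  end

  def command
    "java #{java_options.join(' ')} -jar picard.jar"
  end
end

# Tool-specific override; super reaches the original Picard#java_options.
module PicardOverrides
  def java_options
    super + ['-Djava.io.tmpdir=/tmp']
  end
end

Picard.prepend(PicardOverrides)

PICARD_CMD = Picard.new.command
puts PICARD_CMD
```

Unlike reopening the class, the original method stays intact and reachable, so the override can extend it instead of replacing it.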
Jobs Control — desktop run
❖ wait $!
❖ detach
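Both approaches can be sketched from Ruby. `Process.wait2` blocks until a background job ends, much like `wait $!` in bash, while `Process.detach` starts a watcher thread that reaps the child so it does not linger as a zombie. The trivial child commands here are placeholders for real job scripts.

```ruby
require 'rbconfig'

# Start a job in the background and return its pid (like `cmd &` in bash).
def start_job(*command)
  Process.spawn(*command)
end

# Like `wait $!`: block until the job ends and collect its exit status.
pid_a = start_job(RbConfig.ruby, '-e', 'exit 0')
STATUS_A = Process.wait2(pid_a).last

# detach: a watcher thread reaps the child; its value is the exit status.
pid_b = start_job(RbConfig.ruby, '-e', 'exit 0')
WATCHER = Process.detach(pid_b)
STATUS_B = WATCHER.value

puts STATUS_A.success?
puts STATUS_B.success?
```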
If the data is large, or much larger, like the human genome
The size of the human genome is 3 × 10^9 base pairs (bp)
Each base pair takes 2 bits
(you can use 00, 01, 10, and 11 for T, G, C and A)
2 × 3 × 10^9 bits
= 6 × 10^9 bits
= 7.5 × 10^8 bytes
= ~700 MB
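The arithmetic above can be checked in a few lines of Ruby, together with a toy 2-bit encoder that uses the same code assignment as the slide. The `pack_bases` helper is a hypothetical illustration, not part of CAGNUT.

```ruby
BASE_PAIRS    = 3 * 10**9
BITS_PER_BASE = 2

TOTAL_BITS  = BASE_PAIRS * BITS_PER_BASE # 6 x 10^9 bits
TOTAL_BYTES = TOTAL_BITS / 8             # 7.5 x 10^8 bytes
MEBIBYTES   = TOTAL_BYTES / 1024.0**2    # roughly 700 MB

# Same assignment as the slide: 00, 01, 10, 11 for T, G, C, A.
ENCODING = { 'T' => 0b00, 'G' => 0b01, 'C' => 0b10, 'A' => 0b11 }.freeze

# Pack a short sequence into an integer, 2 bits per base.
def pack_bases(sequence)
  sequence.each_char.reduce(0) { |acc, base| (acc << 2) | ENCODING.fetch(base) }
end

puts TOTAL_BYTES
puts MEBIBYTES.round
puts pack_bases('GATC').to_s(2).rjust(8, '0')
```

Four bases fit in one byte, which is where the factor of 8 in the byte count comes from.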
In a perfect world: ~700 MB
(just 3 billion letters)
In the real world: ~200 GB
(right off the genome sequencer)
Crash your desktop/laptop!
Long wait …
Resource allocation
❖ Specify the memory used by the program
❖ Use a queueing system
What is a queueing system?
Queueing System
[Diagram: waiting jobs A, B, C, D queue up, the system dispatches each job to Server 1, Server 2, ... Server n, and finished jobs leave]
❖ Queue: the list of waiting jobs
❖ Queueing system: waiting jobs + servers
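A toy simulation of the picture above, with Ruby threads standing in for servers. It is only a sketch of the concept; a real queueing system also tracks the memory and CPU each job needs. The `run_queue` helper is hypothetical.

```ruby
# Hypothetical simulation: a queue of waiting jobs served by n worker threads.
def run_queue(jobs, servers:)
  waiting  = Queue.new
  finished = Queue.new
  jobs.each { |job| waiting << job }
  waiting.close # once the queue is empty, pop returns nil and workers stop

  workers = Array.new(servers) do |i|
    Thread.new do
      while (job = waiting.pop)
        finished << [job, "server #{i + 1}"]
      end
    end
  end
  workers.each(&:join)

  # Collect [job, server] pairs in the order the jobs finished.
  Array.new(finished.size) { finished.pop }
end

RESULTS = run_queue(%w[A B C D], servers: 2)
RESULTS.each { |job, server| puts "#{job} ran on #{server}" }
```

Which server takes which job depends on thread scheduling, so only the set of finished jobs is predictable, not the assignment.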
In a desktop computer
Cluster Queues
Queueing System
❖ Pros
❖ Job scheduling
❖ Load balancing
❖ Batch job execution
Queueing System
❖ Portable Batch System (PBS)
❖ Sun Grid Engine (SGE)
❖ Load Sharing Facility (LSF)
Submit jobs to a queueing system
❖ Take LSF as an example
❖ Creating a job script
❖ Submitting the job
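The two steps can be sketched as a small Ruby helper that builds the `bsub` command line for a job script. It only builds the text and does not submit anything. `-J` names the job and `-w` takes a dependency expression; the helper name, script names, and job names are assumptions.

```ruby
# Hypothetical helper: builds a bsub command line; it does not run it.
def bsub_command(script, name:, after: nil)
  parts = ['bsub', '-J', name]
  # Run only after the named job has finished successfully.
  parts += ['-w', "'done(\"#{after}\")'"] if after
  parts << '<' << script
  parts.join(' ')
end

FIRST_SUBMIT  = bsub_command('align.sh', name: 'align')
SECOND_SUBMIT = bsub_command('sort.sh', name: 'sort', after: 'align')
puts FIRST_SUBMIT
puts SECOND_SUBMIT
```

Naming each job is what makes the dependency expressible, since the second submission refers to the first by name.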
Demo
❖ Submit jobs to cluster
Acknowledgement
https://cagnut.golden.io
https://goldenio.com
Thanks
Backup

Editor's Notes

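  1. This is today's outline: what a pipeline is, how to disassemble a pipeline along with some techniques, and finally how to queue a pipeline.
  2. What is a pipeline?
  3. I believe everyone is familiar with this. A pipeline is really just like PPAP.
  4. For example, say I have a file called pineapple.txt. To look at its contents on the command line, I first use cat to dump them out.
  5. Once I can see the contents, if something interests me, I can use grep to pull out the parts that contain a keyword.
  6. After extracting the matches, if I want to know how many there are, I can count them with wc -l.
  7. In this way, through the command line, scripts, or other programs, we filter and process the data step by step into information that is useful to us.
  8. So each script or program does only one thing at a time, and chaining them together in order, step by step, is a pipeline.
  9. That is why a pipeline is like PPAP: everything gets strung together.
  10. Next I will use CAGNUT, a gem we built, as the example.
  11. CAGNUT is a DNA analysis pipeline. It includes many tools, for example tools written in C and in Java, and it has to generate executable bash scripts.
  12. This figure shows the usual steps of a DNA analysis workflow. Different program tools are called at different data-processing steps.
  13. In CAGNUT we turn these tools, and even the pipeline itself, into individual gems, and build the DNA analysis pipeline by combining gems. The FASTQ files produced when DNA is sequenced by the machine go through the CAGNUT analysis pipeline, where the different tools generate bash scripts. After the computer runs them, outputs in different formats are produced for our reference. So the goal of CAGNUT is to generate executable bash scripts.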
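  14. So at this point you might expect me to explain how to write the pipeline.
  15. Why not write the pipeline? The focus here is actually on how to disassemble the pipeline, because that makes each part of the pipeline easier to modify and maintain, and more flexible.
  16. First, picture the structure of a pipeline. A pipeline uses many tools. What if we could split these tools into separate parts and then use the core to join the pipeline and the tools?
  17. Like a jigsaw puzzle, the parts are put together by combination. Tools can even be swapped for different ones as the occasion requires, and different pipelines can use the same tools. Wouldn't that be nice? So how do we do it?
  18. Write all of them as Ruby gems, and the problem is solved.
  19. Writing them as gems has many benefits. Different pipelines can reuse the gem for the same tool, the cause of a bug can be traced tool by tool, maintenance is easier, and the gems can even be shared with other people.
  20. But we ran into many problems during the rewrite: how to use things once they are gems, how to integrate tools written in different languages, how to arrange the execution order, and how to run jobs automatically. Next I will go through the techniques we used to solve these problems.
  21. A pipeline is mostly used by typing commands directly on the command line, but as a gem it works a little differently. So the first piece of preparation is to define help. Help lets us understand how to use the gem's packages and commands.
  22. Next is defining a namespace. Why? Maybe your name is also Rogers (or you think you look like Rogers), and Captain America happens to be called Rogers too. If someone shouts "Rogers" in the street, a whole crowd turns around, which is not ideal. Ruby has namespaces to solve this problem.
  23. Once the preparation is done, we start disassembling the pipeline. It can be split into three parts: tool, pipeline, and core. Next I will explain the techniques we used for each of the three.
  24. For the tool part, I will explain how a job is processed.
  25. Write each tool as a singleton. The pipeline uses the tool's methods, so we only need one instance. Also, each tool has its own configuration, which must not change depending on which method is used.
  26. A tool has many methods. Our goal is to write what gets executed into a script, so we write each method as a class. First, this makes the tool's methods easier to manage. Second, we can modify the content we want to change.
  27. But these classes often need variables set in the configuration file. Here we can use Forwardable to avoid writing the same code repeatedly, and to avoid creating a child class and overriding methods. It is probably exactly what you imagine it to be: a way to avoid repetition and keep things working efficiently.
  28. Next we write out the content of the script to be executed. Why write a script? Because running a script lets us … and it reduces the difficulty of integrating the tools. Here we use Tilt as the template tool to write the content the script should execute into a file. Running the script then sends it to the computer to be computed as a job.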
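  29. Next is the pipeline gem.
  30. Before the pipeline can be used, it must be able to recognize the tool gems, so the pipeline has to load the Gemfile and Bundler settings.
  31. With the tools written, the next step is to chain them into the pipeline we need. The things to watch here are the order in which the tools are used and the dependencies between jobs.
  32. Once assembled, the next step is to generate a list that can run all the jobs. Running the list executes every job in order.
  33. Before the pipeline runs, a folder is generated first. It is the analysis folder we just saw in the demo. We use Thor to generate it. After generation, though, it must be able to load the Gemfile and run bundle automatically, so we add the bundle command here.
  34. The pipeline uses many parameters, including the configuration file to use and which pipeline gem to use, so we use OptionParser to handle the parameters in one place.
  35. Because each tool has its own way of being configured, we use prepend here to override methods, so each tool can make changes to suit its own needs.
  36. To handle dependencies, when running on a single machine we run every job in the background and pair it with wait, which waits for the background job to finish. But wait cannot deal with a zombie job, so we additionally use Ruby's detach, which avoids waiting forever once a job has become a zombie.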
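  37. The human genome has 3 × 10^9 bp. That may be hard to imagine, so
  38. Suppose 1 bp takes 2 bits. Then the human genome needs about 700 MB.
  39. But after machine sequencing, one person's DNA comes to about 200 GB because it includes other information. And studies abroad typically work in units of a hundred people or more, so
  40. If you ran that on your own computer, it would probably
  41. So we have to do resource allocation.
  42. There are two ways to allocate resources. One is to give different programs different memory or CPU settings. The other is to use a queueing system.
  43. Queue means waiting in line. A queueing system is a system that properly distributes the queued jobs to servers that have enough resources to do the computation.
  44. On a personal computer we almost always work in units of processes or threads on the cores to do multitasking or parallel computation.
  45. A queueing system, however, uses the concept of clusters, and each cluster is measured by how many CPUs it has.
  46. The benefits of a queueing system are that it can manage jobs, balance the load across machines, and run jobs in batches.
  47. There are currently three major queueing systems: PBS, SGE, and LSF. PBS has an open source version.
  48. So big data like the human genome is best sent to a queueing system for computation. How do we send a job to the queue? Taking LSF as the example: first write the jobs to be done into a script, then submit the job with bsub, LSF's own command. You can give the job a name. If there is a second job, you can use the wait parameter so that it runs after the first job has finished.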