SlideShare a Scribd company logo
基因大数据分析入门
@5fei
周五 11月11
议程
1:基因数据分析行业的现状:为什么hadoop没有被基因行业采用
A1: SGE VS 阿里批量计算 VS ADAM
A2: 安诺基因大数据路线图
2: ADAM项目介绍:ADAM为基因行业的分析带来什么好处
3:实验项目的整个架构和代码演示
4:实验项目 VS 原有程序
5:生信人员的工具使用:ADAM-SHELL ADAM-SUBMIT
6:开发环境简单介绍:IDEA,JAVA,SCALA,SBT,SPARK,ADAM
基因数据分析行业的现状
1:现在大多还是使用的LSF/SGE
2:科研机构多使用已经存在的服务器,而AWS,GOOGLECLOUD收费而且对其使用也不熟悉
3:处理1百万条的VCF用24个小时相对于测序是可以接受的
4:科研人员倾向于优化算法,而对数据的处理技术放在次要考虑的位置
5: hadoop对于基因处理的两个重要瓶颈:依赖网络传输数据和通过磁盘IO访问数据
SGE VS 阿里批量计算 VS ADAM
SGE
其本质是对生产环境物理服务器资源池化(开发环境也可以是虚机),它为任务的执行提供
物理服务器保证
缺点:
提交的任务只能在一台服务器上执行,其并发能力的受限于服务器cpu数目(任务开发人员也可
以自己做任务的拆分和汇总提高并发能力【重新发明轮子】,但开发难度大,代码复用率低,
容易出错)
阿里批量计算
缺点:
1: 集群的节点用户不能独立管理,连ssh访问都没有提供(所谓的节点其实是个容器?)
2:只能在经典网络创建,VPC不支持
3:没有分布式文件系统的产品支持(OSSFS不成熟,社区开发和维护不活跃)
ADAM基因数据分析平台
其本质是分布式计算,以公有云和私有云作为运行环境,用户提交的任务会被基因数据分析平
台自动切分并被投放到集群所有节点执行,并发能力只受限于集群的规模(执行时间和并发能
力成反比),资源利用率迅速提高。
优点:
任务的开发代码里面只包含业务代码(即基因数据分析代码),而关于任务的切分,并发控制
和同步由ADAM在后台完成,从而极大的降低了任务开发难度,而且和底层和系统调用松耦
合,任务可以轻松的适配新的平台, 产品从设计,开发到市场投放的周期大大的缩短,为公司在
未来的竞争提供强大的技术支持
缺点:
1: 原有的代码需要做迁移改造
安诺基因大数据路线图
ADAM项目介绍
ADAM是专门用于基因数据的处理和存储格式:主要好处
1:并行
2:存储空间小
Broad Institute‘s GATK 从V4版本开始支持adam
项目功能介绍
$ samtools view sample.rmdup.bam | more
项目功能介绍
$ cat win_100k.use_50mer | more
代码演示
实验项目的概览
实验项目 VS 原有程序
实验项目目前使用的scala语言实现的(也可用java或python或R)
原有程序代码片段
实验项目代码片段
生信人员的工具使用:ADAM-SHELL
$adam-shell
开发环境简单介绍
参考
● https://github.com/bigdatagenomics/adam
https://software.broadinstitute.org/gatk/
● https://www.biostars.org/
THANK YOU

More Related Content

Viewers also liked

Effective ansible
Effective ansibleEffective ansible
Effective ansible
Wu Bigo
 
Big Process for Big Data
Big Process for Big DataBig Process for Big Data
Big Process for Big Data
Ian Foster
 
CI4CC sustainability-panel
CI4CC sustainability-panelCI4CC sustainability-panel
CI4CC sustainability-panel
Ravi Madduri
 
HL7: Clinical Decision Support
HL7: Clinical Decision SupportHL7: Clinical Decision Support
HL7: Clinical Decision Support
Harvard Medical School, Partners Healthcare
 
Role of Amyloid Burden in cognitive decline
Role of Amyloid Burden in cognitive decline Role of Amyloid Burden in cognitive decline
Role of Amyloid Burden in cognitive decline
Ravi Madduri
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Dan Taylor
 
Public.Cdsc.Middleton
Public.Cdsc.MiddletonPublic.Cdsc.Middleton
Jsm madduri-august-2015
Jsm madduri-august-2015Jsm madduri-august-2015
Jsm madduri-august-2015
Ravi Madduri
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
Ravi Madduri
 
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWSExperiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Ed Dodds
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
Yahoo Developer Network
 
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh RaskarWhat is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
Camera Culture Group, MIT Media Lab
 
Raskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 NovemberRaskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 November
Camera Culture Group, MIT Media Lab
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS Analysis
Ravi Madduri
 
Google Glass Breakdown
Google Glass BreakdownGoogle Glass Breakdown
Google Glass Breakdown
Camera Culture Group, MIT Media Lab
 

Viewers also liked (19)

Effective ansible
Effective ansibleEffective ansible
Effective ansible
 
Supporting Barack Obama for President
Supporting Barack Obama for PresidentSupporting Barack Obama for President
Supporting Barack Obama for President
 
Big Process for Big Data
Big Process for Big DataBig Process for Big Data
Big Process for Big Data
 
CI4CC sustainability-panel
CI4CC sustainability-panelCI4CC sustainability-panel
CI4CC sustainability-panel
 
HL7: Clinical Decision Support
HL7: Clinical Decision SupportHL7: Clinical Decision Support
HL7: Clinical Decision Support
 
Role of Amyloid Burden in cognitive decline
Role of Amyloid Burden in cognitive decline Role of Amyloid Burden in cognitive decline
Role of Amyloid Burden in cognitive decline
 
Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2Internet2 Bio IT 2016 v2
Internet2 Bio IT 2016 v2
 
Public.Cdsc.Middleton
Public.Cdsc.MiddletonPublic.Cdsc.Middleton
Public.Cdsc.Middleton
 
Jsm madduri-august-2015
Jsm madduri-august-2015Jsm madduri-august-2015
Jsm madduri-august-2015
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWSExperiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
 
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer toolsMay 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
 
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh RaskarWhat is SIGGRAPH NEXT? Intro by Ramesh Raskar
What is SIGGRAPH NEXT? Intro by Ramesh Raskar
 
Raskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 NovemberRaskar UIST Keynote 2015 November
Raskar UIST Keynote 2015 November
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS Analysis
 
Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)Leap Motion Development (Rohan Puri)
Leap Motion Development (Rohan Puri)
 
Stereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt HirschStereo and 3D Displays - Matt Hirsch
Stereo and 3D Displays - Matt Hirsch
 
Multiview Imaging HW Overview
Multiview Imaging HW OverviewMultiview Imaging HW Overview
Multiview Imaging HW Overview
 
Google Glass Breakdown
Google Glass BreakdownGoogle Glass Breakdown
Google Glass Breakdown
 

Similar to 基因大数据分析入门 Slideshare

Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里
li luo
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructurekyhpudding
 
Big Java, Big Data
Big Java, Big DataBig Java, Big Data
Big Java, Big Data
Kuo-Chun Su
 
Oracle Golden Gate Introduction
Oracle Golden Gate IntroductionOracle Golden Gate Introduction
Oracle Golden Gate Introduction
jenkin
 
Hadoop Big Data 成功案例分享
Hadoop Big Data 成功案例分享Hadoop Big Data 成功案例分享
Hadoop Big Data 成功案例分享
Etu Solution
 
選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲
Herman Wu
 
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
NTC.im(Notch Training Center)
 
Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势
Xicheng Dong
 
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China MobileCeph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Danielle Womboldt
 
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China MobileCeph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Community
 
大规模数据处理
大规模数据处理大规模数据处理
大规模数据处理Kay Yan
 
大规模数据处理
大规模数据处理大规模数据处理
大规模数据处理airsex
 
Android -汇博
Android -汇博Android -汇博
Android -汇博dlqingxi
 
Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)
Jazz Yao-Tsung Wang
 
Hbase optimization and apply summary in taobao
Hbase optimization and apply summary in taobaoHbase optimization and apply summary in taobao
Hbase optimization and apply summary in taobaomingjian deng
 
HDFS與MapReduce架構研討
HDFS與MapReduce架構研討HDFS與MapReduce架構研討
HDFS與MapReduce架構研討
Billy Yang
 
AVM虛擬量測需求與應用趨勢
AVM虛擬量測需求與應用趨勢AVM虛擬量測需求與應用趨勢
AVM虛擬量測需求與應用趨勢
CHENHuiMei
 
Search engine
Search engineSearch engine
Search engine
Samchu Li
 

Similar to 基因大数据分析入门 Slideshare (20)

Hadoop 介紹 20141024
Hadoop 介紹 20141024Hadoop 介紹 20141024
Hadoop 介紹 20141024
 
Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里Qcon2013 罗李 - hadoop在阿里
Qcon2013 罗李 - hadoop在阿里
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructure
 
Big Java, Big Data
Big Java, Big DataBig Java, Big Data
Big Java, Big Data
 
Oracle Golden Gate Introduction
Oracle Golden Gate IntroductionOracle Golden Gate Introduction
Oracle Golden Gate Introduction
 
Hadoop Big Data 成功案例分享
Hadoop Big Data 成功案例分享Hadoop Big Data 成功案例分享
Hadoop Big Data 成功案例分享
 
選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲選擇正確的Solution 來建置現代化的雲端資料倉儲
選擇正確的Solution 來建置現代化的雲端資料倉儲
 
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
淺談物聯網巨量資料挑戰 - Jazz 王耀聰 (2016/3/17 於鴻海內湖) 免費講座
 
Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势Hadoop yarn 基本架构和发展趋势
Hadoop yarn 基本架构和发展趋势
 
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China MobileCeph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
 
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China MobileCeph Day Beijing - Leverage Ceph for SDS in China Mobile
Ceph Day Beijing - Leverage Ceph for SDS in China Mobile
 
大规模数据处理
大规模数据处理大规模数据处理
大规模数据处理
 
大规模数据处理
大规模数据处理大规模数据处理
大规模数据处理
 
Android -汇博
Android -汇博Android -汇博
Android -汇博
 
Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)Big Data Projet Management the Body of Knowledge (BDPMBOK)
Big Data Projet Management the Body of Knowledge (BDPMBOK)
 
Hbase optimization and apply summary in taobao
Hbase optimization and apply summary in taobaoHbase optimization and apply summary in taobao
Hbase optimization and apply summary in taobao
 
HDFS與MapReduce架構研討
HDFS與MapReduce架構研討HDFS與MapReduce架構研討
HDFS與MapReduce架構研討
 
AVM虛擬量測需求與應用趨勢
AVM虛擬量測需求與應用趨勢AVM虛擬量測需求與應用趨勢
AVM虛擬量測需求與應用趨勢
 
Search engine
Search engineSearch engine
Search engine
 
Emc keynote 1130 1200
Emc keynote 1130 1200Emc keynote 1130 1200
Emc keynote 1130 1200
 

基因大数据分析入门 Slideshare