Big data analysing genomics and the bdg project

Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
BIG DATA: ANALYSING GENOMICS AND THE BDG PROJECT
- Dr. Bo Li

Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use on such data
• More digitalised tools are being developed
INTRODUCTION

• We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• The project comprises a set of genomics-specific Avro schemas, Spark-based APIs, and
command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
OVERVIEW of the Project

• Free, Java-based programming framework
• Runs on clusters of thousands of nodes and handles thousands of terabytes of data
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• Used by:
  • Google
  • Yahoo
  • IBM
• Scalable, cost-effective, flexible, fast, and resilient to failure
HADOOP

• A software framework for writing applications that reliably process vast amounts of
data on large clusters
• Basic concept:
  • Divide: the input data set is split into chunks, which map tasks process in parallel
  • Sort: the framework sorts the outputs of the map tasks
  • Conquer: the sorted outputs are merged and passed as input to the reduce tasks
• The framework handles:
  • Scheduling
  • Data distribution
  • Synchronization
  • Errors and faults
Map Reduce
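
The divide, sort, and conquer steps above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`split_into_chunks`, `map_task`, `shuffle_and_sort`, `reduce_task`, `count_bases`) and the example reads are assumptions made for the sketch, and a real cluster would run the map tasks in parallel on different nodes.

```python
from itertools import groupby
from operator import itemgetter


def split_into_chunks(records, chunk_size):
    """Divide: cut the input into chunks, one per (hypothetical) map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_task(chunk):
    """Map: emit a (base, 1) pair for every base in every read of the chunk."""
    return [(base, 1) for read in chunk for base in read]


def shuffle_and_sort(mapped_outputs):
    """Sort: collect all pairs, sort them by key, and group the values per key."""
    all_pairs = sorted(
        (pair for output in mapped_outputs for pair in output),
        key=itemgetter(0),
    )
    return {
        key: [value for _, value in group]
        for key, group in groupby(all_pairs, key=itemgetter(0))
    }


def reduce_task(key, values):
    """Conquer: merge all values that share a key into a single count."""
    return key, sum(values)


def count_bases(reads, chunk_size=2):
    """Run the three phases in sequence and return a base -> count mapping."""
    chunks = split_into_chunks(reads, chunk_size)
    mapped = [map_task(chunk) for chunk in chunks]  # sequential here, parallel on a cluster
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_bases(["ACGT", "AACC", "GGTA"]))
```

The result does not depend on `chunk_size`, which is the property that lets the framework choose how to distribute the chunks.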

• Also called a sequence-specific DNA-binding factor
• Controls the rate of transcription of genetic information from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
TRANSCRIPTION FACTOR

GM12878 - Genetic variation studies
K562 - Erythropoiesis
HepG2 - Metabolism disorders
HEK293 - Embryonic kidney
H54 - Glioblastoma
BJ - Skin fibroblast
Data Types

• Bioinformaticians have their own specific file formats, for example:
  • .fasta
  • .sam
  • .gtf
  • .narrowPeak
  • .vcf, etc.
• Different file formats often hold similar data, which makes access difficult
• These formats are ASCII-encoded
• ASCII is an inefficient encoding for large data sets
DECOUPLING STORAGE
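
To make the inefficiency of ASCII concrete, the sketch below packs a DNA string at two bits per base instead of one byte per base. It is only an illustration of the size difference: the 2-bit scheme and the names `pack_sequence` and `unpack_sequence` are assumptions made for this example and are not the encoding used by ADAM, which relies on Avro schemas and Parquet files. The sketch also assumes the sequence contains only A, C, G, and T.

```python
# Hypothetical 2-bit code for the four bases (ambiguity codes such as N are not handled).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_sequence(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        packed.append(byte)
    return bytes(packed)


def unpack_sequence(packed, length):
    """Reverse pack_sequence; `length` trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    sequence = "ACGTACGTAC"
    ascii_size = len(sequence.encode("ascii"))
    packed = pack_sequence(sequence)
    print(ascii_size, len(packed))
    assert unpack_sequence(packed, len(sequence)) == sequence
```

The original length has to be stored alongside the packed bytes, which is one reason binary formats carry an explicit schema.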

• An open-source, high-performance, distributed platform for genomic
analysis
• ADAM defines:
  • A data schema and layout on disk
  • A Scala API
  • A command-line interface
What is ADAM?

• VMware version 5.5 (Cloudera)
• Java version 1.8
• Tool: ADAM
• Apache Avro
• Spark
SOFTWARE USED

• An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop MapReduce
• Data is kept in memory unless inter-node movement is
needed
• Presents a functional programming API, along with support for
iterative programming
• Used at scale on clusters with more than 2,000 nodes and on 4 TB data sets

• Current leading map-reduce framework:
  • First in-memory map-reduce platform
  • Used at scale in industry and supported in major distributions:
    • Cloudera
    • Hortonworks
    • MapR
• The API:
  • Fully functional API
  • Main API in Scala; Java, Python, and R are also supported
  • Manages failures
WHY SPARK?

SPARK
• Open source
• Processes data in memory and on disk
• Written in Scala
• APIs: Scala, Java, Python
• Easy to program
• Does not need additional abstractions
• Fewer security features than MapReduce
MAP REDUCE
• Open source
• Processes data on disk
• Written in Java
• APIs: Java, Python, Scala
• Difficult to program
• Needs abstractions
• More security features
MAP REDUCE vs SPARK

Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in-flight
• Run an ADAM job to convert the data to Parquet
Querying Genotypes from the 1000 Genomes Project
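
The "unzip in-flight" step can be sketched as streaming decompression: the compressed bytes are decompressed chunk by chunk while they are copied, so the compressed file never has to be stored first. The sketch below is a minimal illustration using only the Python standard library. The function name `stream_gunzip` is hypothetical, and in-memory buffers stand in for the remote download and the HDFS destination, both of which are assumptions of this example. The ADAM conversion to Parquet is not shown.

```python
import gzip
import io
import shutil


def stream_gunzip(compressed_source, destination, chunk_size=64 * 1024):
    """Decompress a gzip byte stream into `destination`, one chunk at a time."""
    with gzip.GzipFile(fileobj=compressed_source, mode="rb") as decompressed:
        shutil.copyfileobj(decompressed, destination, chunk_size)


if __name__ == "__main__":
    # Stand-in data: a tiny VCF-like header instead of a real 1000 Genomes file.
    vcf_text = b"##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n"
    source = io.BytesIO(gzip.compress(vcf_text))  # stand-in for the download stream
    sink = io.BytesIO()                           # stand-in for the HDFS output stream
    stream_gunzip(source, sink)
    assert sink.getvalue() == vcf_text
```

Because both arguments are ordinary file-like objects, the same function would work with any readable source and writable sink that follow Python's binary stream interface.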
