1. Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
ANALYSING GENOMICS AND
THE BDG PROJECT
BIG DATA
- Dr. Bo Li
2. Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use
• More digitalised tools are being developed
INTRODUCTION
3. • We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• ADAM comprises a set of genomics-specific Avro schemas, Spark-based APIs, and
command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
OVERVIEW of the Project
4. • Free, Java-based programming framework
• Runs on clusters of thousands of nodes, handling thousands of terabytes
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• This framework is used by:
Google
Yahoo
IBM
• Scalable, cost-effective, flexible, fast, resilient to failure
HADOOP
5. A software framework for writing applications that process vast amounts of
data on large clusters reliably
Basic concept:
Divide - The input dataset is split into chunks, which map tasks process
in parallel.
Sort - The framework sorts the outputs of the map tasks.
Conquer - The sorted outputs are merged and given as input to the reduce tasks.
The framework handles:
Scheduling
Data distribution
Synchronization
Errors and faults
Map Reduce
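The divide, sort, and conquer steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's API: the function names (`map_task`, `shuffle_and_sort`, `reduce_task`) and the three input chunks are made up for the example, which counts nucleotides.

```python
from collections import defaultdict
from itertools import chain


def map_task(chunk):
    """Map: emit a (base, 1) pair for every nucleotide in one input chunk."""
    return [(base, 1) for base in chunk]


def shuffle_and_sort(mapped_pairs):
    """Sort: group the intermediate pairs by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: merge all values that share a key into one result."""
    return key, sum(values)


def map_reduce(chunks):
    # Each chunk could be mapped on a different node; here they run in turn.
    mapped = chain.from_iterable(map_task(chunk) for chunk in chunks)
    return dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))


chunks = ["ACGT", "AACC", "GGTA"]  # made-up input, already divided into chunks
base_counts = map_reduce(chunks)
print(base_counts)
```

In a real cluster the framework, not the user code, would take care of scheduling the map tasks, moving the intermediate pairs between nodes, and retrying failed tasks.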
6. • Also called a sequence-specific DNA-binding factor
• Controls the rate at which genetic information is transcribed from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
TRANSCRIPTION FACTOR
8. Bioinformaticians have their own specific file formats
Example:
.fasta
.sam
.gtf
.narrowpeak
.vcf etc.
Accessing similar data across these file formats is difficult
They are ASCII-encoded
ASCII – inefficient!
DECOUPLING STORAGE
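A rough way to see why ASCII encoding is wasteful is to store the same record as text and as packed binary. The record layout below (chromosome number, position, quality) is hypothetical and chosen only for illustration; it is not the layout of SAM, VCF, or any Avro schema.

```python
import struct

# Hypothetical record: (chromosome number, position, mapping quality).
record = (1, 123456789, 60)

# Text encoding, similar in spirit to one tab-separated line of an ASCII format.
ascii_bytes = "chr{}\t{}\t{}\n".format(*record).encode("ascii")

# Binary encoding: 1-byte chromosome, 4-byte unsigned position, 1-byte quality.
RECORD_LAYOUT = struct.Struct("<BIB")
binary_bytes = RECORD_LAYOUT.pack(*record)

# The binary form decodes back to the same values without any string parsing.
decoded = RECORD_LAYOUT.unpack(binary_bytes)

print(len(ascii_bytes), len(binary_bytes))  # prints "18 6"
```

The text line takes three times the space of the packed record here, and reading it back requires splitting and parsing strings, while the binary form is decoded against a fixed, declared layout.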
9. An open-source, high-performance, distributed platform for genomic
analysis
ADAM defines:
A data schema and layout on disk
A Scala API
A command-line interface
What is ADAM?
10. VMware version: 5.5 – Cloudera
Java version 1.8
Tool: ADAM
Apache Avro
Spark
SOFTWARE USED
11. • An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data is kept in memory unless inter-node movement is
needed
• Presents a functional programming API, along with support for
iterative programming.
• Used at scale on clusters with >2k nodes, 4TB datasets
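The benefit for iterative jobs can be illustrated without Spark itself. The plain-Python sketch below compares re-reading the input on every iteration with reading it once and iterating over the in-memory copy. `load_dataset` is a hypothetical stand-in for an expensive read from disk, and `refine` is an arbitrary toy update step; neither is part of any Spark API.

```python
from functools import reduce

load_count = 0


def load_dataset():
    """Hypothetical stand-in for an expensive read from disk."""
    global load_count
    load_count += 1
    return [2.0, 4.0, 6.0, 8.0]


def refine(estimate, data):
    """One iteration: move the estimate halfway toward the data mean."""
    mean = reduce(lambda a, b: a + b, data) / len(data)
    return estimate + 0.5 * (mean - estimate)


# Disk-based style: every iteration re-reads its input.
estimate_disk = 0.0
for _ in range(5):
    estimate_disk = refine(estimate_disk, load_dataset())
loads_disk_style = load_count

# In-memory style: read once, keep the data cached, iterate over the cache.
load_count = 0
cached = load_dataset()
estimate_mem = 0.0
for _ in range(5):
    estimate_mem = refine(estimate_mem, cached)
loads_memory_style = load_count

print(loads_disk_style, loads_memory_style)
```

Both loops reach the same estimate, but the first performs five loads and the second only one; on a cluster that difference is repeated disk and network I/O.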
12. Current leading map-reduce framework:
• First in-memory map-reduce platform
• Used at scale in industry, supported in major distros
Cloudera
HortonWorks
MapR
The API:
• Fully functional API
• Main API in Scala, also supports Java, Python, R
• Manages failures
WHY SPARK?
13. SPARK
• Open source
• In memory, on disk
• Can be written in Scala
• API: Scala, Java, Python
• Easy to program
• Doesn’t need abstractions
• Fewer security features than MapReduce
MAP REDUCE
• Open source
• On-disk
• Can be written in Java
• API: Java, Python, Scala
• Difficult to program
• Needs abstractions
• More security features
MAP REDUCE vs SPARK
14. Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in-flight
• Run an ADAM job to convert the data to Parquet
Querying Genotypes from the 1000
Genomes Project
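The "unzipping in-flight" step means decompressing the data as it streams through, without first writing the compressed file to disk. A minimal Python sketch of that idea follows. The tiny VCF-like payload is made up, an in-memory buffer stands in for the download stream, and `iter_variant_lines` is a hypothetical helper; this is not the ADAM or HDFS tooling.

```python
import gzip
import io

# Made-up, tiny VCF-like payload standing in for a downloaded .vcf.gz file.
raw_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t10177\t.\tA\tAC\n"
    "1\t10352\t.\tT\tTA\n"
)
compressed_stream = io.BytesIO(gzip.compress(raw_vcf.encode("ascii")))


def iter_variant_lines(byte_stream):
    """Decompress a gzip byte stream lazily and yield the non-header lines."""
    with gzip.open(byte_stream, mode="rt", encoding="ascii") as text:
        for line in text:
            if not line.startswith("#"):
                yield line.rstrip("\n")


# Records are split into fields as they come out of the decompressor.
variants = [line.split("\t") for line in iter_variant_lines(compressed_stream)]
print(variants)
```

Because the generator yields one line at a time, the full uncompressed file never has to be held in memory or written to local disk before being passed on to the next stage.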