Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
BIG DATA: ANALYSING GENOMICS AND THE BDG PROJECT
- Dr. Bo Li
INTRODUCTION
Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use
• More digitalised methods are being developed
OVERVIEW of the Project
• We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• The project comprises a set of genomics-specific Avro schemas, Spark-based APIs, and
command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
HADOOP
• Free, Java-based programming framework
• Runs on clusters of thousands of nodes handling thousands of terabytes
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• Used by:
Google
Yahoo
IBM
• Scalable, cost-effective, flexible, fast, resilient to failure
Map Reduce
• A software framework for writing applications that process vast amounts of
data on large clusters reliably
• Basic concept:
  • Divide: the input dataset is split into chunks, which map tasks process
  in parallel.
  • Sort: the framework sorts the outputs of the map tasks.
  • Conquer: the sorted outputs are merged and given as input to the reduce tasks.
• Handles:
  • Scheduling
  • Data distribution
  • Synchronization
  • Errors and faults
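The divide, sort, and conquer steps can be sketched in plain Python. This is a single-process illustration, not Hadoop code: the function names, the chunk size, and the three toy reads are assumptions made for the example, and the "map tasks" run one after another rather than in parallel on a cluster.

```python
from itertools import groupby
from operator import itemgetter

def split_into_chunks(reads, chunk_size):
    """Divide: cut the input into fixed-size chunks, one per map task."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]

def map_task(chunk):
    """Map: emit a (base, 1) pair for every base in every read of the chunk."""
    return [(base, 1) for read in chunk for base in read]

def reduce_task(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)

def map_reduce(reads, chunk_size=2):
    # Divide + map. A real cluster would run these tasks in parallel;
    # here they run sequentially in one process.
    mapped = [pair
              for chunk in split_into_chunks(reads, chunk_size)
              for pair in map_task(chunk)]
    # Sort: bring equal keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Conquer: merge each key's values and hand them to a reduce task.
    return dict(reduce_task(key, [value for _, value in group])
                for key, group in groupby(mapped, key=itemgetter(0)))

# Toy input (hypothetical reads, not real sequencing data).
reads = ["ACGT", "AACC", "GGTA"]
base_counts = map_reduce(reads)
print(base_counts)
```

Scheduling, data distribution, synchronization, and fault handling are exactly the parts this sketch leaves out; those are what the framework supplies.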
TRANSCRIPTION FACTOR
• Also called a sequence-specific DNA-binding factor
• Controls the rate of transcription of genetic information from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
Data Types
GM12878 - Genetic variation studies
K562 - Erythropoiesis
HepG2 - Metabolism disorders
HEK293 - Embryonic kidney
H54 - Glioblastoma
BJ - Skin fibroblast
DECOUPLING STORAGE
• Bioinformaticians have their own specific file formats
Example:
  • .fasta
  • .sam
  • .gtf
  • .narrowpeak
  • .vcf etc.
• Accessing similar data across these file formats is difficult
• They are ASCII-encoded
• ASCII is inefficient
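To make the cost of ASCII concrete, the sketch below stores the same DNA sequence once as ASCII text (one byte per base) and once packed at two bits per base. The 2-bit code and the function names are assumptions made for this illustration; this is not the encoding used by Avro, Parquet, or ADAM, only a small demonstration of why a binary layout can be more compact than text.

```python
# Hypothetical 2-bit code for the four DNA bases (illustration only).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def pack_bases(sequence):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final group so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(group))
        packed.append(byte)
    return bytes(packed)

def unpack_bases(packed, length):
    """Recover the first `length` bases from bytes produced by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

sequence = "ACGTACGTAC" * 100            # 1,000 bases of toy data
ascii_bytes = sequence.encode("ascii")   # one byte per base
packed_bytes = pack_bases(sequence)      # four bases per byte

print(len(ascii_bytes), len(packed_bytes))
```

The packed form needs the sequence length stored alongside it, since the last byte may be padded; a schema-based format keeps that kind of metadata explicit.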
What is ADAM?
• An open source, high-performance, distributed platform for genomic
analysis
• ADAM defines:
  • A data schema and layout on disk
  • A Scala API
  • A command-line interface
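Avro schemas are written as JSON, so a data schema of the kind ADAM defines can be pictured as a record with named, typed fields. The record below is a hypothetical, heavily simplified example: the name `ExampleGenotype`, the namespace, and every field are assumptions made for illustration and are not ADAM's actual schema.

```python
import json

# Hypothetical, simplified record; not one of ADAM's real schemas.
GENOTYPE_SCHEMA = {
    "type": "record",
    "name": "ExampleGenotype",
    "namespace": "example.genomics",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "position", "type": "long"},
        {"name": "referenceAllele", "type": "string"},
        {"name": "alternateAllele", "type": "string"},
        # A union with "null" marks the field as optional.
        {"name": "sampleId", "type": ["null", "string"], "default": None},
    ],
}

# Serialize to the JSON text an Avro schema file would contain.
schema_json = json.dumps(GENOTYPE_SCHEMA, indent=2)
field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
print(schema_json)
```

Because the schema is separate from the bytes on disk, the same records can be read from Scala, Java, or Python without each tool parsing its own text format.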
SOFTWARE USED
• VMware version 5.5 – Cloudera
• Java version 1.8
• Tool: ADAM
• Apache Avro
• Spark
WHY SPARK?
• An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data is kept in memory unless inter-node movement is needed
• Presents a functional programming API, along with support for
iterative programming
• Used at scale on clusters with >2k nodes, 4TB datasets
• Current leading map-reduce framework:
  • First in-memory map-reduce platform
  • Used at scale in industry, supported in major distros:
    • Cloudera
    • HortonWorks
    • MapR
• The API:
  • Fully functional API
  • Main API in Scala, also supports Java, Python, R
  • Manages failures
MAP REDUCE vs SPARK
SPARK
• Open source
• In memory, on disk
• Can be written in Scala
• API: Scala, Java, Python
• Easy to program
• Doesn’t need abstractions
• Fewer security features compared to Map Reduce
MAP REDUCE
• Open source
• On disk
• Can be written in Java
• API: Java, Python, Scala
• Difficult to program
• Needs abstractions
• More security features
Querying Genotypes from the 1000 Genomes Project
Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in flight
• Run an ADAM job to convert the data to Parquet
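The "unzip in flight" step can be sketched as a streaming copy that decompresses each chunk as it arrives, so the compressed file never has to be stored first. This is an illustration under stated assumptions: an in-memory gzip blob stands in for the remote download, a `BytesIO` buffer stands in for the HDFS output stream, the VCF-like text is toy data, and the function name is hypothetical. It does not touch HDFS or ADAM.

```python
import gzip
import io
import zlib

def gunzip_in_flight(source, sink, chunk_size=64 * 1024):
    """Copy a gzip stream from source to sink, decompressing chunk by chunk."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(decompressor.decompress(chunk))
    # Write out anything still buffered inside the decompressor.
    sink.write(decompressor.flush())

# Stand-ins (assumptions): toy VCF-like text, an in-memory "download",
# and an in-memory "HDFS file".
vcf_text = b"##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t10177\t.\tA\tAC\n"
remote_stream = io.BytesIO(gzip.compress(vcf_text))
hdfs_stand_in = io.BytesIO()

# A tiny chunk size forces several read/decompress/write rounds.
gunzip_in_flight(remote_stream, hdfs_stand_in, chunk_size=16)
restored = hdfs_stand_in.getvalue()
print(restored.decode("ascii"))
```

In the real workflow the sink would be a file in HDFS, and the conversion to Parquet would be a separate ADAM job run afterwards.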
Building ADAM
Building Spark