Big data analysing genomics and the bdg project

sree navya

big data in genomics

Data & Analytics

Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
ANALYSING GENOMICS AND
THE BDG PROJECT
BIG DATA
- Dr. Bo Li

Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use
• More digitalised methods are being developed
INTRODUCTION

• We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• The project comprises a set of genomics-specific Avro schemas, Spark-based APIs, and
command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
OVERVIEW of the Project

• Free, Java-based programming framework
• Runs on clusters of thousands of nodes handling thousands of terabytes
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• This framework is used by
Google
Yahoo
IBM
• Scalable, cost effective, flexible, fast, resilient to failure
HADOOP

• A software framework for writing applications that reliably process vast amounts of
data on large clusters
• Basic concept:
• Divide - splits the input dataset into chunks, which map tasks process
in parallel.
• Sort - the framework sorts the outputs of the map tasks.
• Conquer - the sorted outputs are merged and given as input to the reduce tasks.
• Handles
• Scheduling
• Data distribution
• Synchronization
• Errors and faults
Map Reduce
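
The divide, sort and conquer steps can be sketched in a few lines of plain Python. This is only an illustration on a single machine, not Hadoop's API: the function names, the k-mer counting task and the sample reads are all assumptions made for the example, and the map tasks run sequentially here where a cluster would run them in parallel.

```python
from itertools import groupby
from operator import itemgetter


def split_into_chunks(reads, chunk_size):
    """Divide: split the input into fixed-size chunks."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def map_task(chunk, k):
    """Map: emit a (k-mer, 1) pair for every k-mer in every read of a chunk."""
    pairs = []
    for read in chunk:
        for i in range(len(read) - k + 1):
            pairs.append((read[i:i + k], 1))
    return pairs


def shuffle_and_sort(mapped_chunks):
    """Sort: merge all map outputs and order them by key."""
    all_pairs = [pair for chunk in mapped_chunks for pair in chunk]
    return sorted(all_pairs, key=itemgetter(0))


def reduce_task(sorted_pairs):
    """Conquer: sum the counts that share a key."""
    return {key: sum(count for _, count in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}


def count_kmers(reads, k, chunk_size=2):
    chunks = split_into_chunks(reads, chunk_size)
    # Sequential here; on a cluster each chunk would go to its own map task.
    mapped = [map_task(chunk, k) for chunk in chunks]
    return reduce_task(shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA", "GTAC"], k=2))
```

Scheduling, data distribution, synchronization and fault handling are exactly the parts this sketch leaves out, which is what the framework provides.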

• Also called a sequence-specific DNA-binding factor
• Controls the rate of transcription of genetic information from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
TRANSCRIPTION FACTOR

GM12878 - Genetic variation studies
K562 - Erythropoiesis
HepG2 - Metabolism disorders
HEK293 - Embryonic kidney
H54 - Glioblastoma
BJ - Skin fibroblast
Data Types

• Bioinformaticians have their own specific file formats
Example:
• .fasta
• .sam
• .gtf
• .narrowPeak
• .vcf, etc.
• Accessing similar data stored in different file formats is difficult
• These formats are ASCII encoded
• ASCII is inefficient
DECOUPLING STORAGE
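
One way to see why ASCII is inefficient for sequence data: a text file spends 8 bits on each base, while four possible symbols need only 2 bits. The sketch below packs four bases into one byte. It is an illustration under simple assumptions (only A, C, G and T occur; the function names are made up here) and is not how ADAM or any particular file format stores its data.

```python
def pack_bases(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | codes[base]
        # Left-align a final partial group so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(group))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Reverse pack_bases; `length` trims the padding in the last byte."""
    bases = "ACGT"
    out = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            out.append(bases[(byte >> shift) & 0b11])
    return "".join(out[:length])


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    packed = pack_bases(seq)
    print(len(seq.encode("ascii")), "bytes as ASCII,", len(packed), "bytes packed")
    print(unpack_bases(packed, len(seq)) == seq)
```

Real formats also have to carry ambiguous bases, quality scores and metadata, so the saving in practice comes from a schema-aware binary layout rather than from bit packing alone.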

• An open-source, high-performance, distributed platform for genomic
analysis
• ADAM defines:
• A data schema and layout on disk
• A Scala API
• A command-line interface
What is ADAM?

• VMware version 5.5 – Cloudera
• Java version 1.8
• Tool: ADAM
• Apache Avro
• Spark
SOFTWARE USED

• An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop MapReduce
• Data is kept in memory unless inter-node movement is
needed
• Presents a functional programming API, along with support for
iterative programming.
• Used at scale on clusters with >2k nodes, 4TB datasets
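
Why keeping data in memory helps iterative jobs can be shown with a toy model. The code below is not Spark's API: `DiskDataset` is a hypothetical stand-in that simply counts how often the data is fetched from storage. An iterative job that reloads its input on every pass pays that cost each time, while one that holds the data in memory pays it once.

```python
class DiskDataset:
    """Stand-in for a dataset that lives on storage; counts every load."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1  # one simulated trip to storage
        return list(self._values)


def iterate_without_cache(dataset, iterations):
    """Reload the input on every pass, as a chain of on-disk jobs would."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.load())
    return total


def iterate_with_cache(dataset, iterations):
    """Load once, then reuse the in-memory copy on every pass."""
    cached = dataset.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


if __name__ == "__main__":
    on_disk = DiskDataset([1, 2, 3])
    in_memory = DiskDataset([1, 2, 3])
    print(iterate_without_cache(on_disk, 5), on_disk.reads)
    print(iterate_with_cache(in_memory, 5), in_memory.reads)
```

Both functions compute the same result; only the number of loads differs, and that difference grows with the number of iterations.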

• Current leading map-reduce framework:
• First in-memory map-reduce platform
• Used at scale in industry, supported in major distros
• Cloudera
• HortonWorks
• MapR
• The API:
• Fully functional API
• Main API in Scala, also supports Java, Python, R
• Manages failures
WHY SPARK?

SPARK
• Open source
• In memory, and on disk when needed
• Written in Scala
• API: Scala, Java, Python
• Easy to program
• Doesn’t need abstractions
• Fewer security features than MapReduce
MAP REDUCE
• Open source
• On disk
• Written in Java
• API: Java, Python, Scala
• Difficult to program
• Needs abstractions
• More security features
MAP REDUCE vs SPARK

Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in-flight
• Run an ADAM job to convert the data to Parquet
Querying Genotypes from the 1000
Genomes Project
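
The "unzipping in-flight" step means decompressing the data as it streams past instead of first writing an uncompressed copy. A minimal sketch of that idea in plain Python follows, assuming a gzip-compressed VCF stream. It does not cover the HDFS download or the ADAM conversion to Parquet, and the function name and the tiny two-sample VCF in the demo are invented for illustration.

```python
import gzip
import io
from collections import Counter


def count_genotypes(compressed_stream):
    """Decompress a gzipped VCF stream line by line and tally genotype calls."""
    counts = Counter()
    with gzip.open(compressed_stream, mode="rt", encoding="ascii") as lines:
        for line in lines:
            if line.startswith("#"):  # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            # Column 9 (index 8) is FORMAT; sample columns follow it.
            gt_index = fields[8].split(":").index("GT")
            for sample in fields[9:]:
                counts[sample.split(":")[gt_index]] += 1
    return counts


if __name__ == "__main__":
    # Made-up example data: two variants, two samples.
    vcf_text = (
        "##fileformat=VCFv4.1\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
        "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\n"
        "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t1|1\n"
    )
    stream = io.BytesIO(gzip.compress(vcf_text.encode("ascii")))
    print(count_genotypes(stream))
```

Because the file is read one line at a time, memory use stays flat regardless of how large the compressed input is.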

15. Building ADAM

16. Building Spark
