SlideShare a Scribd company logo
1 of 35
Download to read offline
Hadoop 101 for 
bioinformaticians 
Attila Csordas 
Anybody used Hadoop before?
Aim 
1 hour:! 
theoretical introduction -> be able to assess whether your 
problem fits MR/Hadoop! 
! 
practical session -> be able to start developing code! 
!
Hadoop & me 
mostly non-coding bioinformatician at PRIDE! 
Cloudera Certified Hadoop Developer! 
~1300 hadoop jobs ~ 4000 MR jobs! 
3+ yrs of operator experience!
Linear scalability 
Search time = f(job complexity) 
6+ days on a 4 core 
local machine! 
Db build time = f(# of peptides) 
# of proteins: 4 mill (quarter nr) 8 mill (half nr) 16 mill (nr) 
5000 spectra 
100,000 
spectra 
200,000 
spectra
Theoretical introduction 
• Big Data! 
• Data Operating System! 
• Hadoop 1.0! 
• MapReduce! 
• Hadoop 2.0
Big Data 
difficult to process on a single machine! 
Volume: low TB - low PB! 
Velocity: generation rate! 
Variety: tab separated, machine data, documents! 
biological repositories fit the bell: e.g.. PRIDE
Data Operating System 
features! components! 
distributed storage 
scalable resource management 
fault-tolerant processing engine 1 
redundant processing engine 2
Hadoop 1.0 
storage 
Hadoop Distributed 
File System! 
aka HDFS! 
redundant, reliable, 
distributed 
processing engine/ 
cluster resource 
management 
MapReduce! 
distributed, fault-tolerant 
resource 
management, 
scheduling & 
processing engine 
single use data platform
MapReduce 
programming model for large scale, distributed, fault 
tolerant, batch data processing 
execution framework, the “runtime” 
software implementation
Stateless algorithms 
output only depends on the current input but not on 
previous inputs 
dependent: Fibonacci series: F(k+2)=F(k+1)+F(k)
Parallelizable problems 
easy to separate to parallel tasks 
might need a to maintain a global, shared state
Embarassingly/pleasingly 
parallel problems 
easy to separate to parallel tasks 
no dependency or communication between those tasks
Functional programming 
roots 
higher-order functions that accept other functions 
as arguments 
! 
map: applies its argument f() to all elements of a 
list 
! 
fold: takes g() + initial value -> final value 
! 
sum of squares: map(x2), fold(+) + 0 as initial 
value
MapReduce 
map(key, value) : [(key, value)] 
! 
shuffle&sort: values grouped by key 
! 
reduce(key, iterator<value>) : [(key, value)]
Word Count 
Map(String docid, String text):! 
for each word w in text:! 
Emit(w, 1);! 
! 
Reduce(String term, Iterator<Int> values):! 
int sum = 0;! 
for each v in values:! 
sum += v;! 
Emit(term, value);! 
! 
source: Jimmy Lin and Chris Dryer: 
Data-Intensive Text Processing with MapReduce 
Morgan & Claypool Publishers, 2010.
Count amino acids in peptides 
Map(byte offset of the line, peptide sequence): 
for each amino acid in peptide: 
Emit(amino acid, 1); 
! 
Reduce(amino acid, Iterator<Int> values): 
int sum = 0; 
for each v in values: 
sum += v; 
Emit(amino acid, value);
Mapper Mapper Mapper Mapper Mapper 
Intermediates Intermediates Intermediates Intermediates Intermediates 
Partitioner Partitioner Partitioner Partitioner Partitioner 
Intermediates Intermediates Intermediates 
Reducer Reducer Reduce 
(combiners omitted here) 
Source: redrawn from a slide by Cloduera, cc-licensed 
source: Jimmy Lin and Chris Dryer: 
Data-Intensive Text Processing with MapReduce 
Morgan & Claypool Publishers, 2010.
k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 
map map map map 
a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 
combine combine combine combine 
a 1 b 2 c 9 a 5 c 2 b 7 c 8 
partition partition partition partition 
Shuffle and Sort: aggregate values by keys 
a 1 5 b 2 7 c 2 9 8 
reduce reduce reduce 
r1 s1 r2 s2 r3 s3 
3 6 8 
source: Jimmy Lin and Chris Dryer: 
Data-Intensive Text Processing with MapReduce 
Morgan & Claypool Publishers, 2010.
0 AGELTEDEVER 12 DTNGSQFFITTVK 
26 IEVEKPFAIAKE 
map map map 
A 1 G 1 E 1 D 1 T 1 N 1 I 1 E 1 G 1 
Shuffle & sort: aggregate values by keys 
A 1 D 1 E 1 1 G 1 1 I 1 N 1 T 1 
reduce reduce reduce 
E 2 G 2 N 1
Hadoop 1.0: single use 
Pig: scripting 4 Hadoop, Yahoo!, 2006 
Hive: SQL queries 4 Hadoop Facebook 
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); 
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
Hadoop 2.0 
processing engines: 
MapReduce, Tez, 
HBase, Storm, 
Giraph, Spark… 
batch, interactive, 
online, streaming, 
graph … 
cluster resource 
management 
YARN! 
distributed, fault-tolerant 
resource 
management, 
scheduling & 
processing engine 
storage HDFS2! 
redundant, reliable, 
distributed 
multi use data platform
Practical session 
Objective: up & running 
w/ Hadoop in 30 mins
requirements 
• java 
• intellij idea 
• maven
Outline 
• check out https://github.com/attilacsordas/ 
hadoop_introduction 
• elements of a MapReduce app, basic Hadoop API & 
data types 
• bioinformatics toy example: counting amino acids in 
sequences 
• executing a hadoop job locally 
• 2 more examples building on top of AminoAcidCounter
https://github.com/attilacsordas/hadoop_for_bioinformatics
MapReduce app 
components 
• driver 
• mapper 
• reducer 
repo: bioinformatics.hadoop.AminoAcidCounter.java
Inputformats 
specifies data passed to Mappers 
! 
specified in driver 
! 
determines input splits and RecordReaders extracting key-value 
pairs 
! 
default formats: TextInputFormat, KeyValueTextInputFormat, 
SequenceFileInputFormat … 
! 
, LongWritable …
Data types 
everything implements Writable, de/serialization 
! 
all keys are WritableComparable: sorting keys 
! 
box classes for primitive data types: Text, IntWritable, 
LongWritable …
bioinformatics toy example 
Our task is to prepare data to form a statistical model on what 
physico-chemico properties of the constituent amino acid 
residues are affecting the visibility of the high-flyer peptides for 
a mass spectrometer. 
! 
AGELTEDEVER 
QSVHLVENEIQASIDQIFSHLER 
DTNGSQFFITTVK 
IEVEKPFAIAKE 
repo: headpeptides.txt in input folder
run the code! 
local 
pseudodistributed 
cluster
Debugging 
do it locally if possible 
print statements 
! 
write to map output 
! 
mapred.map.child.log.level=DEBUG 
mapred.reduce.child.log.level=DEBUG 
-> syslog task logfile 
! 
running debuggers remotely is hard 
! 
JVM debugging options 
! 
task profilers
Counters 
track records for statistics, malformed records 
AminoAcidCounterwithMalformedCounter
• Task 1: modify mapper to count positions too: 
R_2 -> AminoAcidPositionCounter 
! 
• Task 2: count positions & normalise to peptide 
length -> 
AminoAcidPositionCounterNormalizedToPeptide 
Length
Run jar on the cluster 
set up an account on a hadoop cluster 
! 
move input data into HDFS 
! 
set mainClass in pom.xml 
! 
mvn clean install 
! 
cp jar from target to servers 
! 
ssh into hadoop 
! 
run hadoop command line
homework, next steps 
count all the 3 outputs for 1 input k,v pair 
! 
count all amino acid pairs 
! 
run FastaAminoAcidCounter & check FastaInputFormat 
! 
run the jar on a cluster 
! 
https://github.com/lintool/MapReduce-course-2013s/ 
tree/master/slides 
! 
n * (trial, error) -> success

More Related Content

What's hot

Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaGlenn K. Lockwood
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Introduction of R on Hadoop
Introduction of R on HadoopIntroduction of R on Hadoop
Introduction of R on HadoopChung-Tsai Su
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDataWorks Summit
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 

What's hot (20)

Hadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without JavaHadoop Streaming: Programming Hadoop without Java
Hadoop Streaming: Programming Hadoop without Java
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction of R on Hadoop
Introduction of R on HadoopIntroduction of R on Hadoop
Introduction of R on Hadoop
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Dynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache GiraphDynamic Draph / Iterative Computation on Apache Giraph
Dynamic Draph / Iterative Computation on Apache Giraph
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 

Similar to Hadoop 101 for bioinformaticians

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Streaming API, Spark and Ruby
Streaming API, Spark and RubyStreaming API, Spark and Ruby
Streaming API, Spark and RubyManohar Amrutkar
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"Giivee The
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processingjins0618
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
 

Similar to Hadoop 101 for bioinformaticians (20)

The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Streaming API, Spark and Ruby
Streaming API, Spark and RubyStreaming API, Spark and Ruby
Streaming API, Spark and Ruby
 
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"OCF.tw's talk about "Introduction to spark"
OCF.tw's talk about "Introduction to spark"
 
Hadoop
HadoopHadoop
Hadoop
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 

More from attilacsordas

Aging is agings: a recursive definition of biological aging
Aging is agings: a recursive definition of biological agingAging is agings: a recursive definition of biological aging
Aging is agings: a recursive definition of biological agingattilacsordas
 
Towards a consensus definition of biological aging
Towards a consensus definition of biological agingTowards a consensus definition of biological aging
Towards a consensus definition of biological agingattilacsordas
 
Aging vs agings: limits and consequences of biomedical definitions
Aging vs agings: limits and consequences of biomedical definitionsAging vs agings: limits and consequences of biomedical definitions
Aging vs agings: limits and consequences of biomedical definitionsattilacsordas
 
What is it like to be 572 year old?
What is it like to be 572 year old?What is it like to be 572 year old?
What is it like to be 572 year old?attilacsordas
 
Cell lineage trees and the limiting problem of comprehensive rejuvenation
Cell lineage trees and the limiting problem of comprehensive rejuvenationCell lineage trees and the limiting problem of comprehensive rejuvenation
Cell lineage trees and the limiting problem of comprehensive rejuvenationattilacsordas
 
The problematic openness behind the first capability concerning the end of a ...
The problematic openness behind the first capability concerning the end of a ...The problematic openness behind the first capability concerning the end of a ...
The problematic openness behind the first capability concerning the end of a ...attilacsordas
 
Open Lifespan and (not) knowing our age in Rawls’ Original Position
Open Lifespan and (not) knowing our age in Rawls’ Original PositionOpen Lifespan and (not) knowing our age in Rawls’ Original Position
Open Lifespan and (not) knowing our age in Rawls’ Original Positionattilacsordas
 
Pride quality controlattilacsordasbiocuration2012
Pride quality controlattilacsordasbiocuration2012Pride quality controlattilacsordasbiocuration2012
Pride quality controlattilacsordasbiocuration2012attilacsordas
 
Ultrcentifugation: Basic Training
Ultrcentifugation: Basic TrainingUltrcentifugation: Basic Training
Ultrcentifugation: Basic Trainingattilacsordas
 
Google's Palimpsest Project
Google's Palimpsest ProjectGoogle's Palimpsest Project
Google's Palimpsest Projectattilacsordas
 
SENS3: Stephen Coles on the Secrets of the Oldest Old
SENS3: Stephen Coles on the Secrets of the Oldest OldSENS3: Stephen Coles on the Secrets of the Oldest Old
SENS3: Stephen Coles on the Secrets of the Oldest Oldattilacsordas
 

More from attilacsordas (15)

Aging is agings: a recursive definition of biological aging
Aging is agings: a recursive definition of biological agingAging is agings: a recursive definition of biological aging
Aging is agings: a recursive definition of biological aging
 
Towards a consensus definition of biological aging
Towards a consensus definition of biological agingTowards a consensus definition of biological aging
Towards a consensus definition of biological aging
 
Aging vs agings: limits and consequences of biomedical definitions
Aging vs agings: limits and consequences of biomedical definitionsAging vs agings: limits and consequences of biomedical definitions
Aging vs agings: limits and consequences of biomedical definitions
 
What is it like to be 572 year old?
What is it like to be 572 year old?What is it like to be 572 year old?
What is it like to be 572 year old?
 
Cell lineage trees and the limiting problem of comprehensive rejuvenation
Cell lineage trees and the limiting problem of comprehensive rejuvenationCell lineage trees and the limiting problem of comprehensive rejuvenation
Cell lineage trees and the limiting problem of comprehensive rejuvenation
 
The problematic openness behind the first capability concerning the end of a ...
The problematic openness behind the first capability concerning the end of a ...The problematic openness behind the first capability concerning the end of a ...
The problematic openness behind the first capability concerning the end of a ...
 
Open Lifespan and (not) knowing our age in Rawls’ Original Position
Open Lifespan and (not) knowing our age in Rawls’ Original PositionOpen Lifespan and (not) knowing our age in Rawls’ Original Position
Open Lifespan and (not) knowing our age in Rawls’ Original Position
 
Pride quality controlattilacsordasbiocuration2012
Pride quality controlattilacsordasbiocuration2012Pride quality controlattilacsordasbiocuration2012
Pride quality controlattilacsordasbiocuration2012
 
Ultrcentifugation: Basic Training
Ultrcentifugation: Basic TrainingUltrcentifugation: Basic Training
Ultrcentifugation: Basic Training
 
Merry XOmas
Merry XOmasMerry XOmas
Merry XOmas
 
Google's Palimpsest Project
Google's Palimpsest ProjectGoogle's Palimpsest Project
Google's Palimpsest Project
 
LindaPowers onSENS3
LindaPowers onSENS3LindaPowers onSENS3
LindaPowers onSENS3
 
SENS3: Stephen Coles on the Secrets of the Oldest Old
SENS3: Stephen Coles on the Secrets of the Oldest OldSENS3: Stephen Coles on the Secrets of the Oldest Old
SENS3: Stephen Coles on the Secrets of the Oldest Old
 
SENS3: Michael Rose
SENS3: Michael RoseSENS3: Michael Rose
SENS3: Michael Rose
 
Microvesiclesslide
MicrovesiclesslideMicrovesiclesslide
Microvesiclesslide
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Hadoop 101 for bioinformaticians

  • 1. Hadoop 101 for bioinformaticians Attila Csordas Anybody used Hadoop before?
  • 2. Aim 1 hour:! theoretical introduction -> be able to assess whether your problem fits MR/Hadoop! ! practical session -> be able to start developing code! !
  • 3. Hadoop & me mostly non-coding bioinformatician at PRIDE! Cloudera Certified Hadoop Developer! ~1300 hadoop jobs ~ 4000 MR jobs! 3+ yrs of operator experience!
  • 4. Linear scalability Search time = f(job complexity) 6+ days on a 4 core local machine! Db build time = f(# of peptides) # of proteins: 4 mill (quarter nr) 8 mill (half nr) 16 mill (nr) 5000 spectra 100,000 spectra 200,000 spectra
  • 5. Theoretical introduction • Big Data! • Data Operating System! • Hadoop 1.0! • MapReduce! • Hadoop 2.0
  • 6. Big Data difficult to process on a single machine! Volume: low TB - low PB! Velocity: generation rate! Variety: tab separated, machine data, documents! biological repositories fit the bell: e.g.. PRIDE
  • 7. Data Operating System features! components! distributed storage scalable resource management fault-tolerant processing engine 1 redundant processing engine 2
  • 8. Hadoop 1.0 storage Hadoop Distributed File System! aka HDFS! redundant, reliable, distributed processing engine/ cluster resource management MapReduce! distributed, fault-tolerant resource management, scheduling & processing engine single use data platform
  • 9. MapReduce programming model for large scale, distributed, fault tolerant, batch data processing execution framework, the “runtime” software implementation
  • 10. Stateless algorithms output only depends on the current input but not on previous inputs dependent: Fibonacci series: F(k+2)=F(k+1)+F(k)
  • 11. Parallelizable problems easy to separate to parallel tasks might need a to maintain a global, shared state
  • 12. Embarassingly/pleasingly parallel problems easy to separate to parallel tasks no dependency or communication between those tasks
  • 13. Functional programming roots higher-order functions that accept other functions as arguments ! map: applies its argument f() to all elements of a list ! fold: takes g() + initial value -> final value ! sum of squares: map(x2), fold(+) + 0 as initial value
  • 14. MapReduce map(key, value) : [(key, value)] ! shuffle&sort: values grouped by key ! reduce(key, iterator<value>) : [(key, value)]
  • 15. Word Count Map(String docid, String text):! for each word w in text:! Emit(w, 1);! ! Reduce(String term, Iterator<Int> values):! int sum = 0;! for each v in values:! sum += v;! Emit(term, value);! ! source: Jimmy Lin and Chris Dryer: Data-Intensive Text Processing with MapReduce Morgan & Claypool Publishers, 2010.
  • 16. Count amino acids in peptides Map(byte offset of the line, peptide sequence): for each amino acid in peptide: Emit(amino acid, 1); ! Reduce(amino acid, Iterator<Int> values): int sum = 0; for each v in values: sum += v; Emit(amino acid, value);
  • 17. Mapper Mapper Mapper Mapper Mapper Intermediates Intermediates Intermediates Intermediates Intermediates Partitioner Partitioner Partitioner Partitioner Partitioner Intermediates Intermediates Intermediates Reducer Reducer Reduce (combiners omitted here) Source: redrawn from a slide by Cloduera, cc-licensed source: Jimmy Lin and Chris Dryer: Data-Intensive Text Processing with MapReduce Morgan & Claypool Publishers, 2010.
  • 18. k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 map map map map a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 combine combine combine combine a 1 b 2 c 9 a 5 c 2 b 7 c 8 partition partition partition partition Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c 2 9 8 reduce reduce reduce r1 s1 r2 s2 r3 s3 3 6 8 source: Jimmy Lin and Chris Dryer: Data-Intensive Text Processing with MapReduce Morgan & Claypool Publishers, 2010.
  • 19. 0 AGELTEDEVER 12 DTNGSQFFITTVK 26 IEVEKPFAIAKE map map map A 1 G 1 E 1 D 1 T 1 N 1 I 1 E 1 G 1 Shuffle & sort: aggregate values by keys A 1 D 1 E 1 1 G 1 1 I 1 N 1 T 1 reduce reduce reduce E 2 G 2 N 1
  • 20. Hadoop 1.0: single use Pig: scripting 4 Hadoop, Yahoo!, 2006 Hive: SQL queries 4 Hadoop Facebook hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  • 21. Hadoop 2.0 processing engines: MapReduce, Tez, HBase, Storm, Giraph, Spark… batch, interactive, online, streaming, graph … cluster resource management YARN! distributed, fault-tolerant resource management, scheduling & processing engine storage HDFS2! redundant, reliable, distributed multi use data platform
  • 22. Practical session Objective: up & running w/ Hadoop in 30 mins
  • 23. requirements • java • intellij idea • maven
  • 24. Outline • check out https://github.com/attilacsordas/ hadoop_introduction • elements of a MapReduce app, basic Hadoop API & data types • bioinformatics toy example: counting amino acids in sequences • executing a hadoop job locally • 2 more examples building on top of AminoAcidCounter
  • 26. MapReduce app components • driver • mapper • reducer repo: bioinformatics.hadoop.AminoAcidCounter.java
  • 27. Inputformats specifies data passed to Mappers ! specified in driver ! determines input splits and RecordReaders extracting key-value pairs ! default formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat … ! , LongWritable …
  • 28. Data types everything implements Writable, de/serialization ! all keys are WritableComparable: sorting keys ! box classes for primitive data types: Text, IntWritable, LongWritable …
  • 29. bioinformatics toy example Our task is to prepare data to form a statistical model on what physico-chemico properties of the constituent amino acid residues are affecting the visibility of the high-flyer peptides for a mass spectrometer. ! AGELTEDEVER QSVHLVENEIQASIDQIFSHLER DTNGSQFFITTVK IEVEKPFAIAKE repo: headpeptides.txt in input folder
  • 30. run the code! local pseudodistributed cluster
  • 31. Debugging do it locally if possible print statements ! write to map output ! mapred.map.child.log.level=DEBUG mapred.reduce.child.log.level=DEBUG -> syslog task logfile ! running debuggers remotely is hard ! JVM debugging options ! task profilers
  • 32. Counters track records for statistics, malformed records AminoAcidCounterwithMalformedCounter
  • 33. • Task 1: modify mapper to count positions too: R_2 -> AminoAcidPositionCounter ! • Task 2: count positions & normalise to peptide length -> AminoAcidPositionCounterNormalizedToPeptide Length
  • 34. Run jar on the cluster set up an account on a hadoop cluster ! move input data into HDFS ! set mainClass in pom.xml ! mvn clean install ! cp jar from target to servers ! ssh into hadoop ! run hadoop command line
  • 35. homework, next steps count all the 3 outputs for 1 input k,v pair ! count all amino acid pairs ! run FastaAminoAcidCounter & check FastaInputFormat ! run the jar on a cluster ! https://github.com/lintool/MapReduce-course-2013s/ tree/master/slides ! n * (trial, error) -> success