The document discusses high throughput screening (HTS) at the NIH Center for Translational Therapeutics (NCTT). It outlines the center's capabilities for small molecule and RNAi HTS, including assay formats, detection methods, quantitative high throughput screening (qHTS) to obtain dose-response curves, and associated bioinformatics activities such as automated curve fitting, data integration, and structure-activity relationship analysis. The goal is to enable discoveries through these HTS approaches and computational analyses.
We act as a consolidated CRO company, providing an extensive range of world-class services for drug discovery from our offices in seven locations in Northern Europe and India. We serve over 150 pharma and biotech customers in over 20 countries. We bring our expertise and experience from more than 600 engagements with customers ranging from virtual biotechs to several global top 10 pharmaceutical companies.
Presentation entitled "Hit Identification Strategies for Epigenetic Targets" at X-Gen Epigenetics IV, March 5-7, 2012. The presentation was delivered by Dr Amy Quinn, as a conflict prevented my attendance.
Presented by Dr. Miller at the 40th Annual Symposium "Diagnostic and Clinical Challenges of 20th Century Microbes", held on Nov 18, 2010 in Philadelphia.
This file includes the SLAS2013 presentations of Paul A. Johnston of University of Pittsburgh; Douglas Auld of Novartis Institutes for Biomedical Research; and Lisa Minor of In Vitro Strategies, LLC.
The design of chemical libraries is usually informed by pre-existing characteristics and desired features. On the other hand, assessing the prospective performance of a new library is more difficult. Importantly, a given screening library is often screened in a variety of systems, which can differ in cell lines, readouts, formats and so on. In this study we explore to what extent pre-existing libraries can shed light on the relation between library activity and assay features. Using an ontology such as the BAO, it is possible to construct a hierarchy of annotations associated with an assay. Based on this annotation hierarchy we can then ask how likely molecules associated with a specific annotation are to be identified as active. To allow generalization we consider substructural features, as represented by a structural key fingerprint, rather than whole molecules. We employ a Bayesian framework to quantify the association between a substructural feature and a given assay annotation, using a set of NCGC assays that have been annotated with BAO terms. We discuss our approach to training the Bayesian model and describe benchmarks that characterize model performance relative to the position of the annotation in the BAO hierarchy. Finally we discuss the role of this approach in a library design workflow that includes traditional design features such as chemical space coverage and physicochemical properties but also takes into account screening platform features.
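One minimal way to realize the Bayesian association described above is a Laplacian-corrected enrichment score, a common choice in screening informatics for relating a fingerprint bit to activity under an assay annotation. This is only a sketch: the correction strength, counts, and the 10% base rate are illustrative assumptions, not details from the study.

```java
// Sketch: Laplacian-corrected enrichment of a substructural feature
// (a fingerprint bit) among actives for a given assay annotation.
public class FeatureAnnotationScore {

    /**
     * Corrected estimate of how enriched actives are among molecules
     * bearing the feature, relative to the annotation's base rate of
     * activity. A score near 1 means the feature carries no information.
     *
     * @param active   actives (under this annotation) containing the feature
     * @param total    all molecules (under this annotation) containing the feature
     * @param baseRate overall fraction of actives for the annotation
     */
    public static double correctedScore(int active, int total, double baseRate) {
        double k = 1.0 / baseRate;                        // virtual samples (assumed choice)
        double p = (active + k * baseRate) / (total + k); // corrected P(active | feature)
        return p / baseRate;                              // enrichment over the base rate
    }

    public static void main(String[] args) {
        // Feature seen 40 times under an annotation with a 10% hit rate,
        // active 30 times: strongly enriched.
        System.out.println(correctedScore(30, 40, 0.10));
        // Unseen feature: shrinks to exactly 1.0 (no information).
        System.out.println(correctedScore(0, 0, 0.10));
    }
}
```

The correction pulls sparsely observed features toward a neutral score, so rare substructures do not dominate the ranking.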
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT
1. Enabling Discoveries at High Throughput: Small molecule and RNAi HTS at the NCTT
Rajarshi Guha
NIH Center for Translational Therapeutics
May 3, 2011
2. Outline
• Informatics for small molecule & RNAi screening
• HCA & automated decision making
– Pretty pictures can lead to more efficient screens
• Large scale cheminformatics
– We can do it, but do we need to?
3. NIH Chemical Genomics Center
• Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
– NCGC staffed with 90+ scientists – biologists, chemists, informaticians, engineers
– Post-doc program
• Mission
– MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
– Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
– New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
7. qHTS: High Throughput Dose Response
• Assay concentration ranges over 4 logs (high: ~100 μM)
• 1536-well plates, inter-plate dilution series
• Assay volumes 2 – 5 μL
• Automated concentration-response data collection, ~1 CRC/sec
• Informatics pipeline: automated curve fitting and classification, 300K samples
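The automated curve fitting mentioned above can be illustrated with a toy four-parameter Hill fit. The NCGC pipeline uses far more robust fitting and curve classification; the coarse grid search, fixed 0-100 response bounds, and parameter ranges here are purely illustrative assumptions.

```java
// Toy sketch of concentration-response (Hill) curve fitting via grid search.
public class HillFit {

    // Response at concentration c for a four-parameter Hill curve.
    public static double hill(double c, double bottom, double top,
                              double ac50, double slope) {
        return bottom + (top - bottom) / (1.0 + Math.pow(ac50 / c, slope));
    }

    // Grid-search the log10 AC50 and Hill slope minimizing squared error,
    // with bottom/top fixed at 0 and 100 (assumed normalized responses).
    public static double fitLogAc50(double[] conc, double[] resp) {
        double bestErr = Double.MAX_VALUE, bestLogAc50 = 0;
        for (double logAc50 = -9; logAc50 <= -4; logAc50 += 0.01) {
            for (double slope = 0.5; slope <= 3.0; slope += 0.25) {
                double err = 0;
                for (int i = 0; i < conc.length; i++) {
                    double d = resp[i]
                            - hill(conc[i], 0, 100, Math.pow(10, logAc50), slope);
                    err += d * d;
                }
                if (err < bestErr) { bestErr = err; bestLogAc50 = logAc50; }
            }
        }
        return bestLogAc50;
    }

    public static void main(String[] args) {
        // Synthetic 7-point titration: 100 μM down in 1:5 steps, true AC50 = 1 μM.
        double[] conc = new double[7], resp = new double[7];
        for (int i = 0; i < 7; i++) {
            conc[i] = 100e-6 / Math.pow(5, i);
            resp[i] = hill(conc[i], 0, 100, 1e-6, 1.0);
        }
        System.out.printf("fitted log AC50 = %.2f%n", fitLogAc50(conc, resp));
    }
}
```

A production fitter would estimate all four parameters with a nonlinear least-squares routine and then classify the curve; the grid search just keeps the sketch self-contained.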
8. Informatics Activities
• High throughput curve fitting
• Data integration, automated cherry picking
• SAR algorithms
– QSAR modeling
– Fragment based analysis
– Activity cliffs
• Tools – standardizer, tautomers, fragment activity browser, kinome browser and more
• RNAi hit selection, OTE analysis
• High content analysis
9. Kinome Navigator
• Browse kinase panel data
• Currently focused on the Abbott dataset
• View
– Fragments
– Target pairs
– Kinome overlay
http://tripod.nih.gov
10. Fragment Browser
• View activities on a fragment-wise basis
• Compare activity distributions by fragment
• Currently based around ChEMBL assays but users can browse their own compounds & activities
http://tripod.nih.gov
11. Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large activity changes represent steep slopes in the landscape
– But traditionally, QSAR assumes gentle slopes
– We can characterize the landscape using SALI
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
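The SALI index referenced above is, for a pair of molecules, the activity difference divided by one minus their structural similarity, so similar pairs with very different activities (cliffs) score high. A minimal sketch follows; the bit-set "fingerprints" are a toy stand-in for a real structural key fingerprint.

```java
// Sketch of the SALI (Structure-Activity Landscape Index) for a molecule pair.
import java.util.BitSet;

public class Sali {

    // Tanimoto similarity over bit-set fingerprints (toy illustration).
    public static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone(); and.and(b);
        BitSet or  = (BitSet) a.clone(); or.or(b);
        return or.cardinality() == 0 ? 1.0
                : (double) and.cardinality() / or.cardinality();
    }

    // SALI(i,j) = |A_i - A_j| / (1 - sim(i,j)); activities on a log scale
    // (e.g. pIC50). Identical structures (sim = 1) diverge and must be
    // filtered out by the caller.
    public static double sali(double actI, double actJ, double sim) {
        return Math.abs(actI - actJ) / (1.0 - sim);
    }

    public static void main(String[] args) {
        BitSet fpA = new BitSet(16);
        fpA.set(0); fpA.set(1); fpA.set(2); fpA.set(3);
        BitSet fpB = (BitSet) fpA.clone();
        fpB.set(4); // one extra feature: a small structural change

        double sim = tanimoto(fpA, fpB); // 4 shared bits / 5 total = 0.8
        // Similar structures, 2 log units apart in activity: a steep cliff.
        System.out.printf("sim=%.2f SALI=%.1f%n", sim, sali(7.5, 5.5, sim));
    }
}
```

Ranking all pairs by this value is what exposes the "steep slopes" in the landscape.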
12. What Can We Do With SALIs?
• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALIs give us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
13. Predicting the Landscape
• Rather than predicting activity directly, we can try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
– Observations are now pairs of molecules
(Figure: Original pIC50, RMSE = 0.97; SALI, AbsDiff, RMSE = 1.10; SALI, GeoMean, RMSE = 1.04)
Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
14. Data Integration
• It's nice to simplify data, but we can still be faced with a multitude of data types
• We want to explore these data in a linked fashion
• How we explore and what we explore is generally influenced by the task at hand
• At one point, make inferences over all the data
15. Data Integration
(Diagram: linking a user's network to a network of public data)
Content: drugs, compounds, scaffolds, assays, genes, targets, pathways, diseases, clinical trials, documents
Links: manually curated, or derived from algorithms
20. Going Beyond Exploration?
• Simply being able to explore data in an integrated manner is useful as an idea generator
• Can we integrate heterogeneous data types & sources to get a systems level view?
– Current research problem in genomics and systems biology
– Some attempts have been made to merge chemical data with other data types
Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
21. RNAi Facility Mission
• Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
• Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs
Range of assays: simple phenotypes (viability, cytotoxicity, oxidative stress, etc); pathway (reporter assays, e.g. luciferase, β-lactamase); complex phenotypes (high-content imaging, cell cycle, translocation, etc)
22. RNAi Effectors
RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
23. Issues Using RNAi Effectors
• RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
• RNAi effectors induce off-target effects!
24. Examples of Current Projects
• Protein Quality Control
• Poxvirus
• DNA Re-replication
• Respiratory Viruses
• Base Excision Repair
• Lysosomal Storage Disorders
• DNA Damage – ELG1 stabilization
• Parkinson's – Mitochondrial Quality Control
• Antioxidant Response
• Ewing's Sarcoma
• Hypoxia
• Drug Modifiers, Pancreatic Cancer
• TNFa Response
• Drug Modifiers, TOP1 Clinical Agents
• Interferon Response
• iPS to RPE
• Immunotoxin-Mediated Cell Death
26. RNAi Libraries
• Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene
• Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene
• Dharmacon Human Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools
• Duet Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
• Qiagen Human Druggable Genome Library, >7,000 genes, 4 unique siRNAs per gene
• Kinome Libraries, purchased from a number of vendors
• Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications
• Considerations are being made for additional species and shRNA resources
27. Druggable Genome Screening Campaign
• Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells)
• 85 genes were selected for follow-up through a variety of threshold-based selection schemes
• 27 genes were validated as confident hits using siRNAs from multiple vendors
• Significant enrichment for core NF-kB components
(Figure: percent reduction in NF-kB signal, Qiagen and Ambion siRNAs, for TNFα Receptor, IKKα, RELA, NEMO; pseudo-colored blue/green ratio normalized to plate median)
28. Druggable Genome Screening Campaign
Significant enrichment for proteins that form the 28S proteasome
(Figure: percent reduction in NF-kB signal for PSM genes, Qiagen and Ambion siRNAs; proteasome schematic with 19S regulator particles (RPN, RPT) and 20S core (α1-7, β1-7), after Murata et al, Nature Reviews Mol. Cell Biol.)
An additional 34 genes remain inconclusive, but are noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.
29. Seed Sequence Analysis
Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
30. Seed Sequence Analysis
Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.
31. RNAi & Small Molecule Screens
• What targets mediate activity of siRNA and compound
– Reuse pre-existing MLI data
– Develop new annotated libraries
• Pathway elucidation, identification of interactions
• Target ID and validation
– Run parallel RNAi screen
• Link RNAi generated pathway perturbations to small molecule activities; could provide insight into polypharmacology
Goal: develop a systems level view of small molecule activity
33. Merging Screening Technologies
• Lead identification
• High throughput screening: single (few) read outs; high-throughput; moderate data volumes
• High content screening: phenotypic profiling; multiple parameters; moderate throughput; very large data volumes
• We'd like to combine the technologies, to obtain rich high-resolution data at high speed
• Is this feasible? What are the trade-offs?
34. Merging Screening Technologies
• A simple solution is to run HTS & HCS as separate primary & secondary screens
• Alternatively – Wells to Cells
– Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
– Selective imaging of interesting wells
35. Wells to Cells Workflow
• Sequential qHTS using laser scanning cytometry followed by high-res microscopy
• Unit of work is a plate series
• The same aliquot is analyzed by both techniques
• A message based system
• The key is deciding which wells go through the workflow
36. Wells to Cells Assays
• Cell cycle, cell translocation, DNA rereplication
• All assays run against LOPAC1280
• Consistency between cytometry & microscopy is measured by the R2 between log AC50's
– Cell cycle, 0.94 – 0.96
– Cell translocation, 0.66 – 0.94
– DNA rereplication, still in progress
38. Informatics Platform
• InCell Layout File
• Advanced correction and normalization methods
• Sophisticated curve fitting algorithm
• Good performance, allows parallelization of the entire workflow
39. Why Messaging?
• A messaging architecture allows for significant flexibility
– Persistent, can be kept for process tracking, reporting
– Asynchronous, allows individual components of the workflow to proceed at their own pace
– Modular, new components can be introduced at any time without redesigning the whole workflow
• We employ Oracle AQ, but any message queue can be employed
40. Handling Multiple Platforms
• Current examples employ InCell hardware
• We also use Molecular Devices hardware
• As a result we have two orthogonal image stores / databases
• Need to integrate them
– Support seamless data browsing across multiple screens irrespective of imaging platform used
– Support analytics external to vendor code
41. A Unified Interface
• A client sees a single, simple interface to screening image data: http://host/rest/protocol/plate/well/image
• Transparently extract image data via the MetaXpress database or via custom code
• Currently the interface addresses image serving
• Unified metadata interface in the works
42. Trade-offs & Opportunities
• Automation reduces the ability to handle unforeseen errors
– Dispense errors and other plate problems
– Well selection based on curve classes may need to be modified on the fly
• Well selection does not consider SAR
– Wells are selected independently of each other
– If we could model SAR on the fly (or from validation screens), we'd select multiple wells, to obtain positive and negative results
43. Cloud Computing & Cheminformatics
• Cloud computing is a hot topic
• A number of examples of computational chemistry / cheminformatics on the cloud
– MolPlex, hBar, Numerate, Wingu, Sciligence, Pfizer
• Many examples use the cloud for remote storage, remote (hosted) computations
• But providers such as Amazon allow us to run distributed computing applications on the cloud
44. Map/Reduce
• Map/Reduce is a programming model for efficient distributed computing
• M/R was made "famous" by Google, but the idea has been around for a long time
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
Owen O'Malley, http://bit.ly/ecHPvB
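The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be sketched entirely in-memory, with no Hadoop involved, using the classic word-count example. This is only an illustration of the model's shape, not of Hadoop's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the Map/Reduce pipeline using word counting.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map: each input line is turned into (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                pairs.add(Map.entry(w, 1));

        // Shuffle & Sort: group values by key (a TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce: sum the 1's for each key to get the frequency.
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a")));
    }
}
```

In a real cluster, the map and reduce stages run on different machines and the shuffle moves data over the network; the per-key logic is exactly what this sketch shows.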
46. Hadoop & Cheminformatics
• Hadoop is an Open Source implementation of the map/reduce paradigm
• Hadoop is a framework for scalable, distributed computing
– Hadoop, HDFS, Hive, PIG
• Importantly, you can play with all this on your laptop and just copy files to the big cluster when you're ready for production
47. Why Hadoop?
• Simple way to make use of large clusters without MPI etc.
• AWS supports Hadoop, so it's easy to scale up to 100's or 1000's of cores
• Great for Java code, but non-Java code can also make use of Hadoop
• M/R can be applied to a lot of problems, but one of the simplest uses is as a "chunker"
48. Cheminformatics in Parallel
• Many cheminformatics problems are data parallel
– Chunk the data and apply the same technique over each chunk
• This makes many problems amenable to M/R
– Substructure / pharmacophore search
– Descriptor calculations, virtual screening
– Model development (?)
• In general, each chunk is processed on a distinct node – so the code itself can be non-parallel
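The chunking pattern above can be sketched without Hadoop: split the input into fixed-size chunks and run the same sequential function over each one. In the sketch below `parallelStream` stands in for distribution across nodes, and the per-molecule "computation" is a deliberately crude, made-up heavy-atom counter (it just counts uppercase letters, so it undercounts aromatic SMILES) — a placeholder for a real descriptor calculation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of data-parallel chunking: the per-chunk code is plain sequential
// Java; only the dispatch over chunks is parallel.
public class ChunkDemo {
    // Crude stand-in for a real per-molecule computation: counts uppercase
    // characters as "heavy atoms" (misses aromatic lowercase atoms).
    static int heavyAtomCount(String smiles) {
        int n = 0;
        for (char c : smiles.toCharArray())
            if (Character.isUpperCase(c)) n++;
        return n;
    }

    public static int totalHeavyAtoms(List<String> smiles, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < smiles.size(); i += chunkSize)
            chunks.add(smiles.subList(i, Math.min(i + chunkSize, smiles.size())));
        // Each chunk is processed independently, as if on a distinct node.
        return chunks.parallelStream()
                .mapToInt(ch -> ch.stream().mapToInt(ChunkDemo::heavyAtomCount).sum())
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(totalHeavyAtoms(List.of("CCO", "c1ccccc1", "CC(=O)O"), 2));
    }
}
```

Because chunks don't communicate, existing non-parallel toolkit code (e.g. a CDK descriptor calculator) can be dropped in unchanged.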
50. Substructure Searching
• Substructure searching is a trivial extension of atom counting
• If a structure matches, emit (name,1)
• Otherwise (name,0)
• Reducer simply outputs tuples of the form (name,1)

public class SubSearch {
    ...
    public static class MoleculeMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private Text matches = new Text();
        private String pattern;

        public void setup(Context context) {
            pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
        }

        public void map(Object key, Text value, Context context) throws
                IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                sqt.setSmarts(pattern);
                boolean matched = sqt.matches(molecule);
                matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                if (matched) context.write(matches, one);
                else context.write(matches, zero);
            } catch (CDKException e) {
                e.printStackTrace();
            }
        }
    }

    public static class SMARTSMatchReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                if (val.compareTo(one) == 0) {
                    result.set(1);
                    context.write(key, result);
                }
            }
        }
    }
}
51. Running on AWS
• All the code was debugged on my laptop with relatively small files
• To test the scalability, I shifted everything to AWS
– Pharmacophore search
– 136K structures, single conformer, 560MB
– Created a single JAR file with CDK & application code
– Uploaded data files to S3
• Total cost of experiments was ~ $10
52. But I Don't Want to Write Programs
• All these examples require us to write full-fledged Java classes
• An easier way is to use Pig & Pig Latin – a platform and query language built on top of Hadoop
• Lets us write SQL-like queries that make use of Hadoop underneath
• Flexible due to user-defined functions (UDF's)
– UDF's encapsulate the cheminformatics
53. Cheminformatics & Pig

A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

• Identify molecules in medium.smi that match the SMARTS pattern and dump them to output.txt
• The complexity is now hidden in the UDF
• Many toolkit functions could be wrapped as UDF's, allowing flexible queries with much simpler code
• See http://blog.rguha.net/?p=748 for the code
54. Latency
• Hadoop is suited for batch processing
• Significant network I/O is involved in distributing data to compute nodes
• Not good for
– Random ad hoc processing of small subsets
– Small volume data
– Real-time (low latency) work
• But latency issues can be addressed somewhat by HBase, Hive and other technologies
55. More than Chunking?
• But all the examples so far could have been done via PBS/Condor or any other job scheduler
– (With Hadoop we don't have to worry about explicit chunking of the input data)
• But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
– Predictive modeling?
– Graph algorithms?
56. More than Chunking?
• Both predictive & graph algorithms are increasingly supported in Hadoop
– Mahout for M/L algorithms on massive datasets
– Cloud9 for graph algorithms
• A number of bioinformatics applications make use of M/R at the algorithmic level
• They are all big applications
– Crossbow aligns 3 billion paired/unpaired reads
• Cheminformatics datasets are not very big
57. Summary
• HTS data is an ample playground for interesting analytics; multiple data types make it more fun
• A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
• Hadoop and M/R provide great opportunities for handling large data in a flexible manner
• But can cheminformatics really make use of it?
58. Acknowledgments
Informatics · RNAi & Small Molecule
• Ajit Jadhav • Scott Martin • Trung Nguyen • Pinar Tuzmen • Noel Southall • Yu-Chi Chen • Ruili Huang • Carleen Klump • Min Shen • Craig Thomas • Hongmao Sun • Jim Inglese • Xin Hu • Ron Johnson • Tongan Zhao • Sam Michael • Jennifer Wichterman
59.
60. Counting Atoms
• The canonical Hadoop program counts the frequency of words in a text file
– Mapper reads a line, outputs a tuple – (word, 1)
– Reducer will receive tuples, keyed on word
• Summing up the 1's gives us the frequency of a word
• By default, Hadoop works on a line-by-line basis
• For cheminformatics problems, SMILES files satisfy this requirement – one line, one molecule
61. Counting Atoms
• Uses the CDK to parse SMILES
• For each molecule, loop over atoms
– Emit (symbol,1)
• Reducer simply sums the 1's for each symbol

public class HeavyAtomCount {
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws
                IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                for (IAtom atom : molecule.atoms()) {
                    word.set(atom.getSymbol());
                    context.write(word, one);
                }
            } catch (InvalidSmilesException e) {
                // do nothing for now
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable,
            Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    ....
}
62. Multiline Records
• Lots of cheminformatics applications require 3D – SMILES won't do. Need to support SDF
• We implement a custom RecordReader to process SD files
• We're now ready to tackle pretty much most cheminformatics tasks
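The essential job of that custom RecordReader can be shown standalone: group lines into records, where an SD file record is terminated by a `$$$$` line. This sketch works on a list of lines in memory; the Hadoop RecordReader does the same thing against an HDFS input split (and must also handle splits that start mid-record, which is omitted here).

```java
import java.util.ArrayList;
import java.util.List;

// Splits SD-file lines into per-molecule records on the "$$$$" delimiter.
// The delimiter line itself is dropped from each record.
public class SdfSplitter {
    public static List<String> records(List<String> lines) {
        List<String> recs = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (String line : lines) {
            if (line.trim().equals("$$$$")) {
                recs.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(line).append('\n');
            }
        }
        return recs;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("mol1", "  atom block ...", "$$$$", "mol2", "$$$$");
        System.out.println(records(lines).size());
    }
}
```

Once records rather than lines reach the Mapper, the rest of the pipeline is unchanged from the SMILES examples.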
63. Why Hadoop?
• Java and C++ APIs
– In Java use Objects, while in C++ bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– M/R queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when possible
Owen O'Malley, http://bit.ly/ecHPvB