Cloudy with a Touch of Cheminformatics

Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science

Chemaxon UGM
September 26th, 2012
Wellesley, MA
Parallel computing in the cloud

•  Modern cloud vendors make provisioning compute resources easy
   –  Allows one to handle unpredictable loads easily
   –  Pay only for what you need
•  Chemistry applications don't usually have very dynamic loads
•  But large-scale resources are an opportunity for large-scale (parallel) computations
All HPC is not equal

Legacy HPC
•  Use cloud resources in the same way as a local cluster
•  MIT StarCluster makes this easy to do

Cloudy HPC
•  Make use of cloud capabilities
•  Old algorithms, new infrastructure
•  Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC
•  Huge datasets
•  Candidates for map-reduce
•  Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
  
Big data & cheminformatics

•  Computation over large chemical databases
   –  PubChem, ChEMBL, GDB-13, …
•  What types of computations?
   –  Searches (substructure, pharmacophore, …)
   –  QSAR models & predictions over large data
•  Fundamentally, "big chemical data" lets us explore larger chemical spaces
Map-Reduce

[Figure: input splits feed parallel map tasks; intermediate output is copied,
sorted, and merged before reduce tasks write the final output parts.]

   map:    (K1, V1) → list(K2, V2)
   reduce: (K2, list(V2)) → list(K3, V3)

Tom White, Hadoop: The Definitive Guide, 3rd Ed., O'Reilly
  
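The two signatures can be simulated in plain Java without any Hadoop at all, which makes the data flow easy to see. This is only a sketch: `MiniMapReduce` and its method names are hypothetical, and the "shuffle" is just in-memory grouping.

```java
import java.util.*;
import java.util.function.BiFunction;

// In-memory sketch of the two map-reduce signatures:
//   map:    (K1, V1)       -> list(K2, V2)
//   reduce: (K2, list(V2)) -> V3
public class MiniMapReduce {
    public static <K1, V1, K2, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, V3> reduceFn) {
        // map phase: emit intermediate (K2, V2) pairs for every input record
        Map<K2, List<V2>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> rec : input.entrySet()) {
            for (Map.Entry<K2, V2> emitted : mapFn.apply(rec.getKey(), rec.getValue())) {
                // "shuffle": group intermediate values by key
                grouped.computeIfAbsent(emitted.getKey(), k -> new ArrayList<>())
                       .add(emitted.getValue());
            }
        }
        // reduce phase: fold each (K2, list(V2)) down to a single V3
        Map<K2, V3> result = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet())
            result.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // classic word count expressed in this shape
        Map<Integer, String> lines = new LinkedHashMap<>();
        lines.put(1, "the cat sat");
        lines.put(2, "the cat");
        Map<String, Integer> counts = run(lines,
                (k, line) -> {
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    for (String w : line.split(" ")) out.add(Map.entry(w, 1));
                    return out;
                },
                (word, ones) -> ones.size());
        System.out.println(counts); // {the=2, cat=2, sat=1}
    }
}
```

In real Hadoop the grouping is done by the framework across machines; here it is a single `LinkedHashMap`, which is enough to reason about what a job computes.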
Counting atoms

•  The chemical version of the word counting task

[Figure: arbitrary line numbers (K1) paired with SMILES (V1), e.g.
"1, Nc1ccc2ncccc2c1N", are mapped to (atom symbol (K2), occurrence (V2))
records; the reducer sums each symbol's occurrence list (K2, list(V2))
into final counts (K3, V3), e.g. N,100 and C,5684.]
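An in-memory sketch of this job, again without Hadoop. The regex tokenizer below is a hypothetical stand-in for a real SMILES parser such as ChemAxon's MolImporter, and only recognizes a handful of element symbols; it exists purely to illustrate the (symbol, 1) → summed-count data flow.

```java
import java.util.*;
import java.util.regex.*;

// Toy version of the atom-counting job: the "map" step emits (symbol, 1)
// for every atom token in a SMILES string; the "reduce" step sums them.
// NOTE: the regex is NOT a real SMILES parser -- it stands in for
// chemaxon.formats.MolImporter purely to show the data flow.
public class AtomCount {
    // two-letter symbols first so "Cl" is not read as C + l
    private static final Pattern ATOM =
            Pattern.compile("Cl|Br|[BCNOSPFI]|[bcnops]");

    public static Map<String, Integer> count(List<String> smiles) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : smiles) {                       // map phase
            Matcher m = ATOM.matcher(s);
            while (m.find()) {
                String sym = m.group();
                // fold aromatic (lowercase) atoms into their element symbol
                sym = Character.toUpperCase(sym.charAt(0)) + sym.substring(1);
                counts.merge(sym, 1, Integer::sum);     // reduce phase (summing)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Nc1ccc2ncccc2c1N"))); // {C=9, N=3}
    }
}
```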
The Hadoop ecosystem

[Figure: the stack builds on Hadoop Common, with the Hadoop Distributed
Filesystem and the Map Reduce Engine at the core, surrounded by Chukwa,
Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, and Hive.]

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
  
Cheminformatics on Hadoop

•  Hadoop and Atom Counting
•  Hadoop and SD Files
•  Cheminformatics, Hadoop and EC2
•  Pig and Cheminformatics

   But are cheminformatics problems
   really big enough to justify all of this?
  
Simplifying Hadoop applications

•  Raw Hadoop programs can be tedious to write

SMARTS-based substructure search:

    package gov.nih.ncgc.hadoop;

    import chemaxon.formats.MolFormatException;
    import chemaxon.formats.MolImporter;
    import chemaxon.license.LicenseManager;
    import chemaxon.license.LicenseProcessingException;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Iterator;

    /**
     * SMARTS searching over a set of files using Hadoop.
     *
     * @author Rajarshi Guha
     */
    public class SmartsSearch extends Configured implements Tool {
        private final static IntWritable one = new IntWritable(1);
        private final static IntWritable zero = new IntWritable(0);

        public static class MoleculeMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private String pattern = null;
            private MolSearch search;
            private Text matches = new Text();

            public void configure(JobConf job) {
                try {
                    // load the ChemAxon license shipped via the distributed cache
                    Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                    BufferedReader reader =
                            new BufferedReader(new FileReader(licFiles[0].toString()));
                    StringBuilder license = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) license.append(line);
                    reader.close();
                    LicenseManager.setLicense(license.toString());
                } catch (IOException e) {
                } catch (LicenseProcessingException e) {
                }
                pattern = job.getStrings("pattern")[0];
                search = new MolSearch();
                try {
                    Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                    search.setQuery(queryMol);
                } catch (MolFormatException e) {
                }
            }

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                Molecule mol = MolImporter.importMol(value.toString());
                matches.set(mol.getName());
                search.setTarget(mol);
                try {
                    if (search.isMatching()) {
                        output.collect(matches, one);
                    } else {
                        output.collect(matches, zero);
                    }
                } catch (SearchException e) {
                }
            }
        }

        public static class SmartsMatchReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key,
                               Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                while (values.hasNext()) {
                    if (values.next().compareTo(one) == 0) {
                        result.set(1);
                        output.collect(key, result);
                    }
                }
            }
        }

        public int run(String[] args) throws Exception {
            JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
            jobConf.setJobName("smartsSearch");

            jobConf.setOutputKeyClass(Text.class);
            jobConf.setOutputValueClass(IntWritable.class);

            jobConf.setMapperClass(MoleculeMapper.class);
            jobConf.setCombinerClass(SmartsMatchReducer.class);
            jobConf.setReducerClass(SmartsMatchReducer.class);

            jobConf.setInputFormat(TextInputFormat.class);
            jobConf.setOutputFormat(TextOutputFormat.class);
            jobConf.setNumMapTasks(5);

            if (args.length != 4) {
                System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                System.exit(2);
            }

            FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
            FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
            jobConf.setStrings("pattern", args[2]);

            // make the license file available via the distributed cache
            DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

            JobClient.runJob(jobConf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
            System.exit(res);
        }
    }
Pig & Pig Latin

•  Pig Latin programs are much simpler to write and get translated to
   Hadoop code
•  SQL-like; requires a UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin:

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

UDF for SMARTS search:

    package gov.nih.ncgc.hadoop.pig;

    import chemaxon.formats.MolImporter;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    import java.io.IOException;

    public class SMATCH extends FilterFunc {
        // instantiated eagerly so exec() can reuse a single MolSearch
        static MolSearch search = new MolSearch();

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                Molecule queryMol = MolImporter.importMol(query, "smarts");
                search.setQuery(queryMol);
                search.setTarget(MolImporter.importMol(target, "smiles"));
                return search.isMatching();
            } catch (SearchException e) {
                e.printStackTrace();
            }
            return false;
        }
    }
Going beyond chunking?

•  All the preceding use cases are embarrassingly parallel
   –  Chunking the input data and applying the same operation to each chunk
   –  Very nice when you have a big cluster

   Are there algorithms in cheminformatics that
   can employ map-reduce at the algorithmic level?
  
Going beyond chunking?

•  Applications that make use of pairwise (or higher order) calculations
   could benefit from a map-reduce incarnation
   –  Doesn't necessarily avoid the O(N²) barrier
   –  Bioisostere identification is one case that could be rephrased as a
      map-reduce problem
•  Map-Reduce Design Patterns
  
Identifying MMPs

•  First step in identifying bioisosteres is to identify candidate matched
   molecular pairs
   –  Naïve all-pairs comparison
   –  Predefined list of transformations
      •  Birch et al, BMCL, 2009
   –  Fragment intersection
      •  Hussain et al, JCIM, 2010
   –  MCS-based approaches (e.g., WizePairZ)
      •  Warner et al, JCIM, 2010
  
Naïve Bioisostere evaluation

N molecules → N(N-1)/2 comparisons

[Figure: every molecule is compared against every other molecule]
Scaffold seeding

Seed Fragment: [structure]

Members: [structures of the molecules containing the seed fragment]
Scaffold seeded bioisosteres

[Figure: comparisons are performed only within each scaffold series,
M(M-1)/2 comparisons per series]
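The arithmetic behind the two pictures can be sketched in a few lines. The class and method names below are hypothetical; the numbers in main are ballpark figures taken from the ChEMBL analysis described later in the deck.

```java
import java.util.*;

// Back-of-the-envelope comparison counts: a naive MMP search does
// N(N-1)/2 molecule comparisons, while scaffold seeding only compares
// members within each series: sum over scaffolds of M(M-1)/2.
public class PairCounts {
    public static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static long seeded(List<Long> membersPerScaffold) {
        long total = 0;
        for (long m : membersPerScaffold) total += allPairs(m);
        return total;
    }

    public static void main(String[] args) {
        // 235,693 molecules compared naively...
        System.out.println(allPairs(235_693L));      // ~2.8e10 comparisons
        // ...vs roughly 231,875 scaffolds averaging 7 members each
        List<Long> scaffolds = Collections.nCopies(231_875, 7L);
        System.out.println(seeded(scaffolds));       // ~4.9e6 comparisons
    }
}
```

Same asymptotic behavior, but the constant drops by several orders of magnitude when the scaffold series are small.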
Seeded bioisosteres – MR style

MAP
•  Do pairwise MCS analysis on scaffold series
•  For each pair output SMIRKS transform and the pair of SMILES

REDUCE
•  Collect pairs of SMILES for a given SMIRKS
•  Store in DB, or
•  Filter by activity, or
•  …
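The reduce side described above amounts to grouping molecule pairs by their SMIRKS transform. A minimal in-memory sketch, with a hypothetical record layout and illustrative (not real MCS-derived) SMIRKS/SMILES strings:

```java
import java.util.*;

// Sketch of the shuffle/reduce step: map emits records of
// {smirks, smiles1, smiles2}; reduce collects, per transform,
// all molecule pairs exhibiting it.
public class TransformGrouper {
    public static Map<String, List<String[]>> group(List<String[]> mapOutput) {
        Map<String, List<String[]>> byTransform = new LinkedHashMap<>();
        for (String[] rec : mapOutput) {
            // rec[0] is the SMIRKS key; rec[1], rec[2] are the SMILES pair
            byTransform.computeIfAbsent(rec[0], k -> new ArrayList<>())
                       .add(new String[]{rec[1], rec[2]});
        }
        return byTransform;
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
                new String[]{"[*:1]F>>[*:1]Cl", "CCF", "CCCl"},
                new String[]{"[*:1]F>>[*:1]Cl", "c1ccccc1F", "c1ccccc1Cl"});
        // both pairs end up under the same F -> Cl transform key
        System.out.println(group(emitted).get("[*:1]F>>[*:1]Cl").size()); // 2
    }
}
```

In the actual workflow the grouped output would land in a store such as HBase, keyed by transform, ready for activity-based filtering.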
Does seeding help?

•  Doesn't bypass the O(N²) barrier – does reduce the constant
•  Depends on how many scaffolds and the number of members for each
   scaffold
•  Certainly useful when there are a few members per scaffold
•  Highly populated scaffolds can throw things off

[Plot: log number of pairwise comparisons vs. log number of molecules,
comparing the all-pairs method with seeding at 7, 21, and 100 members
per scaffold]
Data

•  Exhaustively fragmented ChEMBL 13
•  Identified scaffolds with

      N_members / N_scaffold ≥ 1.8

•  Ended up with 231,875 scaffolds
   –  Covers 235,693 unique molecules
   –  Average of 7 members per scaffold
   –  95% of scaffolds had < 21 members
   –  99.5% had < 74 members
      •  The remaining 0.5% are a bit problematic

[Plot: log comparisons for the all-pairs vs. seeded methods]
Timing experiments

•  Selected 50 scaffolds with 10 or fewer members
•  Configured so as to have ~5 maps
•  Effective running time for the entire job is 3.8 min on Hadoop
   –  Only needed 5 of 8 map slots on our "cluster"
•  Takes ~6 min without Hadoop

[Plot: per-job running time (s) for the five jobs]
Timing experiments

•  Selected 1000 scaffolds with 20 or fewer members
   –  Ran with 10 scaffolds / map
•  Hadoop run time was ~2 hr
   –  Most maps were fast (< 20 sec)
•  Serial evaluation would be > 7 hr

[Plot: histogram of the number of jobs vs. log time (s)]
A M-R workflow

•  We're currently focused on just the MMP step as a MR example
•  Could also include the fragmentation step as part of the workflow
   –  But a pre-calculated set of scaffolds is more sensible
•  Store transformations and members in HBase
•  Link with activity data and apply structure & activity filters on
   candidate pairs
What Hadoop is not for

•  Doesn't replace an actual database
•  It's not uniformly fast or efficient
•  Not good for ad hoc or real-time analysis
•  Generally not effective unless dealing with massive datasets
•  Not all algorithms are amenable to the map-reduce method
Conclusions

•  Cheminformatics applications can be rehosted or rewritten to take
   advantage of cloud resources
   –  Remotely hosted
   –  Embarrassingly parallel / chunked
   –  Map/reduce
•  Ability to process larger structure collections lets us explore more
   chemical space
•  "Big data" isn't really that big in chemistry
Conclusions

•  Q: But are cheminformatics problems really big enough to justify all of this?
•  A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

•  Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
•  A: Yes – especially when we consider problems with a combinatorial flavor
https://github.com/rajarshi/chem.hadoop
