This document discusses using cloud computing resources for cheminformatics applications. It describes how Hadoop and MapReduce can be used to perform large-scale parallel computations on chemical data and databases. Specific examples include counting atoms in large datasets using MapReduce and performing substructure searches with SMARTS queries on Hadoop. The document also compares different approaches to programming Hadoop applications and shows how Pig Latin can simplify writing cheminformatics jobs for Hadoop.
Recent developments in Graphics Processing Units (GPUs) have opened a new challenge in harnessing their computing power as a new general-purpose computing paradigm through CUDA parallel programming. However, porting applications to CUDA remains a challenge for average programmers. We have developed a restructuring software compiler (RT-CUDA) with the best possible kernel optimizations to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA is based upon a set of compiler optimizations. It takes a C-like program and converts it into an optimized CUDA kernel, with user directives in a configuration file guiding the compiler. While the invocation of external libraries is not possible with the commercial OpenACC compiler, RT-CUDA allows transparent invocation of the most optimized external math libraries like cuSparse and cuBLAS. For this, RT-CUDA uses interfacing APIs, error handling interpretation, and user-transparent programming. This enables efficient design of linear algebra solvers (LAS). Evaluation of RT-CUDA has been performed on a Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as the programming of solvers of systems of linear equations like Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers like OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) which are critical for the Oil & Gas industry. We expect RT-CUDA to be needed by many industries dealing with science and engineering simulation on massively parallel computers like NVIDIA GPUs.
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us... (Xavier Llorà)
Data-intensive computing has positioned itself as a valuable programming paradigm to efficiently approach problems requiring processing very large volumes of data. This paper presents a pilot study about how to apply the data-intensive computing paradigm to evolutionary computation algorithms. Two representative cases (selectorecombinative genetic algorithms and estimation of distribution algorithms) are presented, analyzed, and discussed. This study shows that equivalent data-intensive computing evolutionary computation algorithms can be easily developed, providing robust and scalable algorithms for the multicore-computing era. Experimental results show how such algorithms scale with the number of available cores without further modification.
The Fundamentals Guide to HDP and HDInsight (Gert Drapers)
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve data problems ranging from ETL and data warehouses to event processing pipelines. As Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ... (Big Data Spain)
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/cloudMC-a-cloud-computing-map-reduce-implementation-for-radiotherapy/ruben-jimenez-and-hector-miras
Hanborq Optimizations on Hadoop MapReduce (Hanborq Inc.)
A Hanborq-optimized Hadoop distribution with high-performance MapReduce. It is the core part of HDH (Hanborq Distribution with Hadoop for Big Data Engineering).
Processing big data quickly and efficiently with Apache Nemo
- Wonook Song, Youngseok Yang (Software Platform Lab, Department of Computer Science and Engineering, Seoul National University)
Overview #
Apache Nemo is a system that optimizes how big data applications are executed in a distributed manner, adapting to diverse resource environments and data characteristics. When handling geo-distributed resources, transient resources, large data shuffles, and skewed data, Apache Nemo shows significantly higher performance than Apache Spark.
Contents #
Case studies of Apache Nemo's optimizations
Apache Nemo's distributed execution process
Future research directions
Talk on the upcoming Mahout nearest neighbor framework, focusing particularly on the k-means acceleration provided by the streaming k-means implementation.
The design of chemical libraries is usually informed by pre-existing characteristics and desired features. On the other hand, assessing the prospective performance of a new library is more difficult. Importantly, a given screening library is often screened in a variety of systems which can differ in cell lines, readouts, formats and so on. In this study we explore to what extent pre-existing libraries can shed light on the relation between library activity and assay features. Using an ontology such as the BAO, it is possible to construct a hierarchy of annotations associated with an assay. Based on this annotation hierarchy we can then ask how likely molecules associated with a specific annotation are to be identified as active. To allow generalization we consider substructural features, as represented by a structural key fingerprint, rather than whole molecules. We employ a Bayesian framework to quantify the association between a substructural feature and a given assay annotation, using a set of NCGC assays that have been annotated with BAO terms. We discuss our approach to training the Bayesian model and describe benchmarks that characterize model performance relative to the position of the annotation in the BAO hierarchy. Finally we discuss the role of this approach in a library design workflow that includes traditional design features such as chemical space coverage and physicochemical properties but also takes into account screening platform features.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerabilities and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Dev Dives: Train smarter, not harder - active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk aims to encourage a more independent approach to using PHP frameworks, moving towards more flexible and future-proof PHP development.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
1. Cloudy with a Touch of Cheminformatics
Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science
Chemaxon UGM, September 26th, 2012, Wellesley, MA
2. Parallel computing in the cloud
• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need
• Chemistry applications don't usually have very dynamic loads
• But large scale resources are an opportunity for large scale (parallel) computations
3. All HPC is not equal
• Legacy HPC: use cloud resources in the same way as a local cluster; MIT StarCluster makes this easy to do
• Cloudy HPC: make use of cloud capabilities; spot instances, SNS, SQS, SimpleDB, S3, etc.
• Big Data HPC: huge datasets; candidates for map-reduce; old algorithms, new infrastructure; involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
4. Big data & cheminformatics
• Computation over large chemical databases
  – PubChem, ChEMBL, GDB-13, …
• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models & predictions over large data
• Fundamentally, "big chemical data" lets us explore larger chemical spaces
5. Map-Reduce
[Diagram: input splits (Split 0, 1, 2) each feed a Map task; map output is copied and sorted, then merged into Reduce tasks that write output parts (Part 0, Part 1).]
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Tom White, Hadoop: The Definitive Guide, 3rd Ed., O'Reilly
6. Counting atoms
• The chemical version of the word counting task
[Diagram: input records of (arbitrary line number K1, SMILES V1), e.g. "1, Nc1ccc2ncccc2c1N" and "2, Cl.CC1CCc2nc3ccccc3c(C)c2C1", MAP to (symbol K2, occurrence V2) pairs; the shuffle groups these into (symbol K2, occurrence list V2), e.g. "N, list(1,1,1,1,...)"; REDUCE sums each list to give (symbol K3, count V3), e.g. "N,100" and "C,5684".]
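The atom-counting job above can be sketched as a plain in-memory map and reduce pass, without Hadoop. This is only an illustrative sketch: the SMILES handling below is a deliberate simplification (only Cl/Br are recognized as two-letter elements, aromatic c/n/o/s are folded into C/N/O/S, and bracket atoms are ignored); a real job would parse molecules with a toolkit.

```java
import java.util.*;
import java.util.stream.*;

// In-memory sketch of the atom-counting map-reduce job. The SMILES
// "parsing" is a deliberate simplification for illustration only.
public class AtomCount {

    // MAP: one SMILES record -> list of (symbol, 1) pairs
    static List<Map.Entry<String, Integer>> map(String smiles) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (c == 'C' && i + 1 < smiles.length() && smiles.charAt(i + 1) == 'l') {
                out.add(Map.entry("Cl", 1)); i++;
            } else if (c == 'B' && i + 1 < smiles.length() && smiles.charAt(i + 1) == 'r') {
                out.add(Map.entry("Br", 1)); i++;
            } else if (Character.isUpperCase(c)) {
                out.add(Map.entry(String.valueOf(c), 1));
            } else if ("cnos".indexOf(c) >= 0) { // aromatic atom -> parent element
                out.add(Map.entry(String.valueOf(Character.toUpperCase(c)), 1));
            } // ring closures, bonds, branches, '.' separators are ignored
        }
        return out;
    }

    // SHUFFLE + REDUCE: group the (symbol, 1) pairs by symbol and sum them
    static Map<String, Integer> reduce(List<String> dataset) {
        return dataset.stream()
                .flatMap(s -> map(s).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> smiles = List.of(
                "Nc1ccc2ncccc2c1N",
                "Cl.CC1CCc2nc3ccccc3c(C)c2C1");
        System.out.println(reduce(smiles)); // counts: C=24, N=4, Cl=1
    }
}
```

The two example SMILES are the ones on the slide; the same shuffle-and-sum structure is what Hadoop provides at scale.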
7. The Hadoop ecosystem
[Diagram: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, and Hive sit on top of the Map Reduce engine and the Hadoop Distributed Filesystem, which build on Hadoop Common.]
Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
8. Cheminformatics on Hadoop
• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics
But are cheminformatics problems really big enough to justify all of this?
9. Simplifying Hadoop applications
• Raw Hadoop programs can be tedious to write

SMARTS-based substructure search:

    package gov.nih.ncgc.hadoop;

    import chemaxon.formats.MolFormatException;
    import chemaxon.formats.MolImporter;
    import chemaxon.license.LicenseManager;
    import chemaxon.license.LicenseProcessingException;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Iterator;

    /**
     * SMARTS searching over a set of files using Hadoop.
     *
     * @author Rajarshi Guha
     */
    public class SmartsSearch extends Configured implements Tool {
        private final static IntWritable one = new IntWritable(1);
        private final static IntWritable zero = new IntWritable(0);

        public static class MoleculeMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private String pattern = null;
            private MolSearch search;
            private Text matches = new Text();

            public void configure(JobConf job) {
                try {
                    // load the ChemAxon license from the distributed cache
                    Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                    BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                    StringBuilder license = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) license.append(line);
                    reader.close();
                    LicenseManager.setLicense(license.toString());
                } catch (IOException e) {
                } catch (LicenseProcessingException e) {
                }
                pattern = job.getStrings("pattern")[0];
                search = new MolSearch();
                try {
                    Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                    search.setQuery(queryMol);
                } catch (MolFormatException e) {
                }
            }

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                Molecule mol = MolImporter.importMol(value.toString());
                matches.set(mol.getName());
                search.setTarget(mol);
                try {
                    if (search.isMatching()) {
                        output.collect(matches, one);
                    } else {
                        output.collect(matches, zero);
                    }
                } catch (SearchException e) {
                }
            }
        }

        public static class SmartsMatchReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                while (values.hasNext()) {
                    if (values.next().compareTo(one) == 0) {
                        result.set(1);
                        output.collect(key, result);
                    }
                }
            }
        }

        public int run(String[] args) throws Exception {
            if (args.length != 4) {
                System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                System.exit(2);
            }
            JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
            jobConf.setJobName("smartsSearch");
            jobConf.setOutputKeyClass(Text.class);
            jobConf.setOutputValueClass(IntWritable.class);
            jobConf.setMapperClass(MoleculeMapper.class);
            jobConf.setCombinerClass(SmartsMatchReducer.class);
            jobConf.setReducerClass(SmartsMatchReducer.class);
            jobConf.setInputFormat(TextInputFormat.class);
            jobConf.setOutputFormat(TextOutputFormat.class);
            jobConf.setNumMapTasks(5);
            FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
            FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
            jobConf.setStrings("pattern", args[2]);
            // make the license file available via the distributed cache
            DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);
            JobClient.runJob(jobConf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
            System.exit(res);
        }
    }
10. Pig & Pig Latin
• Pig Latin programs are much simpler to write and get translated to Hadoop code

SMARTS search in Pig Latin:

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

• SQL-like, requires a UDF to be implemented to perform non-standard tasks

UDF for SMARTS search:

    package gov.nih.ncgc.hadoop.pig;

    import chemaxon.formats.MolImporter;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    import java.io.IOException;

    public class SMATCH extends FilterFunc {
        static MolSearch search = new MolSearch(); // initialized up front so exec() can reuse it

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                Molecule queryMol = MolImporter.importMol(query, "smarts");
                search.setQuery(queryMol);
                search.setTarget(MolImporter.importMol(target, "smiles"));
                return search.isMatching();
            } catch (SearchException e) {
                e.printStackTrace();
            }
            return false;
        }
    }
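The FilterFunc pattern behind SMATCH can be mimicked without Pig or ChemAxon. In this sketch, the FilterFunc-like interface, the class and method names, and the substring match standing in for a real SMARTS search are all illustrative assumptions, not the actual Pig API.

```java
import java.util.*;
import java.util.stream.*;

// Toy stand-in for the Pig FilterFunc pattern on the slide: a predicate
// object is applied to each tuple of a relation, keeping rows for which
// it returns true. Substring matching stands in for the real SMARTS
// substructure search, which needs a cheminformatics toolkit.
public class FilterUdfSketch {

    // analogous in spirit to org.apache.pig.FilterFunc.exec(Tuple)
    interface FilterFunc { boolean exec(List<String> tuple); }

    static final FilterFunc SMATCH = tuple -> {
        if (tuple == null || tuple.size() < 2) return false;
        String target = tuple.get(0);   // the "smiles" column
        String query  = tuple.get(1);   // the query pattern
        return target.contains(query);  // placeholder for MolSearch
    };

    // corresponds to: B = filter A by SMATCH(smiles, query)
    static List<String> filter(List<String> relation, String query, FilterFunc f) {
        return relation.stream()
                .filter(smiles -> f.exec(List.of(smiles, query)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> a = List.of("NC(=O)C(=O)N", "CCO", "CC(=O)NC(=O)C(=O)N");
        // keeps the two strings containing the oxamide-like substring
        System.out.println(filter(a, "C(=O)C(=O)N", SMATCH));
    }
}
```

The point of the pattern is the separation of concerns: the relation-level plumbing (load, filter, store) stays declarative, while only the domain-specific predicate needs custom code.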
11. Going beyond chunking?
• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster
Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
12. Going beyond chunking?
• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
  – Doesn't necessarily avoid the O(N²) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem
• Map-Reduce Design Patterns
13. Identifying MMPs
• The first step in identifying bioisosteres is to identify candidate matched molecular pairs
  – Naïve all-pairs comparison
  – Predefined list of transformations (Birch et al, BMCL, 2009)
  – Fragment intersection (Hussain et al, JCIM, 2010)
  – MCS-based approaches, e.g., WizePairZ (Warner et al, JCIM, 2010)
17. Seeded bioisosteres – MR style
MAP
• Collect SMILES for a given scaffold
• Do pairwise MCS analysis on pairs of series
• For each pair output SMIRKS and the pair of SMILES
REDUCE
• For each SMIRKS: store in DB, or transform
• Filter by activity, or
• …
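The map step sketched above can be illustrated with plain Java: group members by a precomputed scaffold key and enumerate candidate pairs only within each group. The scaffold keys and member names are placeholders, and the MCS/SMIRKS derivation is stubbed out (a cheminformatics toolkit would provide it).

```java
import java.util.*;

// Sketch of the seeded matched-pair enumeration: pairwise comparison
// happens only within a scaffold's member list, never across scaffolds.
// Scaffold keys and member names are illustrative placeholders.
public class SeededPairs {

    // MAP-side work for one scaffold group: emit every unordered pair
    static List<String[]> pairsForScaffold(List<String> members) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < members.size(); i++)
            for (int j = i + 1; j < members.size(); j++)
                pairs.add(new String[]{members.get(i), members.get(j)});
        return pairs;
    }

    public static void main(String[] args) {
        // hypothetical scaffold -> member map; each group is small, so the
        // quadratic pair enumeration stays cheap per map task
        Map<String, List<String>> byScaffold = new LinkedHashMap<>();
        byScaffold.put("scaffold-A", List.of("mol-1", "mol-2", "mol-3"));
        byScaffold.put("scaffold-B", List.of("mol-4", "mol-5"));

        for (Map.Entry<String, List<String>> e : byScaffold.entrySet())
            for (String[] p : pairsForScaffold(e.getValue()))
                // in the real job: derive a SMIRKS for (p[0], p[1]) and
                // emit (SMIRKS, pair) for the reducer to store or filter
                System.out.println(e.getKey() + ": " + p[0] + " <-> " + p[1]);
    }
}
```

A group of n members yields n(n-1)/2 pairs, which is why the approach works best when scaffolds have few members each.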
18. Does seeding help?
• Doesn't bypass the O(N²) barrier – does reduce the constant
• Depends on how many scaffolds and the number of members for each scaffold
• Certainly useful when there are a few members per scaffold
• Highly populated scaffolds can throw things off
[Plot: log Number of pairwise comparisons (1e+03 to 1e+14) vs log Number of molecules (1e+03 to 1e+07); methods: all, seeded.7, seeded.21, seeded.100]
19. Data
• Exhaustively fragmented ChEMBL 13
• Identified scaffolds with N members ! 1.8 N scaffold
• Ended up with 231,875 scaffolds
  – Covers 235,693 unique molecules
  – Average of 7 members per scaffold
  – 95% of scaffolds had < 21 members
  – 99.5% had < 74 members
• The 0.5% are a bit problematic
[Plot: log Comparisons (1e+02 to 1e+08) for the All vs Seeded methods]
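A back-of-the-envelope check of the seeding benefit, using the numbers on this slide (235,693 molecules; 231,875 scaffolds; an average of 7 members per scaffold). Treating every scaffold as having exactly the average membership is an assumption of this sketch; the real distribution is skewed, which is exactly why highly populated scaffolds can throw things off.

```java
// Rough estimate of the comparison-count reduction from seeding,
// using the figures quoted on the slide. The uniform per-scaffold
// average is a simplifying assumption, not the real distribution.
public class SeedingEstimate {
    static long choose2(long n) { return n * (n - 1) / 2; }

    public static void main(String[] args) {
        long molecules  = 235_693;  // unique molecules covered
        long scaffolds  = 231_875;  // scaffolds identified
        long avgMembers = 7;        // average members per scaffold

        long naive  = choose2(molecules);              // all-pairs comparisons
        long seeded = scaffolds * choose2(avgMembers); // within-scaffold only

        System.out.printf("naive:   %.2e comparisons%n", (double) naive);
        System.out.printf("seeded:  %.2e comparisons%n", (double) seeded);
        System.out.printf("reduction: ~%dx%n", naive / seeded);
    }
}
```

Under these assumptions the all-pairs count is on the order of 1e10 while the seeded count is on the order of 1e6, a reduction of several thousand-fold in the constant, even though the method is still quadratic within each scaffold.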
20. Timing experiments
• Selected 50 scaffolds with 10 or fewer members
• Configured so as to have ~ 5 maps
• Effective running time for the entire job is 3.8 min on Hadoop
  – Only needed 5 of 8 map slots on our "cluster"
• Takes ~ 6 min without Hadoop
[Bar chart: Time (s), 0 to 200, for job numbers 1 to 5]
21. Timing experiments
• Selected 1000 scaffolds with 20 or fewer members
  – Ran with 10 scaffolds / map
• Hadoop run time was ~ 2 hr
  – Most maps were fast (< 20 sec)
• Serial evaluation would be > 7 hr
[Histogram: Number of Jobs (0 to 15) vs log Time (s) (1.0 to 4.0)]
22. A M-R workflow
• We're currently focused on just the MMP step as a MR example
• Could also include the fragmentation step as part of the workflow
  – But a pre-calculated set of scaffolds is more sensible
• Store transformations and members in HBase
• Link with activity data and apply structure & activity filters on candidate pairs
23. What Hadoop is not for
• Doesn't replace an actual database
• It's not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Generally not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method
24. Conclusions
• Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce
• Ability to process larger structure collections lets us explore more chemical space
• "Big data" isn't really that big in chemistry
25. Conclusions
• Q: But are cheminformatics problems really big enough to justify all of this?
• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
• A: Yes – especially when we consider problems with a combinatorial flavor