Java tutorial: Programmatic Access to Molecular Interactions
v.1.0 - 10/07/2013
Bruno Aranda (baranda@ebi.ac.uk)
Samuel Kerrien (skerrien@ebi.ac.uk)
Rafael C Jimenez (rafael@ebi.ac.uk)
Introduction to Molecular Interactions
Studying molecular interactions provides valuable insights into understanding a molecule's
role inside a specific cell type. There are several families of experimental protocols
commonly used to identify molecular interactions:
● Complementation assays (e.g. 2-hybrid) measure the oligomerization-assisted
complementation of two fragments of a single protein which when united result in a
simple biological readout – the two protein fragments are fused to the potential
bait/prey interacting partners respectively. This methodology is easily scalable to high
throughput since it can yield very high numbers of coding sequences assayed in a
relatively simple experiment and a wide variety of interactions can be detected and
characterized following one single commonly used protocol. However, the proteins
are being expressed in an alien cell system with a loss of temporal and physiological
control of expression patterns, resulting in a large number of false positive
interactions being observed.
● Affinity-based assays (e.g. affinity chromatography, pull-down and
co-immunoprecipitation) rely on the strength of the interaction between two entities.
These techniques can be used on interactions which form under physiological
conditions but are only as good as the reagents and techniques used to identify the
participating proteins.
● Physical methods (e.g. X-ray crystallography and enzymatic assays) depend on the
properties of molecules to enable measurement of an interaction. High-quality data
can be produced but highly purified proteins are required, which has always proved a
rate-limiting step. The availability of automated chromatography systems and custom
robotic systems that streamline the whole process, from cell harvesting and lysis
through to sample clarification and chromatography, has changed this, and increasing
amounts of data are being generated by such experiments.
Molecular interactions are crucial components of the cellular process. In order to understand
this complex machinery, one needs to gather published data from various sources. Many
projects have initiated the collection of interaction data for this purpose since 2002. However,
the lack of standardisation previously made the task of aggregating datasets difficult. This
issue was addressed by the creation of the Molecular Interaction standard in 2004 by
members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome
Organization (HUPO). Furthermore, major database providers have come together with the
goal of exchanging data in order to optimise laborious curation tasks. Finally, tools and
frameworks have been created based on PSI-MI standards to facilitate the visualisation and
analysis of molecular interaction data.
Molecular interactions are generally represented in graphical networks with nodes
corresponding to the molecules and edges to the interactions. Although edges can vary in
length, most networks represent only undirected, binary interactions. Bioinformatics tools
and computational biology efforts in graph theory methods have been, and continue to be, part of
the knowledge discovery process in this field. Analysis of interaction networks involves many
challenges, due to the inherent complexity of these networks, the high noise level characteristic
of the data, and the presence of unusual topological phenomena. A variety of data-mining and
statistical techniques have been applied to effectively analyze interaction data and the resulting
networks. The major challenges for computational analysis of interaction networks remain:
● unreliability of large scale experiments;
● biological redundancy and multiplicity: a protein can have several different functions,
or may be included in one or more functional groups. In such instances
overlapping clusters should be identified in the protein-protein interaction (PPI)
networks; however, since conventional clustering methods generally produce pairwise
disjoint clusters, they may not be effective when applied to PPI networks;
● two proteins with different functions frequently interact with each other. Such frequent
connections between the proteins in different functional groups expand the topological
complexity of the PPI networks, posing difficulties to the detection of unambiguous
partitions.
Intensive research into the structural behaviour of such systems from a topological
perspective has highlighted features such as small-world properties (any two nodes can be
connected via a short path of a few links), scale-free degree distributions (a power-law
degree distribution indicating that a few hubs bind numerous small nodes), and hierarchical
modularity (hierarchical organization of modules). These features suggest
that a functional module in an interaction network represents a maximal set of functionally
associated molecules. In other words, it is composed of those molecules that are mutually
involved in a given biological process or function. In this model, the significance of a few hub
nodes is emphasized, and these nodes are viewed as the determinants of survival during
network perturbations and as the essential backbone of the hierarchical structure.
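To make the node, edge and hub vocabulary concrete, here is a minimal sketch of an undirected network of binary interactions that reports node degree and lists candidate hubs. The class name InteractionNetwork and the degree threshold are illustrative assumptions; this is not part of the course code or of any PSI library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: an undirected network of binary interactions. */
public class InteractionNetwork {
    // molecule identifier -> identifiers of its direct interaction partners
    private final Map<String, Set<String>> neighbours = new HashMap<>();

    /** Adds an undirected edge; adding the same pair twice counts once. */
    public void addInteraction(String a, String b) {
        neighbours.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Degree = number of distinct partners (0 for an unknown molecule). */
    public int degree(String molecule) {
        Set<String> partners = neighbours.get(molecule);
        return partners == null ? 0 : partners.size();
    }

    /** Molecules with at least minDegree partners, sorted by identifier. */
    public Set<String> hubs(int minDegree) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : neighbours.entrySet()) {
            if (entry.getValue().size() >= minDegree) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

What counts as a hub depends on the threshold chosen; in a scale-free network only a small fraction of nodes would be expected to pass a high threshold.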
a. export PATH=/lib/svn/bin:${PATH}
3. Open a new shell and type svn --version to check you have the right version.
Java IDE
Several major Java IDEs are available online; we only give information about
Eclipse and IntelliJ (our favorite in the IntAct team).
Download Eclipse: http://www.eclipse.org/downloads
Steps
1. Download and unpack Eclipse for Java Developer edition, then launch Eclipse;
2. Next, add support for Maven: open the Overview window (figure 1)
and click on the Eclipse Marketplace, then install the Maven integration plugin (figure 2).
Alternatively, you can click Help > Eclipse Marketplace;
3. Restart Eclipse so that our newly installed plugin becomes available.
Figure 1. Eclipse Overview
Figure 2. Maven plugin
Download IntelliJ: http://www.jetbrains.com/idea/download
Steps
1. Download and unpack IntelliJ;
2. Launch IntelliJ.
Note
IntelliJ has built-in Maven support, so no additional plugin needs to be installed at this stage.
intactcourse project structure
The intactcourse directory is structured as follows:
● pom.xml: Maven configuration for the project, including declaration of required
third-party libraries, plugins (e.g. Java compiler)...
● src/main/java: source code for the course including exercises and solutions;
● src/main/resources: configuration files and sample data files required to run
some exercises.
How to run an exercise class?
Each Java-based exercise is provided in a dedicated class file whose
package reflects the context of the exercise.
Example: intact.solution.psicquic.registry.ListingAllServices
Each of these classes has a main method, so running it is easily done as explained below:
Command line with Maven
Steps
1. Go into the intactcourse directory
2. Get the reference of the class you want to run, for instance:
intact.solution.psicquic.registry.ListingAllServices
3. Type:
mvn exec:java -Dexec.mainClass=intact.solution.psicquic.registry.ListingAllServices
Eclipse
Steps
1. Right click on the class you want to run and choose: Run As > Java Application
2. If not done already, the project will compile and the output of the class will be shown in
a console.
IntelliJ
Steps
1. Right click on the class you want to run and choose: Run
2. If not done already, the project will compile and the output of the class will be shown in
a console.
PSIMITAB
Learning objectives
● Learn about the various versions of the MITAB format, their strengths and shortcomings;
● Learn how to find MITAB content from IntAct and other resources;
● Gain practical experience of parsing MITAB.
Exercises
Exercise 1: Programmatic Access using the command line
No-frills access to the data, using the IntAct REST example URL.
Steps
1. Open the PSICQUIC Registry page in your web browser: http://bit.ly/psicquicregistry;
2. Each PSICQUIC service has an example link to demonstrate a sample REST query;
copy the URL of a service to your clipboard as shown in Figure 2.1 below;
3. Open this URL in the address bar of a new window in your web browser:
http://cicblade.dep.usal.es/psicquicws/webservices/current/search/interactor/P00533;
4. You should see MITAB data being downloaded from the APID PSICQUIC service.
Figure 2.1. Copying the REST example URL from the APID PSICQUIC service.
Question 1: Using the command-line tool wget, download MITAB data from IntAct and store
it in a file called intact.tsv.
Information
Should you not be familiar with wget, type man wget in a terminal to get access to
its documentation. Typically, downloading the content of a URL into a file is done as
follows:
wget "<URL>" -O output.txt
Bear in mind that a URL can contain special characters (e.g. &) that
conflict with your command line. To avoid this, make sure to surround
your URL with double quotes.
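For readers who prefer to stay in Java, the same download can be sketched with the standard library alone. The class and method names below are illustrative assumptions, and the URL is whatever REST example you copied from the Registry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative only: stores the content returned by a URL in a local file. */
public class DownloadToFile {

    /**
     * Copies whatever the URL returns into target, overwriting an existing file.
     * Returns the number of bytes written.
     */
    public static long download(String url, Path target) throws IOException {
        // URI.create rejects malformed input (e.g. unescaped spaces) early
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Inside a Java string literal the & character has no special meaning, so the shell quoting concern does not arise here. A call would look like DownloadToFile.download(restUrl, Path.of("intact.tsv")), where restUrl holds the copied URL.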
Question 2: Using the grep command, count how many interactions involve the protein
P03120.
Information
There are many ways to perform this task but here are a few command line tools that
may help you:
○ cat <file>: prints the content of the file given as parameter on the standard
output;
○ grep <value>: prints on the standard output all lines containing the value
given as parameter;
○ wc -l: prints a count of the lines received on standard input.
UNIX commands can be chained together using the symbol | so that the output
of the preceding command is given as input to the following one. For instance cat
file.txt | wc -l would print the count of lines in file.txt
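The same filter-then-count pipeline can be sketched in Java with streams. This is a rough equivalent under stated assumptions: the class name is hypothetical, and the match is a literal substring test, so a longer identifier that contains the value would also be counted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Illustrative only: counts the lines that contain a given value. */
public class CountMatchingLines {

    /** Keeps the lines containing value (the grep step), then counts them (the wc step). */
    public static long count(Stream<String> lines, String value) {
        return lines.filter(line -> line.contains(value)).count();
    }

    /** Same count, reading the lines lazily from a file. */
    public static long count(Path file, String value) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return count(lines, value);
        }
    }
}
```

For the question above, the call would be CountMatchingLines.count(Path.of("intact.tsv"), "P03120").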
Exercise 2: Using the Java library for MITAB
In this section, you will benefit from using a Java IDE such as Eclipse or IntelliJ to facilitate
the use of Maven, source code compilation, running the exercises...
This Java library takes care of parsing MITAB 2.5 data, turning each MITAB line into a
Java object.
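To illustrate what turning a line into an object means, here is a minimal hand-rolled sketch that uses only the standard library. It is not the object model of the MITAB library; the class MitabLine and its fields are assumptions made for illustration, relying on the MITAB 2.5 layout of 15 tab-separated columns.

```java
/** Illustrative only: a tiny view of one MITAB 2.5 line (15 tab-separated columns). */
public final class MitabLine {
    public final String idA;             // column 1: identifier of interactor A
    public final String idB;             // column 2: identifier of interactor B
    public final String detectionMethod; // column 7: interaction detection method
    public final String publicationIds;  // column 9: publication identifier(s)

    private MitabLine(String idA, String idB, String detectionMethod, String publicationIds) {
        this.idA = idA;
        this.idB = idB;
        this.detectionMethod = detectionMethod;
        this.publicationIds = publicationIds;
    }

    /** Splits one line on tabs and keeps a few columns of interest. */
    public static MitabLine parse(String line) {
        // limit -1 keeps trailing empty columns instead of dropping them
        String[] cols = line.split("\t", -1);
        if (cols.length < 15) {
            throw new IllegalArgumentException(
                    "Expected at least 15 columns but found " + cols.length);
        }
        return new MitabLine(cols[0], cols[1], cols[6], cols[8]);
    }
}
```

A real parser also has to handle the structure inside each column (several values separated by |, database prefixes, quoted names), which is the work the library does for you.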
Question 1: Can you write a class that reads a MITAB data file/stream and prints out the
count of interactions parsed?
intact.exercise.chapter2.exercise2.Q1_ReadWholeFile
Question 2: Loading the whole content of a data file/stream into
memory can cause problems if the volume of data is large. To avoid this, the class
psidev.psi.mi.tab.PsimiTabReader also allows developers to iterate over the data.
Now write another program (similar to Question 1) that implements this more
memory-efficient approach.
intact.exercise.chapter2.exercise2.Q2_ClientAndMitabBetterMemory
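The idea behind iterating rather than loading everything can be sketched without the library: read one line, process it, and let it go before reading the next. The class below is a hypothetical stand-in, not the PsimiTabReader API, and it assumes that a header line, if present, starts with #.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Illustrative only: counts MITAB lines while holding a single line in memory. */
public class StreamingCount {

    public static long countInteractions(Reader source) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumption: blank lines and '#' header lines are not interactions
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                count++; // a real program would parse and process the line here
            }
        }
        return count;
    }
}
```

Memory use stays roughly constant whatever the size of the input, because no collection of lines is ever built.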
Question 3: Now that we have read the content of a MITAB file/stream, we can attempt
to write this content back to a file. Write a program that writes the MITAB content read into a
file.
intact.exercise.chapter2.exercise2.Q3_WriteToFile
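As a rough picture of the read-then-write flow, the following sketch copies MITAB lines from any reader into a file using only the standard library. The class name is an assumption; the exercise itself is meant to be solved with the writer classes of the MITAB library, which this sketch does not use.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: writes the lines read from a source into a file. */
public class CopyMitab {

    /** Returns the number of lines written to target. */
    public static long copy(Reader source, Path target) throws IOException {
        long written = 0;
        try (BufferedReader reader = new BufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // platform line separator
                written++;
            }
        }
        return written;
    }
}
```

Because reading and writing are interleaved line by line, this also works for inputs that do not fit in memory.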
Exercise 3: Indexing MITAB data locally using Lucene
Question 1: This library offers the possibility to index a MITAB dataset using Lucene.
This enables users to run local MIQL queries, thus easing data processing. Write a program
that indexes the provided MITAB data file.
intact.exercise.chapter2.exercise3.Q1_IndexMitabFile
Information
Apache Lucene is a high-performance, full-featured text search engine library written
entirely in Java. It is a technology suitable for nearly any application that requires
full-text search. You can find more information about Lucene on the Apache web site.
A reference to MIQL is provided below (Fig. 2.2.).
Figure 2.2. MIQL fields reference.
Question 2: Write a program that queries the local Lucene index to search for
interaction evidence involving specific molecules, for instance by UniProt identifier O45406
or PubMed id 17129783.
intact.exercise.chapter2.exercise3.Q2_QueryLocalIndexUsingMIQL
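MIQL queries are plain strings of field:value terms, so they can be assembled before being handed to whichever search call you use. The helper below is a hypothetical sketch; the field names id and pubid are assumed from the MIQL reference and should be checked against it.

```java
/** Illustrative only: assembles MIQL query strings from field:value terms. */
public class MiqlQuery {

    /** Builds one field:value term, quoting values that contain whitespace. */
    public static String term(String field, String value) {
        boolean needsQuotes = value.chars().anyMatch(Character::isWhitespace);
        return field + ":" + (needsQuotes ? "\"" + value + "\"" : value);
    }

    /** Joins terms so that a hit on any one of them matches. */
    public static String anyOf(String... terms) {
        return String.join(" OR ", terms);
    }
}
```

For the two examples above, MiqlQuery.anyOf(MiqlQuery.term("id", "O45406"), MiqlQuery.term("pubid", "17129783")) yields the query id:O45406 OR pubid:17129783.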
Question 3: Like MITAB 2.6 and higher, the IntAct extended MITAB format has a
column that describes any complex expansion that may have been applied to generate
binary interactions. Write a program that reads an IntAct MITAB file and prints the following:
● total count of interactions;
● count of spoke-expanded interactions;
● count of experimentally identified binary interactions (i.e. not expanded).
intact.exercise.chapter2.exercise3.Q3_FilterSpokeInteractions
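The three counts can be sketched with plain string handling once the position of the expansion column is known. In the sketch below that position is a parameter because it is an assumption to verify against the format in use; the class name, the # header convention and the use of - for an empty column are likewise assumptions.

```java
import java.util.Locale;

/** Illustrative only: tallies interactions by complex expansion. */
public class ExpansionCounter {
    public long total;         // every interaction line seen
    public long spokeExpanded; // expansion column mentions "spoke"
    public long notExpanded;   // expansion column empty or "-"

    /**
     * @param lines           MITAB lines, one interaction per line
     * @param expansionColumn zero-based index of the expansion column (assumption)
     */
    public static ExpansionCounter count(Iterable<String> lines, int expansionColumn) {
        ExpansionCounter c = new ExpansionCounter();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and a header line
            }
            String[] cols = line.split("\t", -1);
            String value = expansionColumn < cols.length ? cols[expansionColumn].trim() : "-";
            c.total++;
            if (value.toLowerCase(Locale.ROOT).contains("spoke")) {
                c.spokeExpanded++;
            } else if (value.isEmpty() || value.equals("-")) {
                c.notExpanded++;
            }
            // any other expansion method only contributes to the total
        }
        return c;
    }
}
```

Interactions produced by another expansion method fall into neither of the two sub-counts, so the sub-counts need not add up to the total.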
Summary
● Two Java clients are available: Simple and Universal.
● The Simple client is a no-frills client, just wrapping the REST URL and returning the
results as a stream.
● The Universal client is the original client and is more complex. It is based on SOAP and
returns the results in a fully-fledged object model.
● A combination of the Simple client and the MITAB library is recommended for maximum
performance and flexibility.
Exercises
Exercise 1: Using the PSICQUIC Universal client
Question 1: Could you write the code to query the interactions for brca2 from IntAct and
print the identifiers for molecule A and B in the console?
intact.exercise.chapter5.exercise1.Q1_PsicquicQuery
Question 2: SOAP access to PSICQUIC services has a hard limit of 200
interactions per query. Could you write some code to get all the interactions for PubMed id
16189514 from IntAct, which contains more than 2700 interactions?
intact.exercise.chapter5.exercise1.Q2_ProcessLargeDatasets
Information
You will need to write a loop to paginate the results and get them in batches.
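The shape of such a loop can be sketched independently of the client. In the sketch below, PageFetcher is a hypothetical stand-in for the client call that returns one batch starting at a given offset; it is not part of the PSICQUIC client API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: collects all results from a service that returns batches. */
public class Paginator {

    /** Hypothetical stand-in for a client call returning one batch of results. */
    public interface PageFetcher {
        List<String> fetch(int firstResult, int maxResults) throws Exception;
    }

    public static List<String> fetchAll(PageFetcher fetcher, int batchSize) throws Exception {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> all = new ArrayList<>();
        int first = 0;
        while (true) {
            List<String> batch = fetcher.fetch(first, batchSize);
            all.addAll(batch);
            if (batch.size() < batchSize) {
                break; // a short or empty batch means nothing is left
            }
            first += batchSize; // move the window to the next batch
        }
        return all;
    }
}
```

With a batch size of 200, a result set of just over 2700 interactions takes 14 requests. For very large result sets it is usually better to process each batch as it arrives than to accumulate everything in one list.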
Exercise 2: Using the PSICQUIC Simple client
Question 1: Could you write the code to download a MITAB stream from PSICQUIC
using the simple client? Print the MITAB for the publication with pubmed 16189514 from
IntAct in the console.
intact.exercise.chapter5.exercise2.Q1_SimplePsicquicQuery
Question 2: If you wanted to count the results for the above query before loading the
data, what could you do?
intact.exercise.chapter5.exercise2.Q2_CountSimplePsicquicQuery
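One approach is to ask the REST service for a count instead of the data. The sketch below only builds the request URL and parses the reply; it assumes the service accepts a format=count parameter and answers with a bare number, which should be checked against the PSICQUIC REST documentation. The class name is hypothetical and the base URL is supplied by the caller.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: builds a count request for a PSICQUIC REST search. */
public class CountQuery {

    /**
     * @param searchBaseUrl base of the search endpoint, ending in ".../current/search/"
     * @param miql          the MIQL query, e.g. "pubid:16189514"
     */
    public static String countUrl(String searchBaseUrl, String miql) {
        // URLEncoder targets form encoding, so turn '+' back into '%20' for a path segment
        String encoded = URLEncoder.encode(miql, StandardCharsets.UTF_8).replace("+", "%20");
        // assumption: format=count makes the service return only the number of hits
        return searchBaseUrl + "query/" + encoded + "?format=count";
    }

    /** Parses the response body, tolerating a trailing newline. */
    public static long parseCount(String body) {
        return Long.parseLong(body.trim());
    }
}
```

Knowing the count up front lets a program decide whether to stream the results, paginate them, or refine the query first.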
PSICQUIC Service Provider
Learning objectives
In this chapter you will learn how to:
● Create your own PSICQUIC service;
● Start the service locally on the fly;
● Search some data on your custom service.
Create a PSICQUIC Service with your data
There exists a PSICQUIC Reference Implementation (RI) to simplify the installation of new
PSICQUIC Services. The RI just requires a MITAB file as data source, which will be indexed
and made available as a service.
Help: How to install a PSICQUIC Service?
Detailed installation steps can be found at the PSICQUIC project wiki:
http://code.google.com/p/psicquic/wiki/HowToInstall
Summary
● A custom PSICQUIC service with your data is straightforward to install thanks to the
Reference Implementation;
● MITAB data is indexed and made available as a web service.
Exercises
To create the service, you will need the MITAB data in a file. For example:
Question 1: Could you create a file using PSICQUIC with all the interactions for the
publication 16189514 from IntAct in MITAB format?
We will use this data in our PSICQUIC Service. To create the Service you can follow these
steps:
Steps
1. Download and untar the Reference Implementation with the following commands:
wget http://psicquic.googlecode.com/files/psicquicws1.1.6src.tar.gz
tar xfz psicquicws1.1.6src.tar.gz
cd psicquicws1.1.6
2. Index your file using the RI, replacing <MITAB_FILE> with the name of your file. We
will also pass the path to the index to be created (e.g.
/tmp/mypsicquicindex). This index is essential for PSICQUIC as it will
contain all the needed data. After running this, the MITAB file can be discarded.
mvn clean compile -P createIndex \
-D psicquic.index=/tmp/mypsicquicindex \
-D hasHeader=false \
-D mitabFile=<MITAB_FILE>
3. You should see a BUILD SUCCESS message from Maven after a few seconds.
4. Start the service on the fly using Jetty, passing the path to the index:
mvn jetty:run -D psicquic.index=/tmp/mypsicquicindex
5. Once the message [INFO] Started Jetty Server appears, the
server is available. Don't stop it (e.g. with Control+C).
6. Open a browser and navigate to http://localhost:8080/psicquicws/webservices/ . If
you see a list of the SOAP and REST services, everything is working fine.
7. The URL http://localhost:8080/psicquicws/webservices/current/search/query/*, for
instance, will show all the data in your service.
Question 2: Using the browser, how many interactions are present in your service for
the query P25786?
Information
If you were an experimentalist or a provider and wanted your service to be publicly
available to everybody, you would just need to request inclusion into the Registry.
Course summary
In this course you should have learned the following:
● There are standard formats and tools to gather molecular interaction data;
● These standards have good library support in Java;
● Beyond IntAct, many other databases expose their data through PSICQUIC;
● You should feel confident using data currently available online in the
context of your own work.
Exercise 2: Using the Java library for MITAB
Question 1: Can you write a class that reads a MITAB data file/stream and prints out the
count of interactions parsed?
Answer 1: See intact.solution.chapter2.exercise2.Q1_ReadWholeFile
Question 2: Loading the whole content of a data file/stream into
memory can cause problems if the volume of data is large. To avoid this, the class
psidev.psi.mi.tab.PsimiTabReader also allows developers to iterate over the data.
Now write another program (similar to Question 1) that implements this more
memory-efficient approach.
Answer 2: See intact.solution.chapter2.exercise2.Q2_ClientAndMitabBetterMemory
Question 3: Now that we have read the content of a MITAB file/stream, we can attempt
to write this content back to a file. Write a program that writes the MITAB content read into a
file.
Answer 3: See intact.solution.chapter2.exercise2.Q3_WriteToFile
Exercise 3: Indexing MITAB data locally using Lucene
Question 1: This library offers the possibility to index a MITAB dataset using Lucene.
This enables users to run local MIQL queries, thus easing data processing. Write a program
that indexes the provided MITAB data file.
Answer 1: See intact.solution.chapter2.exercise3.Q1_IndexMitabFile
Question 2: Write a program that queries the local Lucene index to search for
interaction evidence involving specific molecules, for instance by UniProt identifier O45406
or PubMed id 17129783.
Answer 2: See intact.solution.chapter2.exercise3.Q2_QueryLocalIndexUsingMIQL
Question 3: Like MITAB 2.6 and higher, the IntAct extended MITAB format has a
column that describes any complex expansion that may have been applied to generate
binary interactions. Write a program that reads an IntAct MITAB file and prints the following:
● total count of interactions;
● count of spoke-expanded interactions;
● count of experimentally identified binary interactions (i.e. not expanded).
Answer 3: See intact.solution.chapter2.exercise3.Q3_FilterSpokeInteraction
Exercise 2 answers: Using PSICQUIC View
Question 1: Now that we got rid of those funny URLs, you can try the difference
between species:human and species:9606 using PSICQUIC View. The differences
can be seen more clearly now. DIP is a clear example. Could you explain what is happening?
Answer 1: DIP stores its human interactions using the scientific name “Homo sapiens”,
instead of the common name “human” that IntAct uses. Each provider prepares their own
data for PSICQUIC, which can lead to some discrepancies in what data is shown. The
service providers are collaborating to reach a common agreement on what information should
be present at a minimum, hence increasing the compatibility of some specific searches. In our
case, searching for species:9606 will always return better results as everybody uses the
taxid.
Chapter 4 answers: The PSICQUIC Registry
Exercise 1 answers: Direct access to the Registry
Question 1: How many services are available?
Answer 1: 16
Question 2: Could you get the same list in XML format?
Answer 2:
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS&format=xml
Question 3: Could you get the XML information only for the IntAct Service (filter the
other services)?
Answer 3:
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS&format=xml&name=IntAct
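The Registry URLs above differ only in their query parameters, so they can be assembled programmatically. The sketch below is a hypothetical helper that uses just the parameters seen in these answers (action, format, name); it does not attempt to cover the full Registry API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: builds PSICQUIC Registry URLs from query parameters. */
public class RegistryUrl {
    public static final String BASE =
            "http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry";

    /** Parameters are appended in the iteration order of the map. */
    public static String build(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return BASE + "?" + query;
    }
}
```

Passing a LinkedHashMap with action=STATUS, format=xml and name=IntAct, in that order, reproduces the URL given in Answer 3.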
Exercise 2 answers: Programmatic access to the Registry
Question 1: Could you list all the PSICQUIC Services and print their names in the console?
Answer 1: See intact.solution.chapter4.exercise2.Q1_ListingAllServices
Question 2: Like in the previous question, but could you print the count of interactions
and the REST URL examples as well?