groovy & grails - lecture 4

Groovy: Efficiency Oriented Programming
Lecture 4
Master Proteomics & Bioinformatics - University of Geneva
Alexandre Masselot - summer 2011

Contents

‣ Eclipse & linux tips
‣ xml parsing
‣ Introduction to Object Oriented Programming

A couple of eclipse tips

‣ team -> local history


- browse all versions of your saved file (by date)


‣ general -> editors -> text editor -> show line number


- line numbers appear in the editor left margin


- <ctrl>-L + number : jump to the given line number


‣ right button -> source -> toggle comment


‣ right button -> source -> format


- indent & format the selected code
- use in conjunction with <ctrl>-A to select the whole file

A couple of linux (shell) tips

‣ less filename : more power than ’more filename’
- use arrows to move back and forth in the file
- g to jump to top; G to jump to the end; / text to search next occurrence
of a text in the file (n to jump to the following occurrence); h to have
help; q to quit


‣ wget -O my.fasta http://www.uniprot.org/uniprot/P41159.fasta
- to save a url target into a local file
‣ don’t forget to get command documentation with man

XML: introduction (cont’d)

‣ Let's consider an element as a record of:
- atomic number (int)
- symbol (string)
- name (string)
- list of isotopes (int number, double mass and double abundance)


‣ Goal : store and retrieve a list of element characteristics
‣ Old-fashioned way, in a text file
6;Carbon;C;12.011;0,12,98.90:1,13.003355,1.10
...
‣ Pros: compact
‣ Cons: cryptic, hard to add new info or multiple types of
info into one file, problems adding free text etc.

XML: introduction - text storage

6;Carbon;C;12.011;0,12,98.90:1,13.003355,1.10
...
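
To see why such a line is cryptic, here is a minimal Groovy sketch that reads one record. The field order (atomic number; name; symbol; average mass; isotopes) and the isotope layout (plus, mass, abundance) are assumptions read off the example line: nothing in the file itself states them.

```groovy
// One record in the semicolon-separated layout (field order is an assumption)
String line = '6;Carbon;C;12.011;0,12,98.90:1,13.003355,1.10'

// The meaning of each field is purely positional
def fields = line.split(';')
def element = [
    atomicNumber: fields[0].toInteger(),
    name        : fields[1],
    symbol      : fields[2],
    mass        : fields[3].toDouble(),
    // isotopes are ':'-separated, each one assumed to be "plus,mass,abundance"
    isotopes    : fields[4].split(':').collect { iso ->
        def parts = iso.split(',')
        [plus: parts[0].toInteger(), mass: parts[1].toDouble(), abundance: parts[2].toDouble()]
    }
]

println "${element.name} has ${element.isotopes.size()} isotopes"
```

Swapping two columns in the file would silently produce wrong values, which is the "error prone" point of this format.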

XML: introduction - text storage (cont’d)

‣ Pros:
- compact
- easy to edit
- easy sharing (~)

XML: introduction - text storage (cont’d)

‣ Cons:
- cryptic (what field is what)
- error prone
- hard to add new info
- hard to add multiple types of info into one file
- problem adding free text etc.

XML: introduction - RDBM storage

‣ RDBM is another reflex to store structured data
CREATE TABLE elements (atomicNumber int UNIQUE, name name, symbol name UNIQUE, mass double precision);
CREATE TABLE isotopes (atomicNumberBase int REFERENCES elements (atomicNumber), plus int, abundance double precision, mass double precision);
INSERT INTO elements VALUES (6, 'Carbon', 'C', 12.011);
INSERT INTO isotopes VALUES (6, 0, 98.90, 12);
INSERT INTO isotopes VALUES (6, 1, 1.10, 13.003355);

XML: introduction - RDBM storage (cont’d)

‣ pros:
- constraints (structure)
- highly controlled
- powerful (large storage)

XML: introduction - RDBM storage (cont’d)

‣ cons:
- can be rather heavy to manage (third-party tools or
libraries)
- HSQLDB or Derby offer lighter alternatives to PostgreSQL,
MySQL or Oracle for small databases
- needs expertise in DB (install/management/update)
- a bit overkill for “small” problems

XML: a versatile alternative

‣ eXtensible Markup Language:
<inSilicoDefinitions>
  <elements>
    <oneElement symbol="H" name="Hydrogen" atomicNumber="1">
      <mass monoisotopic="1.007825" average="1.00797594155"/>
      <isotopes>
        <oneIsotope plus="0" mass="1.007825" abundance="99.985"/>
      </isotopes>
    </oneElement>
    <oneElement symbol="C" name="Carbon" atomicNumber="6">
      <isotopes>
      </isotopes>
    </oneElement>
    ....
  </elements>
  <aminoAcids>
    <oneAminoAcid name="Alanine" code3="Ala" code1="A">
    ...
  </aminoAcids>
</inSilicoDefinitions>

XML: structure

‣ Header: declares the character encoding (change it for accents, Japanese characters...)
<?xml version="1.0" encoding="ISO-8859-1"?>

XML: structure

‣ Element: a node in the structured tree of the document
<oneElement>...</oneElement>
<isImportant/> // open/close at once

XML: structure

‣ Content
<author>J.R.R. Tolkien</author>
<title><![CDATA[Laurel & Hardy]]></title>

XML: structure

‣ Attribute
<oneElement name="Carbon" symbol="C">
</oneElement>
‣ Comments
<!-- a comment -->

XML: child elements or attributes?

‣ What can attributes not do?
- no nested structures
- no repeated attributes: <tag name="one" name="two"/> is not allowed
- restricted characters in the value, and no CDATA section

XML: child elements or attributes?

‣ So, when to use attributes instead of elements?
- Attributes could be avoided altogether and only child elements used.
Usually, the main part of the node information should be stored in child
elements. However, attributes have a natural use:
- for short information,
- for information that should be read at the element level (before diving
into its child nodes)

XML: cons

‣ File size
- large tag name for storing only one integer value
- list of data stored at once e.g. peak list in the mzXml-like file (base64)

‣ Random access to data is not native (a data index can be stored)
‣ Hand editing can be tedious (but possible)

XML: what we will not talk about (today...)

‣ XML file generation
- it is a simple text file
- Groovy offers XmlTemplateEngine


‣ XML structure definition
- Schema
- DTD
‣ XSLT for XML structure transformation
‣ namespaces
‣ storing binary data
‣ We will focus on retrieving information from an XML file (or String)

XML: parsing challenges

‣ A common need: retrieve partial information (a substructure, or elements
filtered by criteria)
- a syntax to access information within the tree
- navigate through all children
‣ An XML tree can be large (>4GB for uniprot.xml)
- not possible to load everything at once, then scan through memory
- loop across all substructures (e.g. a uniprot entry), then access
descendants easily and release memory before opening the next entry
‣ Groovy
- XmlParser: access and modify the tree, but everything is in memory. Not
covered here
- XmlSlurper: lazy evaluation, access through iterators, read-only access

XML: XmlSlurper

‣ XmlSlurper parses a source into a structure ready for traversing
def records=new XmlSlurper().parseText(aStringWithXml)
// or, from a file
def records=new XmlSlurper().parse(new File(aStringPath))
‣ The tree is then explored by iterating through NodeChildren
(groovy.util.slurpersupport.NodeChildren), specifying a path &
constraints in the xml tree

XML: XmlSlurper (cont’d)

‣ Path expressions are evaluated lazily: the matching subtrees are visited
one after the other, as the result is iterated
‣ Compared to other XML parsers, it corresponds to a mix between the DOM and
SAX methods
‣ The Perl equivalent would be the module XML::Twig

String xmlText='''
<inSilicoDefinitions>
  <description>
    <source type="program">
      <name>elementator</name>
      <isValidated/>
    </source>
  </description>
  <elements>
    <oneElement symbol="H" name="Hydrogen" atomicNumber="1">
      <isotopes>
      </isotopes>
    </oneElement>
    <oneElement symbol="He" name="Helium" atomicNumber="3">
      <isotopes>
        <oneIsotope plus="0" mass="3.016029" abundance=".00014"/>
      </isotopes>
    </oneElement>
    <oneElement symbol="Li" name="Lithium" atomicNumber="6">
      <isotopes>
        <oneIsotope plus="3" mass="777.016005" abundance="99.999"/>
      </isotopes>
    </oneElement>
  </elements>
</inSilicoDefinitions>
'''

XmlSlurper accessing a single element

‣ Instantiating the slurper
def dataDef = new XmlSlurper().parseText( xmlText )
‣ Single element
<description>
  <source type="program">
    <name>elementator</name>
    <isValidated/>
  </source>
</description>
def source=dataDef.description.source
‣ Reading an attribute
source.@type.text() // -> 'program'
‣ Reading a sub element's contents
source.name.text() // -> 'elementator'
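
The three accesses above can be put together in one small script. This is a sketch under assumptions: the XML is a cut-down copy of the description block, and the explicit import is for Groovy 3 and later (on older versions, drop it, since groovy.util.XmlSlurper is imported by default).

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; not needed (and not available) on older versions

// Cut-down sample document containing only the description block
String xmlText = '''
<inSilicoDefinitions>
  <description>
    <source type="program">
      <name>elementator</name>
      <isValidated/>
    </source>
  </description>
</inSilicoDefinitions>
'''.trim()

// The slurper result stands for the root element, so paths start below it
def dataDef = new XmlSlurper().parseText(xmlText)
def source = dataDef.description.source

String type = source.@type.text()                     // attribute value
String name = source.name.text()                      // content of a child element
boolean validated = source.isValidated.size() == 1    // is the empty tag present?

println "$name ($type), validated: $validated"
```

Note that the path starts at dataDef.description, not dataDef.inSilicoDefinitions.description: the root element is implicit.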

XmlSlurper: selecting a list

‣ Getting all substructures matching a path yields an iterable collection of nodes
dataDef.elements.oneElement // 3 elements
‣ Can be visited with closures
dataDef.elements.oneElement.each{println it.@symbol}
dataDef.elements.oneElement.collect{it.mass.@average.toDouble()}

XmlSlurper: subtree with constraints with findAll

‣ We can add constraints on the elements to be actually walked through
‣ Getting oneElement where the symbol attribute has 2 characters
dataDef.elements.oneElement.findAll{
  it.@symbol.text() ==~ /../
}
‣ Where the mass child has a monoisotopic attribute >6
findAll{ (it.mass.@monoisotopic.text() as BigDecimal) > 6 }
‣ Elements with exactly 2 isotopes.oneIsotope children
findAll{ it.isotopes.oneIsotope.size() == 2 }
‣ Getting the first element and piping closures
dataDef.elements.oneElement[0]
.isotopes.oneIsotope.'@mass'*.toDouble().sum()
‣ http://groovy.codehaus.org/Reading+XML+using+Groovy%27s
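
The filters above can be tried on a small document. The sketch below makes a few assumptions: the sample data is illustrative (two elements, with a mass child added so that every filter has something to match), the variable names are hypothetical, and the import line is for Groovy 3 and later (drop it on older versions).

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; not needed (and not available) on older versions

// Illustrative sample data, same layout as the slides
String xml = '''
<inSilicoDefinitions>
  <elements>
    <oneElement symbol="H" name="Hydrogen" atomicNumber="1">
      <mass monoisotopic="1.007825"/>
      <isotopes>
        <oneIsotope plus="0" mass="1.007825" abundance="99.985"/>
      </isotopes>
    </oneElement>
    <oneElement symbol="He" name="Helium" atomicNumber="2">
      <mass monoisotopic="4.002603"/>
      <isotopes>
        <oneIsotope plus="0" mass="3.016029" abundance="0.00014"/>
        <oneIsotope plus="1" mass="4.002603" abundance="99.99986"/>
      </isotopes>
    </oneElement>
  </elements>
</inSilicoDefinitions>
'''.trim()

def defs = new XmlSlurper().parseText(xml)
def all = defs.elements.oneElement

// symbol attribute made of exactly 2 characters
def twoLetterNames = all.findAll { it.@symbol.text() ==~ /../ }.collect { it.@name.text() }

// numeric constraint on an attribute of a child element
def heavySymbols = all.findAll { (it.mass.@monoisotopic.text() as BigDecimal) > 2 }
                      .collect { it.@symbol.text() }

// constraint on the number of children
def twoIsotopeSymbols = all.findAll { it.isotopes.oneIsotope.size() == 2 }
                           .collect { it.@symbol.text() }

// first element, then spread toDouble() over all its isotope masses and sum them
def firstMassSum = all[0].isotopes.oneIsotope.'@mass'*.toDouble().sum()

println twoLetterNames
println heavySymbols
println twoIsotopeSymbols
println firstMassSum
```

Each findAll returns a filtered view that can itself be chained with collect, each or another findAll.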

Object Oriented Programming : class

‣ Although Groovy can be seen as a scripting language, with a rather loose
syntax (think of dynamic typing with def), it is also fully OO
‣ Based on Java, enforcing strong OO design, but still adding the freedom of
scripting
‣ The basic component of OOP is a Class, a structure containing
- enforced properties, named fields
- actions, named methods

OOP one example: Date

‣ Like Mr Jourdain, you use OOP...
‣ Instance (an object)
- Date d = new Date()
‣ Setter/getter
- println d.month
- d.year = 112 // java.util.Date counts years from 1900: 112 means 2012
‣ Operation
- Date d2 = d - 1

Object: a constraint map with dedicated operators


Class: an introduction

‣ A first example
class Person{
  String name // field of type String
  Date birth // field of type Date

  int age(){ // a method returning an int
    return (new Date()).year - birth.year
  }
}
‣ Then a script uses this class
Person guy=new Person(name:'Joe',
                      birth:new Date('3/12/1980'))

println "$guy.name is ${guy.age()}" // -> Joe is 31 (when run in 2011)

Making a new class

‣ By convention, a class name starts with an upper case letter (whereas
variable names usually start with a lower case letter)

Making a new class

‣ Within a source directory, create a package (a subdirectory structure,
where ‘.’ is a directory path separator), for example
unige.mpb.eop.proteomics.sequence

Making a new class

‣ In eclipse File -> new -> package (<ctrl>-N package)

Making a new class

‣ Within this package, create a class : File -> new -> groovy class and set
name Protein.

Making a new class

‣ A Protein.groovy file is created with
package unige.mpb.eop.proteomics.sequence
class Protein{
...
}

Class: fields

‣ Fields are properties of a class; they can be dynamically or statically typed
(with any existing type: Integer, List, Map, etc.)

Class: fields

‣ Fields are by default associated with an instance of the class (an object
generated by new Person())

Class: fields


‣ They can be changed directly from outside the class (GroovyBean)
guy.name = 'Jimmy'
guy.setName('Jimmy')

Class: fields


‣ Or read
println guy.name
println guy.getName()
‣ Inside a method, the instance properties are directly used by their name
int age(){return (new Date()).year - birth.year}
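
The equivalence between property access and getter/setter calls can be checked with a small sketch. The birthYear field and the age(int currentYear) signature are hypothetical simplifications, used here to avoid depending on the current date.

```groovy
class Person {
    String name      // property: Groovy generates getName() and setName()
    int birthYear    // hypothetical field replacing the Date of the slides

    // inside a method, fields are used directly by their name
    int age(int currentYear) { currentYear - birthYear }
}

def guy = new Person(name: 'Joe', birthYear: 1980)

guy.name = 'Jimmy'                 // property syntax, calls setName() behind the scenes
String viaGetter = guy.getName()   // explicit getter sees the same value

guy.setName('Jim')                 // explicit setter
String viaProperty = guy.name      // property syntax sees the same value

println "$viaGetter, $viaProperty, ${guy.age(2011)}"
```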

Class: static fields

‣ Fields can also be static i.e. a global value, shared by all instances of the
class (by convention upper case).


‣ If we come back to our previous Person example, we want
static int AGE_LIMIT = 18
boolean canDrive(){return age() >= AGE_LIMIT}
‣ We can change this value at once for all Person instances
Person.AGE_LIMIT = 16
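
A minimal sketch of the shared static value follows. To keep it short, age is assumed here to be a plain int field rather than the method computed from a birth date.

```groovy
class Person {
    static int AGE_LIMIT = 18    // one value shared by every Person instance
    int age                      // hypothetical plain field, for brevity

    boolean canDrive() { age >= AGE_LIMIT }
}

def teen = new Person(age: 17)
boolean before = teen.canDrive()   // evaluated with AGE_LIMIT == 18

Person.AGE_LIMIT = 16              // changed once, on the class itself
boolean after = teen.canDrive()    // the existing instance sees the new limit

println "before: $before, after: $after"
```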

Constructor

‣ Thanks to the scripting spirit, there is no need to declare a constructor (the
function that instantiates an object)

Constructor

‣ Fields can be filled with a map structure
Person girl=new Person(name:'Marylyn')

Constructor

‣ It is not compulsory to set all the fields
‣ To set a default value for a field, you can define it in the property declaration
String name='John Doe'
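
The map constructor and default values can be combined, as in this short sketch (the city field is a hypothetical addition, only there to show an unset field).

```groovy
class Person {
    String name = 'John Doe'   // default value, used when name is not given
    String city                // hypothetical field, no default: stays null if not set
}

def anonymous = new Person()              // no field set at all
def girl = new Person(name: 'Marylyn')    // only some fields set, through a map

println "${anonymous.name} / ${girl.name} / ${girl.city}"
```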
