  • Volume 3, Issue 3 pythonpapers.org
  • Journal Information

The Python Papers
ISSN: 1834-3147

Editors
Co-Editors-in-Chief: Maurice Ling, Tennessee Leeuwenburg
Associate Editors: Guilherme Polo, Guy Kloss, Richard Jones, Sarah Mount, Stephanie Chong

Referencing Information
Articles from this edition of this journal may be referenced as follows:
Author, "Title" (2008) The Python Papers, Volume N, Issue M, Article Number

Copyright Information
© Copyright 2007 The Python Papers and the individual authors. This work is copyright under the Creative Commons 2.5 license subject to Attribution, Noncommercial and Share-Alike conditions. The full legal code may be found at http://creativecommons.org/licenses/by-nc-sa/2.1/au/

The Python Papers was first published in 2006 in Melbourne, Australia.

Referees
An academic peer-review was performed on all academic articles in accordance with The Python Papers Anthology Editorial Policy. The reviewers will be acknowledged individually, but their identities will not be released in order to ensure anonymity.

Focus and Scope
* Python User Groups and Special Interest Group introductions
* Technical aspects of the Python language
* Code reviews and book reviews
* Descriptions of new Python modules and libraries
* Solutions to specific problems in Python
* Consolidated summaries of current discussion in Python mailing lists or other fora
* Companies and organisations using Python
* Applications developed in Python (such as held in the Python Cheese Shop)
In short, we are soliciting submissions where Python is an integral part of the answer.
  • The Python Papers Anthology Editorial Policy

0. Preamble
The Python Papers Anthology is the umbrella entity referring to The Python Papers (ISSN 1834-3147), The Python Papers Monograph (ISSN under application) and The Python Papers Source Codes (ISSN under application), under a common editorial committee (hereafter known as 'editorial board'). It aims to be a platform for disseminating industrial/trade and academic knowledge about Python technologies and their applications. The Python Papers is intended to be both an industrial journal and an academic journal, in the sense that the editorial board welcomes submissions relating to all aspects of the Python programming language, its tools and libraries, and community, of both academic and industrial inclinations. The Python Papers aims to be a publication for the Python community at large. In order to cater for this, The Python Papers seeks to publish submissions under two main streams: the industrial stream (technically reviewed) and the academic stream (peer-reviewed). The Python Papers Monograph provides a refereed format for publication of monograph-length reports including dissertations, conference proceedings, case studies, advanced-level lectures, and similar material of theoretical or empirical importance. All volumes published under The Python Papers Monograph will be peer-reviewed, and external reviewers may be named in the publication. The Python Papers Source Codes provides a refereed format for publication of software and source codes, which are usually associated with papers published in The Python Papers and The Python Papers Monograph. All publications made under The Python Papers Source Codes will be peer-reviewed. This policy statement seeks to clarify the processes of technical review and peer-review in The Python Papers Anthology.

1. Composition and roles of the editorial board
The editorial board is headed by the Editor-in-Chief or Co-Editors-in-Chief (hereafter known as "EIC"), assisted by Associate Editors (hereafter known as "AE") and Editorial Reviewers (hereafter known as "ER"). The EIC is the chair of the editorial board and, together with the AEs, manages the strategic and routine operations of the periodicals. ER is a tier of editors deemed to have in-depth expert knowledge in specialized areas. As members of the editorial board, ERs are accorded editorial status but are generally not involved in the strategic and routine operations of the periodicals, although their expert opinions may be sought at the discretion of the EIC.

2. Right of submission author(s) to choose streams
The submission author(s), that is, the author(s) of the article or code or any submission in any other form deemed by the editorial board as being suitable, reserves the right to choose whether he/she wants his/her submission to be in the industrial stream, where it will be technically reviewed, or in the academic stream, where it will be peer-reviewed. It is also the onus of the submission author(s) to nominate the stream. The editorial board defaults all submissions to the industrial stream (technical review) in the event of non-nomination by the submission author(s), but the editorial board reserves the right to place such submissions into the academic stream if it deems fit. The editorial board also reserves the right to place submissions nominated for the academic stream in the technical stream if it deems fit.
  • 3. Right of submission author(s) to nominate potential reviewers
The submission author(s) can exercise the right to nominate up to 4 potential reviewers (hereafter known as "external reviewers") for his/her submission if the submission author(s) choose to be peer-reviewed. When this right is exercised, the submission author(s) must declare any prior relationships or conflicts of interest with the nominated potential reviewers. The final decision to accept the nominated reviewer(s) rests with the Chief Reviewer (see section 5 for further information on the role of the Chief Reviewer).

4. Right of submission author(s) to exclude potential reviewers
The submission author(s) can exercise the right to recommend excluding any reasonable number of potential reviewers for his/her submission. When this right is exercised, the submission author(s) must indicate the grounds on which such exclusion should be recommended. Decisions for the editorial board to accept or reject such exclusions will be based solely on the grounds indicated by the submission author(s).

5. Peer-review process
Upon receiving a submission for peer-review, the Editor-in-Chief (hereafter known as "EIC") may choose to reject the submission, or the EIC will nominate a Chief Reviewer (hereafter known as "CR") from the editorial board to chair the peer-review process of that submission. The EIC can nominate himself/herself as CR for the submission. The CR will send out the submission to TWO or more external reviewers to be reviewed. The CR reserves the right not to call upon the nominated potential reviewers and/or to call upon any of the reviewers nominated for exclusion by the submission author(s). The CR may also concurrently send the submission to one or more Associate Editors (hereafter known as "AE") for review. Hence, a submission in the academic stream will be reviewed by at least three persons: the CR and two external reviewers. Typically, a submission may be reviewed by three to four persons: the EIC as CR, an AE, and two external reviewers. There is no upper limit to the number of reviews of a submission. Upon receiving the reviews from the external reviewer(s) and/or AE(s), the CR decides on one of the following options: accept without revision, accept with revision, or reject; and notifies the submission author(s) of the decision on behalf of the EIC. If the decision is "accept with revision", the CR will provide a deadline to the submission author(s) for revisions to be done, and will automatically accept the revised submission if the CR deems that all revisions were done; however, the CR reserves the right to reject the original submission if the revisions were not carried out by the deadline stipulated by the CR. If the decision is "reject", the submission author(s) may choose to revise for future re-submission. Decision(s) by the CR or EIC are final.

6. Technical review process
Upon receiving a submission for technical review, the Editor-in-Chief (hereafter known as "EIC") may choose to reject the submission, or the EIC will nominate a Chief Reviewer (hereafter known as "CR") from the editorial board to chair the review process of that submission. The EIC can nominate himself/herself as CR for the submission. The CR may decide to accept or reject the submission after reviewing, or may seek another AE's opinion before reaching a decision. The CR will notify the submission author(s) of the decision on behalf of the EIC. Decision(s) by the CR or EIC are final.
7. Main difference between peer-review and technical review
The processes of peer-review and technical review are similar, the main difference being that in the peer-review process the submission is reviewed both internally by the editorial board and externally by external reviewers (nominated by the submission author(s) and/or by the EIC/CR), whereas in a technical review process the submission is reviewed by the editorial board alone. The editorial board retains the right to additionally undertake an external review if it is deemed necessary.
  • 8. Umbrella philosophy
The Python Papers Anthology editorial board firmly believes that all good (technical and/or scholarly/academic) submissions should be published when appropriate, and that the editorial board is integral to refining all submissions. The board believes in giving good advice to all submission author(s) regardless of the final decision to accept or reject, and hopes that advice to rejected submissions will assist in their revisions.

The Python Papers Editorial Statement on Open Access
The Python Papers Anthology has received a number of inquiries relating to the republishing of articles from the journal, especially in the context of open-access repositories. Each issue of The Python Papers Anthology is released under a Creative Commons 2.5 license, subject to Attribution, Non-commercial and Share-Alike clauses. This, in short, provides carte blanche for republishing articles, so long as the source of the article is fully attributed, the article is not used for commercial purposes, and the article is republished under this same license. Creative Commons permits both republishing in full and the incorporation of portions of The Python Papers in other works. A portion may be an article, quotation or image. This means (a) that content may be freely re-used and (b) that other works using The Python Papers Anthology content must be available under the same Creative Commons license. The remainder of this article addresses some of the details that might be of interest to anyone who wishes to include issues or articles in a database, website, hard copy collection or any other alternative access mechanism. The full legal code of the license may be found at http://creativecommons.org/licenses/by-nc-sa/2.1/au/ The full open access policy can be found at http://ojs.pythonpapers.org/index.php/tpp/about/editorialPolicies
  • Editorial
Maurice Ling

Hi Everyone,

Welcome to the latest issue of The Python Papers. First and foremost, we would like to show our appreciation for all the contributions we received during the year, which made us what we are today. Of course, we will not forget all our supporters and readers as well, for all your valuable comments. In 2008 (Volume 3), we published a total of 7 industrial and 7 academic articles, as well as 2 columns from our regular columnist, Ian Ozsvald, in his ShowMeDo Updates. Thank you for all your support, and we look forward to your continued encouragement.

Starting in 2009, all the serials under The Python Papers Anthology will take on a new publishing scheme. We will be releasing each article to the public as it is accepted, but each issue will still be delimited by our usual "issue release" date. The "issue release" date is then our cutoff deadline to prepare the one-PDF-per-issue file. This means that we will be serving new articles to everyone much faster than now, and there will no longer be a meaningful publication schedule. We have also changed our policy from "Review Policy" to "Editorial Policy" to reflect the changes in the editorial team. We are currently in the process of appointing Editorial Reviewers (ER for short). Editorial Reviewers are members of the editorial committee who are deemed to have in-depth expert knowledge in specialized areas.

Let us look forward to a great year ahead of more Python development and a recovering economy. Happy reading.
  • Editorial: Python at the Crossroads

My favourite T-shirt glimpsed at Pycon UK 2008 was ... "Python programming as Guido indented it".

Apart from the two keynote speeches, it was a happy and fascinating event. It was my first Python-only conference, and what a pleasure to be able to choose from four streams - web, GUI, testing and the language itself. My previous conferences, all in Australia, were open source events with wider scope and had just a single Python stream among Perl, PHP, Ruby and so on. The quality of speakers was uniformly excellent and the organisation was first rate. We can be sure that EuroPython 2009, hosted by the same team next year in Birmingham, will definitely be worth attending.

The two keynotes, by Mark Shuttleworth, CEO of Canonical, and Ted Leung, Python Evangelist at Sun, both highlighted Python at the crossroads. Fascinating, but not particularly light-hearted. Mixing and matching what they said, their combined story is ...

1. Python has critical mass and will continue to grow. The speed of growth is another question.
2. Django has reached the important milestone of version 1.0 and should therefore compete with Ruby-on-Rails for newcomers to the Python language itself.
3. Intel and Sun are currently selling multi-core CPUs - 16 and 128 cores respectively. Expect massively multi-core machines in future.
4. Future growth in language popularity will be tied to multi-threading on multi-core CPUs. Haskell is one language expecting a multi-core growth kick.
5. Python's Global Interpreter Lock effectively prevents the language from exploiting current state-of-the-art multi-core computers.

Where does this leave beautiful Python? The point was made that a language is chosen for being appropriate to the purpose of a project. Where this happens for multi-core performance reasons and Python is rejected, that is growth for another language and a permanent loss to Python.

The Python Papers Editorial Team hopes that many of the Pycon UK session papers will be published in these pages. Please get in touch if you would like to submit an article for academic or technical review. Visit http://ojs.pythonpapers.org to submit an article or paper. In view of the crossroads highlighted for Python at Pycon in September, articles with a focus on multi-threading for multi-core computers would seem to be valuable for the language itself. The Python Papers is keen to see the language succeed and has very talented reviewers ready to help authors get their articles published.
  • Got something to contribute? Please get in touch ... Mike Dewhirst
  • ShowMeDo Update - November
Ian Ozsvald

In the last issue of The Python Papers I wrote a long article about how ShowMeDo helps you to learn more about Python. Since then we've added another 40 Python videos, taking us to almost 380 in total. Including all the open-source topics we cover, we have over 800 tutorial videos for you. Much of the content is free, contributed by our great authors and ourselves. Some of the content is in the Club, which is for paying members - currently the Club focuses purely on Python tutorials for new and intermediate Python programmers. An update on the Club videos follows later.

We were interviewed in October by Ron Stephens of Python411; you'll find the interview and all of Ron's other great Python podcasts on his site: http://www.awaretek.com/python/

Contributing to ShowMeDo:
Would you like to share your knowledge with thousands of Python viewers every month? Contributing to ShowMeDo is easy; you'll find guides and links to screencasting software here: http://showmedo.com/addVideoInstructions To get an idea of what is popular with our viewers, see how the videos rank here: http://showmedo.com/mostPopular Remember that everything is previewed by us before publishing. You may have to wait a few days before your video is published, but you'll be safe in the knowledge that your content sits alongside other vetted content. We are very keen to help you share your knowledge with our Pythonistas, especially if you want to spread awareness of the tools you like to use. Do get in contact in our forum; our authors are a friendly and very helpful crowd: http://groups.google.com/group/showmedo

Free Screencasts:

Django: We've had a lot of new Django content recently, mostly from Eric Holscher and ericflo. Eric and Eric have produced an amazing 21 new screencasts to help you learn Django.
Django From the Ground Up (13 videos), ericflo - http://showmedo.com/videos/series?name=PPN7NA155
Setting Up a Django Development Environment (3 videos), ericflo - http://showmedo.com/videos/series?name=LY7fNbpc1
Debugging Django (4 videos), Eric Holscher - http://showmedo.com/videos/series?name=RjHhY85GD
  • Django Command Extensions, Eric Holscher - http://showmedo.com/videos/series?name=3eB8j5P3b

To commemorate the launch of Django v1 I produced a 1-minute quick intro, with backing music by the great Django Reinhardt, to help raise awareness of the team's great effort:
Django In Under A Minute, Ian Ozsvald - http://showmedo.com/videos/video?name=3240000&fromSeriesID=324

Python Coding: Florian, a longer-term ShowMeDo author, has created two series which introduce decorators and teach you how to do unit-testing.
Advanced Python (3 videos), Florian Mayer - http://showmedo.com/videos/series?name=D42HbAhqD
Unit-testing with Python (2 videos), Florian Mayer - http://showmedo.com/videos/series?name=TUeY7z7GD

Python Tools: We also have videos on the Python Bug Tracker, the Round-up Issue Tracker, using VIM with Python, and another in a set explaining how to use Python inside Resolver Systems' 'Excel-beating' Resolver One spreadsheet.
Searching the Python Bug Tracker, A.M. Kuchling - http://showmedo.com/videos/video?name=3110000&fromSeriesID=311
An Introduction to the Round-up Issue Tracker, Tonu Mikk - http://showmedo.com/videos/video?name=3610000&fromSeriesID=361
An Introduction to Vim Macros (7 videos), Justin Lilly - http://showmedo.com/videos/series?name=0oSagogCe
Putting Python objects in the spreadsheet grid in Resolver One, Resolver Systems - http://showmedo.com/videos/video?name=3520000&fromSeriesID=352

Club ShowMeDo:
In the Club we continue to create more specialist tutorials for new and intermediate Python programmers. Membership of the Club can either be bought for a year's access or gained free for life if you author a video for us. You'll find details of the 115 Python videos for Club members here: http://showmedo.com/club
  • Lucas Holland has joined us as a Club author, having authored many free videos inside ShowMeDo. In this 9-part series he introduces the Python Standard Library:
Batteries included - The Python standard library (9 videos), Lucas Holland - http://showmedo.com/videos/series?name=o9MBQ758M

I have created two new series which walk you through loops, iteration and functions:
Python Beginners - Loops and Iteration (7 videos), Ian Ozsvald - http://showmedo.com/videos/series?name=tIZs1K8h4
Python Beginners - Functions (6 videos), Ian Ozsvald - http://showmedo.com/videos/series?name=4oReffvYq
  • Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research

Maurice HT Ling 1,2 (mauriceling@acm.org)
Christophe Lefevre 1,3,4 (Chris.Lefevre@med.monash.edu.au)
Kevin R Nicholas 1,4 (kevin.nicholas@deakin.edu.au)

1 CRC for Innovative Dairy Products, Department of Zoology, The University of Melbourne, Australia
2 School of Chemical and Life Sciences, Singapore Polytechnic, Singapore
3 Victorian Bioinformatics Consortium, Monash University, Australia
4 Institute of Technology Research and Innovation, Deakin University, Australia

Abstract

Background
Recent studies have demonstrated that the cyclical nature of mouse lactation [1] can be mirrored at the transcriptome [2] level of the mammary glands, but making sense of microarray [3] results requires the analysis of large amounts of biological information, which is increasingly difficult to access as the amount of literature increases. Extraction of protein-protein interactions from text by statistical and natural language processing has been shown to be useful in managing the literature. Correlation of gene expression across a series of samples is a simple method to analyze microarray data, as it has been found that genes that are related in function exhibit similar expression profiles [4]. Microarrays have been used to examine the transcriptome of mouse lactation, and these studies found that the cyclic nature of the lactation cycle as observed histologically is reflected at the transcription level. However, there has been no study to date using text mining to sieve microarray analysis to generate new hypotheses for further research in the field of lactational biology.

[1] Lactation is the process of milk production.
[2] Transcriptome is the set of genes that are active in a given cell at any one time.
[3] Microarray is a multiplex technology used in molecular biology to measure the activity of a set of genes at any one time.
[4] A gene expression profile is the trend of activity for all the genes across different time points or conditions.

Results
Our results demonstrated that a previously reported protein name co-occurrence method (5-mention PubGene), which was not based on a hypothesis testing framework, is generally more stringent than the 99th percentile of the Poisson distribution-based method of calculating co-occurrence. It agrees with previous methods using natural language processing to extract protein-protein interactions from text, as more than 96% of the interactions found by natural language processing methods coincide with the results from the 5-mention PubGene method. However, less than 2% of
  • the gene co-expressions analyzed by microarray were found by direct co-occurrence or interaction information extraction from the literature. At the same time, by combining microarray and literature analyses, we derived a novel set of 7 potential functional protein-protein interactions that had not been previously described in the literature.

Conclusions
We conclude that the 5-mention PubGene method is more stringent than the 99th percentile of the Poisson distribution method for extracting protein-protein interactions by co-occurrence of entity names, and that literature analysis may be a potential filter for microarray analysis to isolate potentially novel hypotheses for further research.

1. Background
Microarray technology is a transcriptome analysis tool which has been used in the study of the mouse lactation cycle (Clarkson and Watson, 2003; Rudolph et al., 2007). A number of advances in microarray analysis have been made recently, for example, inferring the underlying genetic network from microarray results (Rawool and Venkatesh, 2007; Maraziotis et al., 2007) by statistical correlation of gene expression across a series of samples (Reverter et al., 2005), then deriving functional network clusters by mapping onto Gene Ontology (Beissbarth, 2006). It has been shown that functionally related genes demonstrate similar expression profiles (Reverter et al., 2005). These methods have been used to study functional gene sets for basal cell carcinoma (O'Driscoll et al., 2006).

The amount of information in published form is increasing exponentially, making it difficult for researchers to keep abreast of the relevant literature (Hunter and Cohen, 2006). At the same time, there has been no study to demonstrate that the current status of knowledge of protein-protein interactions in the literature is useful for increasing the understanding of microarray data. The two major streams for biomedical protein-protein information extraction are natural language processing (NLP) and co-occurrence statistics (Cohen and Hersh, 2005; Jensen et al., 2006). The main reason for the concurrent existence of these two methods is their complementary effect in terms of information extraction (Jensen et al., 2006). NLP has a lower recall or sensitivity than co-occurrence but tends to be more precise compared with co-occurrence statistical methods (Wren and Garner, 2004; Jensen et al., 2006). Mathematically, precision is the number of true positives divided by the total number of items labeled by the system as positive (the number of true positives divided by the sum of true and false positives), whereas recall is the number of true positives identified by the system divided by the number of actual positives (the number of true positives divided by the sum of true positives and false negatives). A number of tools have approached protein-protein interaction extraction from the NLP perspective; these include GENIES (Friedman et al., 2001), MedScan (Novichkova et al., 2003), PreBIND (Donaldson et al., 2003), BioRAT (David et al., 2004), GIS (Chiang et al., 2004), CONAN (Malik et al., 2006), and Muscorian (Ling et al., 2007). Muscorian (Ling et al., 2007) achieved at least 82% precision and 30% recall (sensitivity). NLP methods make use of the grammatical forms of words and the structure of a valid sentence to identify the grammatical role of each word in a sentence, parse the sentence into phrases, and extract information such as subject-verb-object structures from these phrases.
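To make these definitions concrete, the following sketch computes precision, recall and the F-score discussed below from raw counts. The function and the example counts are ours, for illustration only; they are not from the paper.

    def precision_recall_fscore(true_pos, false_pos, false_neg):
        """Precision, recall and F-score from raw counts.

        precision = TP / (TP + FP)
        recall    = TP / (TP + FN)
        F-score   = harmonic mean of precision and recall
        """
        precision = true_pos / float(true_pos + false_pos)
        recall = true_pos / float(true_pos + false_neg)
        fscore = (2 * precision * recall) / (precision + recall)
        return (precision, recall, fscore)

    # A hypothetical system with 60% precision and total (100%) recall,
    # as discussed for the 1-Mention PubGene method below, has an
    # F-score of 0.75.
    print(precision_recall_fscore(60, 40, 0))  # (0.6, 1.0, 0.75)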
Co-occurrence, a statistical method, is based on the thesis that multiple occurrences of the same pair of entities suggest that the pair of
  • entities are related in some way, and that the likelihood of such relatedness increases with higher co-occurrence. In other words, co-occurrence methods tend to view the text as a bag of un-sequenced words. Hence, depending on the threshold allowed, which will translate to the precision of the entire system, recall could be total, as implied in PubGene (Jenssen et al., 2001). PubGene (Jenssen et al., 2001) defined interactions by co-occurrence in the simplest and widest possible form, assigning an interaction between 2 proteins if these 2 proteins appear in the same article just once in the entire library of 10 million articles, and found that this criterion has 60% precision (the 1-Mention PubGene method). Although it was not stated in the article (Jenssen et al., 2001), it is obvious that such a criterion would yield 100% recall or sensitivity, giving an F-score of 0.75. The F-score is defined as the harmonic mean of precision and recall, attributing equal weight to both precision and recall. However, 60% precision is usually unsatisfactory for most applications. PubGene (Jenssen et al., 2001) also defined a "5-Mention" method, which requires 5 or more articles with the 2 protein names to assign an interaction, with 72% precision. It is generally accepted that precision and recall are inversely related; hence, it can be expected that the "5-Mention" method will not be 100% sensitive. However, PubGene was benchmarked against the Database of Interacting Proteins and OMIM, making it more difficult to appreciate the statistical basis of the "1-Mention" and "5-Mention" methods as compared to using a hypothesis testing framework as in Chen et al. (2008). In addition, PubGene is unable to extract the nature of interactions, for example, binding or inhibiting interactions. On the other hand, NLP is designed to extract the nature of interactions (Malik et al., 2006; Ling et al., 2007); hence, it can be expected that NLP results may be used to annotate co-occurrence results. CoPub Mapper used a more sophisticated information measure which took into account the distribution of entity names in the text database (Alako et al., 2005). Although Alako et al. (2005) demonstrated that CoPub Mapper's information measure correlates well with microarray co-expression, the information measure was not used as a decision criterion for deciding which pairs of co-occurrences were positive results (personal communication, Guido Jenster, 2006). This is unlike the 1-Mention PubGene method, where all co-occurrences were taken as positive results, and the 5-Mention PubGene method, which requires at least 5 counts of co-occurrence before attributing the co-occurrence as a positive result. Chen et al. (2008) used chi-square to test co-occurrence statistically to mine disease-drug interactions from clinical notes and published literature. Another possible way to calculate co-occurrence is a direct use of the Poisson distribution, on the assumption that the co-occurrence of 2 protein names is a rare chance event with respect to the entire library. The Poisson distribution is a discrete distribution similar to the Binomial distribution but is used for rare events, for example, to estimate the probability of accidents on a given stretch of road in a day. The Poisson distribution is easier to use than the Binomial distribution as it only requires the mean and does not require a standard deviation.
Based on PubGene, the statistical assumption of Poisson distribution-based statistics requiring rare events (in this case, that the co-occurrence of 2 protein names in a collection of text is statistically rare) can generally be held (Jenssen et al., 2001). Although combinations of either NLP or co-occurrence with microarray analysis have been used (Li et al., 2007; Gajendran et al., 2007; Hsu et al., 2007), neither method had been used in microarray analysis for advancing lactational biology. This study
  • attempts to examine the relation between the PubGene and Poisson distribution methods of calculating co-occurrence and to explore the use of NLP-based protein-protein interaction extraction results to annotate co-occurrence results. This study also examines the use of co-occurrence analysis on 4 publicly available microarray data sets on the mouse lactation cycle (Master et al., 2002; Clarkson and Watson, 2003; Stein et al., 2004; Rudolph et al., 2007) as a novel hypothesis discovery tool. Master et al. (2002) used 13 microarrays to discover the presence of brown adipose tissue in the mouse mammary fat pad and its role in thermoregulation; Clarkson and Watson (2003) used 24 microarrays and characterized inflammation response genes during involution; Stein et al. (2004) used 51 microarrays and discovered a set of 145 genes that are up-regulated in early involution, of which 49 encoded for immunoglobulins; and Rudolph et al. (2007) used 29 microarrays to study lipid synthesis in the mouse mammary gland following diets of various fat content and found that genes encoding for nutrient transporters into the cell are up-regulated following increased food intake. More importantly, each of the 4 studies independently demonstrated that the cyclical nature of mammary gland development, as observed histologically and biochemically, is reflected at the transcriptome level, suggesting that microarray is a suitable tool to study the regulation of mouse lactation. It should be noted that even though each of these microarray experiments was designed for a different purpose, the principle that co-expressed genes are more functionally correlated than functionally unrelated genes remains, as demonstrated by Reverter et al. (2005). Our results demonstrate that the 5-mention PubGene method is generally statistically more significant than the 99th percentile of the Poisson distribution method of calculating co-occurrence. Our results showed that 96% of the interactions extracted by NLP methods (Ling et al., 2007) overlapped with the results from the 5-mention PubGene method. However, less than 2% of the microarray correlations were found in the co-occurrence graph extracted by the 1-mention PubGene method. Using co-occurrence results to filter microarray co-expression correlations, we have discovered a potentially novel set of 7 protein-protein interactions that had not been previously described in the literature.

2. Methods

2.1. Microarray Datasets
The 4 microarray datasets are from Master et al. (2002) using the Affymetrix Mouse Chip Mu6500 and FVB mice, Clarkson and Watson (2003) using the Affymetrix U74Av2 chip and C57/BL6 mice, Rudolph et al. (2007) using the Affymetrix U74Av2 chip and FVB mice, and Stein et al. (2004) using the Affymetrix U74Av2 chip and Balb/C mice.

2.2. Co-Occurrence Calculations
Using a pre-defined list of 3563 protein names, which was derived by Ling et al. (2007) from the Affymetrix Mouse Chip Mu6500 microarray probeset, PubGene established 2 measures of binary co-occurrence (Jenssen et al., 2001): the 1-mention method and the 5-mention method. In the 1-mention method, the appearance of 2 entity names in the same abstract is deemed a positive outcome, whereas the 5-mention method requires the appearance of the 2 entity names in at least 5 abstracts before being considered positive.
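A minimal sketch of this counting scheme follows. It is our own illustration rather than the actual PubGene implementation; in particular, simple substring matching stands in for the pattern matching described in the next section, and the corpus is represented as a plain list of abstract strings.

    from itertools import combinations

    def cooccurrence_counts(abstracts, names):
        """For every pair of entity names, count the number of
        abstracts in which both names appear."""
        counts = {}
        for abstract in abstracts:
            present = sorted(n for n in names if n in abstract)
            for pair in combinations(present, 2):
                counts[pair] = counts.get(pair, 0) + 1
        return counts

    def positive_pairs(counts, threshold):
        """threshold=1 gives the 1-mention method; threshold=5 gives
        the 5-mention method."""
        return set(pair for pair, n in counts.items() if n >= threshold)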
  • For co-occurrence modelled on the Poisson distribution (Poisson co-occurrence), the number of abstracts in which both entity names appear is assumed to be rare, as it only requires the appearance of 2 entity names within 5 articles in a collection of 10 million articles to give a precision of 0.72 (Jenssen et al., 2001). The relative occurrence frequency of each of the 2 entities was calculated separately as the quotient of the number of abstracts in which the entity name appeared and the total number of abstracts in the corpus. The product of the relative occurrence frequencies of the 2 entities can be taken as the mean expected probability of the 2 entities appearing in the same abstract if they are not related, which, when multiplied by the total number of abstracts, can be taken as the mean number of occurrences (lambda) of the Poisson distribution. For example, if proteinA and proteinB are found in 1000 abstracts each and there are 1 million abstracts, the relative occurrence frequency will be 0.001 each and the mean number of occurrences will be 1 (0.001^2 x 1,000,000). This means that we expect 1 abstract in a collection of 1 million to contain both proteinA and proteinB if they are not related. A positive result is where the number of abstracts in which both entities in question appear is on or above the 95th (one-tail P < 0.05) or 99th (one-tail P < 0.01) percentile of the Poisson distribution. In both co-occurrence calculations, entity (protein) names in text are recognized by pattern matching, as used in Ling et al. (2007).

2.3. Comparing Co-Occurrence and Text Processing
Two sets of comparisons were performed: within the different forms of co-occurrence, and between co-occurrence and text processing methods. The first set of comparisons aims to evaluate the differences between the 3 co-occurrence methods described above. PubGene's 1-mention and 5-mention methods were correlated singly and in combination with the Poisson co-occurrence methods. Given that the nodes (N) of a co-occurrence network represent the entities and the links or edges (E) between nodes represent a co-occurrence under the method used, the entire co-occurrence graph is G = {N, E}, that is, a set of nodes and a set of edges. In addition, given that the same set of entities was used (the same set of nodes), the difference between the 2 graphs resulting from 2 co-occurrence methods can simply be denoted as the difference between the 2 sets of edges (the subtraction of one set of edges from another set of edges). In practice, a total space model is used. A graph of total possible co-occurrence is one where each node is "linked", or co-occurs, with every node, including loops (an edge to itself). Thus, a graph of total possible co-occurrence has 3563 nodes and 12694969 (3563^2) edges. We define a graph, G*, as the undirected graph of total possible co-occurrence without parallel edges and excluding loops. G* has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2] edges. The output graph of each co-occurrence method is reduced to the number of edges it contains. As it can be assumed that the graph from the 1-mention PubGene method represents the most liberal co-occurrence graph (G_PG1), the resulting graph from any other more sophisticated method (G_i, where i denotes the co-occurrence method) will be a proper subset of G_PG1 and certainly of G*.
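The Poisson criterion above can be sketched as follows. The use of scipy's poisson.ppf to obtain the percentile threshold is our implementation choice for illustration, not something specified by the paper.

    from scipy.stats import poisson

    def poisson_positive(count_a, count_b, count_both, total_abstracts,
                         percentile=0.99):
        """Test whether an observed co-occurrence count is a positive
        result under the Poisson model described above.

        lambda is the expected number of abstracts mentioning both
        names if the names are unrelated:
        (count_a / N) * (count_b / N) * N.
        """
        lam = (float(count_a) / total_abstracts) * \
              (float(count_b) / total_abstracts) * total_abstracts
        # Positive if the observed count is on or above the chosen
        # percentile of Poisson(lambda).
        return count_both >= poisson.ppf(percentile, lam)

    # Worked example from the text: two proteins appearing in 1000 of
    # 1,000,000 abstracts each give lambda = 0.001 * 0.001 * 1000000 = 1.
    print(poisson_positive(1000, 1000, 5, 10**6))  # True

The edge-set comparison of section 2.3 then reduces to plain set operations, e.g. edges_pg1 - edges_poisson for the difference between two graphs over the same set of nodes.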
  • The second set of comparisons aims at correlating co-occurrence techniques and natural language processing techniques for extracting interactions between two entities, such as two proteins. In this comparison, the extracted protein-protein binding and activation interactions, extracted using Muscorian on 860000 published abstracts using "mouse" as the keyword as previously described (Ling et al., 2007), were used to compare against the co-occurrence networks of 1-Mention PubGene and 5-Mention PubGene by graph edge overlap as described above. Briefly, Muscorian (Ling et al., 2007) normalized protein names within abstracts by converting the names into abbreviations before processing the abbreviated abstracts into a table of subject-verb-objects. Protein-protein interaction extraction was carried out by matching each of the 12694969 (3563^2) pairs of protein names and a verb, namely "activate" or "bind", in the extracted table of subject-verb-objects.

2.4. Mapping Co-Expression Networks onto Text-Mined Networks
A co-expression network was generated from each of the 4 in vivo data sets by pair-wise calculation of Pearson's coefficient on the intensity values across the dataset, where a coefficient of more than 0.75 or less than -0.75 signifies the presence of a co-expression between the pair of signals on the microarray (Reverter et al., 2005). The co-expression network generated from Master et al. (2002) and an intersected co-expression network, generated by intersecting all 4 networks, were used to map onto the 1-PubGene and NLP-mined networks. For the co-expression network generated from Master et al. (2002), a 0.01 coefficient unit incremental stepwise mapping to the 1-PubGene co-occurrence network was performed from 0.75 to 1 to analyze for an optimal correlation coefficient, in order to derive a set of correlations between genes that are likely not to have been studied before (not found in the 1-PubGene co-occurrence network).

3. Results

3.1. Comparing Co-Occurrence Calculation Methods
Using 3563 transcript names, there is a total of 6345703 possible pairs of interactions; 927648 (14.6%) were found using the 1-Mention PubGene method and 431173 (6.80%) were found using the 5-Mention PubGene method. The Poisson co-occurrence method, using either the 95th or the 99th percentile threshold, found 927648 co-occurrences, which is the same set as that found using the 1-Mention PubGene method. The mean number of co-occurrences, which is used as the mean of the Poisson distribution, is calculated as the product of the probabilities of occurrence of each of the entity names in the database. Using a database of 100 thousand abstracts as an example, if 500 abstracts contained the term "insulin" (500 abstracts in 100 thousand, or 0.5%) and 200 abstracts contained the term "MAP kinase" (200 abstracts in 100 thousand, or 0.2%), then the expected probability of co-occurrence in a single abstract is 0.001% (0.5% x 0.2%), giving a mean number of co-occurrences (lambda in the Poisson distribution) of 1 across the 100 thousand abstracts. The range of the mean number of co-occurrences for the 6345703 pairs of entities was from zero to 0.59, with a mean of 0.000031. For example, if the mean is 3.1 x 10^-5, then the probability of an abstract mentioning 2 proteins not related in any functional way is 4.8 x 10^-10, or virtually zero, in 6.3 million possible interactions. These results are summarized in Table 1.
  •                                              Number of Clone-Pairs    % of Full Combination
    Full Combination (G*) [1]                    6345703                  100.00
    1-Mention PubGene                            927648                   14.62
    5-Mention PubGene                            431173                   6.80
    Poisson Co-occurrence at 95th percentile     927648 [2]               14.62
    Poisson Co-occurrence at 99th percentile     927648 [2]               14.62

Table 1 - Summary results of co-occurrence using PubGene or the Poisson distribution
[1] The undirected graph of total possible co-occurrence (3563^2) without parallel edges and excluding self-edges, which has 3563 nodes and 6345703 [3563 x (3563 - 1) / 2] edges.
[2] Same set as 1-Mention PubGene.

3.2. Comparison of Natural Language Processing and Co-Occurrence
Natural language processing (NLP) techniques were used to extract protein-protein binding interactions and protein-protein activation interactions from almost 860000 abstracts, as described in Ling et al. (2007). A total of 9803 unique binding interactions and 11365 unique activation interactions were identified, of which 2958 were both binding and activation interactions. Of the 9803 binding interactions, 9661 interactions concurred with the 1-Mention PubGene method (98.55%) and 9465 interactions with the 5-Mention PubGene method (96.54%). Of the 11365 activation interactions, 11280 interactions and 11111 interactions concurred with the 1-Mention PubGene method (99.25%) and the 5-Mention PubGene method (97.77%) respectively. Hence, of the 927648 interactions found using the 1-Mention PubGene method, 1.04% (n = 9661) were binding interactions and 1.22% (n = 11280) were activation interactions. Furthermore, of the 431173 interactions found using the 5-Mention PubGene method, 2.20% (n = 9465) of the interactions were binding interactions and 2.58% (n = 11111) were activation interactions. Combining binding and activation interactions (n = 18120), 1.96% of the 1-Mention PubGene co-occurrence graph and 3.85% of the 5-Mention PubGene co-occurrence graph were annotated respectively.

3.3. Mapping Co-Expression Networks onto Text-Mined Networks
Using Pearson's correlation coefficient to signify the presence of a co-expression between a pair of spots (genes) on the Master et al. (2002) data set, there are 210283 correlations between -1.00 and -0.75 or between 0.75 and 1.00, of which 2014 (0.96% of correlations) are found in the 1-PubGene co-occurrence network, 342 (0.16% of correlations) are found in the activation network extracted by natural language processing means, and 407 (0.19% of correlations) are found in the binding network extracted by natural language processing means.
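A sketch of the co-expression network construction of section 2.4, using numpy, is given below. The (genes x samples) matrix shape, the function name, and the threshold parameter are illustrative assumptions of ours, not the authors' code.

    import numpy as np

    def coexpression_edges(intensities, threshold=0.75):
        """Build a co-expression edge set from a (genes x samples)
        intensity matrix: an edge links two genes whose expression
        profiles have a Pearson coefficient above the threshold or
        below its negative."""
        r = np.corrcoef(intensities)  # genes x genes correlation matrix
        edges = set()
        n = r.shape[0]
        for i in range(n):
            for j in range(i + 1, n):
                if abs(r[i, j]) > threshold:
                    edges.add((i, j))
        return edges

    # The overlap with a text-mined network is then a set intersection:
    # coexpression_edges(data) & pubgene_edges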
  • From incremental correlation mapping onto the 1-PubGene network (tabulated in Table 2 and graphed in Figure 1), there is a decline in the number of correlations from 208269 (correlation coefficient of 0.75) to 7 (correlation coefficient of 1.00). The percentage of overlap between co-occurrence and co-expression rose linearly from a correlation coefficient of 0.75 to 0.85 (r = 0.959), while that from a correlation coefficient of 0.86 to 0.92 was less correlated (r = 0.223). The 7 pairs of correlations in the Master et al. (2002) data set with a correlation coefficient of 1.00 are: lactotransferrin (Mm.282359) and solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2 (Mm.4114); B-cell translocation gene 3 (Mm.2823) and UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 1 (Mm.15622); gamma-glutamyltransferase 1 (Mm.4559) and programmed cell death 4 (Mm.1605); FK506 binding protein 11 (Mm.30729) and signal recognition particle 9 (Mm.303071); FK506 binding protein 11 (Mm.30729) and Ras-related protein Rab-18 (Mm.132802); casein gamma (Mm.4908) and casein alpha (Mm.295878); and G protein-coupled receptor 83 (Mm.4672) and recombination activating gene 1 activating protein 1 (Mm.17958). The amount of overlap between microarray correlations and 1-Mention PubGene co-occurrence increased steadily from 0.96% at a correlation coefficient of 0.75 to 1.057% at a correlation coefficient of 0.87. Mapping an intersect of the co-expression networks of all 4 in vivo data sets (Master et al., 2002; Clarkson and Watson, 2003; Stein et al., 2004; Rudolph et al., 2007), there are 1140 correlations, of which 14 (1.23%) are found in the 1-PubGene co-occurrence network, none of which corresponds to the interactions found in the activation or binding networks extracted by natural language processing means (Ling et al., 2007).

[Figure 1 - Percentage of the correlation network analyzed from Master et al. (2002) found in 1-Mention PubGene co-occurrence. The plot shows the percent of correlations found in 1-Mention PubGene against the minimum correlation coefficient.]
  • Minimum        Number of Correlations     Number of Correlations    Percentage of
    Correlation    in Master et al. (2002)    found in 1-PubGene        Correlations Found
    0.75           210283                     2014                      0.958
    0.76           207593                     1983                      0.964
    0.77           181383                     1735                      0.966
    0.78           157622                     1495                      0.958
    0.79           136152                     1316                      0.976
    0.80           116775                     1141                      0.987
    0.81           99276                      970                       0.987
    0.82           83802                      823                       0.988
    0.83           70019                      692                       0.998
    0.84           57872                      575                       1.004
    0.85           47453                      472                       1.005
    0.86           38228                      373                       0.985
    0.87           30347                      314                       1.046
    0.88           23740                      234                       0.995
    0.89           18137                      178                       0.991
    0.90           13435                      138                       1.038
    0.91           9797                       96                        0.990
    0.92           6849                       70                        1.034
    0.93           4580                       40                        0.881
    0.94           2919                       28                        0.969
    0.95           1742                       14                        0.984
    0.96           970                        7                         0.727
    0.97           472                        4                         0.855
    0.98           197                        2                         1.026
    0.99           60                         0                         0.000
    1.00           7                          0                         0.000

Table 2 - Summary of incremental stepwise mapping of correlation coefficients from Master et al. (2002) onto the 1-PubGene co-occurrence network

4. Discussion
Comparing the difference between the PubGene (Jenssen et al., 2001) and Poisson modelling methods for co-occurrence calculations, three observations can be made. Firstly, one of the common criticisms of a simple co-occurrence method as used in this study (co-occurrence of terms without considering the number of words between
  • these terms) is that, given a large number of articles or documents, every term will co-occur with every term at least once, leading to total possible co-occurrence (100%, or 12694969 in this case). Our results showed that 7.31% of the total possible co-occurrences were actually found using about 860000 abstracts, and only 3.40% using a more stringent method. PubGene (Jenssen et al., 2001) also suggested that total possible co-occurrence was not evident with a much larger set of articles (10 million), and yet achieved 60% precision using only one instance of co-occurrence in 10 million articles (1-Mention PubGene) and 72% precision with 5-Mention PubGene. It can be expected that with more instances of co-occurrence, precision may be higher. This might be due to the sparse distribution of entity names in the set of text, as observed from the low mean number of co-occurrences used for Poisson distribution modeling. At the same time, PubGene (Jenssen et al., 2001) also illustrated that entity name recognition by simple pattern matching is able to yield quality results. Using only results from PubGene (Jenssen et al., 2001), it can be concluded that total possible co-occurrence is unlikely for a corpus size of up to 10 million (more than half of the current PubMed). Using the Poisson distribution, the mean number of co-occurrences can be expected to decrease with a larger corpus than that used in this study, as it is a product of the relative frequencies of each of the 2 entities. This suggests that as the size of the corpus increases, each co-occurrence of terms is likely to be more significant, which in turn suggests that a statistical measure might be more useful in a very large corpus of more than 10 million documents as it takes into account both frequencies and corpus size. Secondly, the Poisson co-occurrence methods at both the 95th and 99th percentiles yield the same set of results as the 1-Mention PubGene method, which is expected as the maximum mean number of co-occurrences is 0.59. This implies that every co-occurrence found is essentially statistically significant in a corpus of about 860000 abstracts, thus providing a statistical basis for the "1-Mention PubGene" method. This might be due to the nature of abstracts, which are known to be concise. Proteins that have no relation to each other are generally unlikely to be mentioned in the same abstract, and abstracts tend to mention only crucial findings. However, the same might not apply if full-text articles are used - unrelated proteins could be mentioned solely for illustrative purposes. Thirdly, the number of co-occurrences found using the 5-Mention PubGene method is substantially lower (less than half) than that found by the 1-Mention PubGene method, which was also shown in Jenssen et al. (2001). This suggests that 5-Mention PubGene is appreciably more stringent than using Poisson co-occurrence at the 99th percentile, thus providing a statistical basis for the "5-Mention PubGene" method. Our results comparing the numbers of co-occurrences demonstrated a 50.79% decrease in co-occurrences from the 1-Mention PubGene network to the 5-Mention PubGene network. However, the 5-Mention PubGene network retained most of the "activation" (98.5%) and "binding" (98.0%) interactions found in the 1-Mention PubGene network. This might be a consequence of the 30% recall of the NLP methods (Ling et al., 2007), as it would usually require 3 or more mentions for an interaction to have a reasonable chance of being identified by NLP methods.
This might also be due to the observation that the 5-Mention PubGene method is more precise, in terms of accuracy, than the 1-PubGene method, as shown in Jenssen et al. (2001).
  • The probability of a true interaction (Ling et al., 2007) existing in each of the 9661 NLP-extracted binding interactions that are also found in 1-Mention PubGene co-occurrence would be raised, and the probability of a true interaction existing in each of the 9465 NLP-extracted binding interactions that are also found in 5-Mention PubGene co-occurrence would be higher still. Hence, combining NLP and statistical co-occurrence techniques can improve the overall confidence of finding true interactions. However, it should be noted that the statistical co-occurrence used in this work cannot raise the confidence of NLP-extracted interactions. Nevertheless, these results also suggest that graphs of statistical co-occurrence could be annotated with information from NLP methods to indicate the nature of such interactions. In this study, 2 types of NLP-extracted interactions from Ling et al. (2007), "binding" and "activation", were combined. The combined "binding" and "activation" network covered 1.96% and 3.85% of the 1-Mention and 5-Mention PubGene co-occurrence graphs respectively. Our results demonstrate that the combined network has a higher coverage than the individual "binding" or "activation" networks. Thus, it is reasonable to expect that with more forms of interactions, such as degradation and phosphorylation, extracted with the same NLP techniques, the co-occurrence graph annotation would be more complete. By overlapping the co-expression network analyzed from the Master et al. (2002) data set onto the 1-Mention PubGene co-occurrence network, our results demonstrated that about 99% of the co-expressions were not found in the co-occurrence network. This might suggest that the choice of a Pearson's correlation coefficient threshold of more than 0.75 or less than -0.75, as suggested by Reverter et al. (2005), is likely to be sensitive in isolating functionally related genes from microarray data, at the cost of reduced specificity. Our results from the incremental stepwise analysis showed that the percentage of overlap between co-expression and co-occurrence rose linearly for correlation coefficients from 0.75 to 0.85. This suggests that a correlation coefficient of 0.85 may be optimal for this data set, as it is likely that using a correlation coefficient of 0.85 will result in fewer false positives than a correlation coefficient of 0.75. At the same time, increasing the correlation coefficient from 0.75 to 0.85 resulted in 77.4% fewer interaction correlations (47453 correlations from 210283). Using this method to further describe protein-protein interactions and to generate new hypotheses, it can be argued that a correlation coefficient of 0.85 will result in fewer false positives. While this deduction is likely, as a more stringent criterion tends to reduce the rate of false positives, it is difficult to prove experimentally without exhaustive examination of each result. Nevertheless, the results suggest the possibility of using the inverse linearity between the correlation coefficient and the number of gene co-expressions as a preliminary visual assessment to gauge an optimal correlation coefficient for a particular data set. However, at the extreme end, correlation coefficients of 0.99 and 1.00 yielded 60 and 7 correlations respectively in the Master et al. (2002) data set, but none was found in the 1-Mention PubGene co-occurrence network. This suggests that high-throughput genomic techniques, such as microarrays, present a vast amount of un-mined biological information that has not been examined experimentally.
By exploring the literature for the biological significance of each of the 7 pairs of perfectly co-expressed genes using Swanson's method (Swanson, 1990), it was found
  • that all 7 pairs were biologically significant. Lactotransferrin (Ishii et al., 2007) and solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2 (Feral et al., 2005) are involved in cell adhesion. B-cell translocation gene 3 (Guehenneux et al., 1997) and UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 1 (Mori et al., 2004) are involved in cell cycle control. Casein gamma and casein alpha are well-established components of milk. Gamma-glutamyltransferase 1 (Huseby et al., 2003) and programmed cell death 4 (Frankel et al., 2008) are known to regulate apoptotic pathways. Rab18 (Vazquez-Martinez et al., 2007), signal recognition particle 9 (Egea et al., 2004) and FK506 binding protein 11 (Dybkaer et al., 2007) are known to be involved in the secretory pathway. G protein-coupled receptor 83 (Lu et al., 2007) and recombination activating gene 1 activating protein 1 (Igarashi et al., 2001) are known to be involved in T-cell function. Taken together, these suggest that the set of 7 correlations has likely not been described before and may provide valuable new hypotheses in the study of mouse mammary physiology. It is also plausible that this argument can be extended to the set of 53 highly co-expressed genes (0.99 < correlation coefficient < 1.00).

Intersecting the 4 in vivo data sets into a co-expression network increases the power of the analysis, as it represents correlations among gene expression that are more than 0.75 or less than -0.75 in all 4 data sets. There were 1140 examples of co-expression in this intersect, and only 14 co-expressions (1.23%) were found in the 1-Mention PubGene co-occurrence network, but none in either the binding or activation networks extracted by natural language processing. This suggests that these 14 co-expressions are neither binding nor activating interactions. Textpresso (Muller et al., 2004) defined a total of 36 molecular associations between 2 proteins, which include binding and activation. Future work will expand NLP mining to the 34 other interactions to improve the annotation of co-occurrence networks. Reverter et al. (2005) had previously analysed 5 microarray data sets by expression correlation and demonstrated that genes of related functions exhibit similar expression profiles across different experimental conditions. Our results suggest that 1126 co-expressed gene pairs across the 4 microarray data sets are not found in the co-occurrence network. This may be a valuable new set of information in the study of mouse mammary physiology, as these pairs of genes have not been previously mentioned in the same publication, and experimental examination of these potential interactions is needed to understand the biological significance of these co-expressions.

5. Conclusions
We conclude that the 5-mention PubGene method is more stringent than the 99th percentile of the Poisson distribution method. In this study, we demonstrate the use of a liberal co-occurrence-based literature analysis (the 1-Mention PubGene method) to represent the state of research knowledge in functional protein-protein interactions, as a sieve to isolate potentially novel hypotheses from microarray co-expression analyses for further research.
Authors' contributions

ML, CL and KRN contributed equally to the design of experiments and analysis of results. ML carried out the experiments.

References

1. Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005, 6(1):51.
2. Beissbarth T: Interpreting experimental results using gene ontologies. Methods in Enzymology 2006, 411:340-352.
3. Chen ES, Hripcsak G, Xu H, Markatou M, Friedman C: Automated Acquisition of Disease Drug Knowledge from Biomedical and Clinical Documents: An Initial Study. Journal of the American Medical Informatics Association 2008, 15(1):87-98.
4. Chiang J-H, Yu H-C, Hsu H-J: GIS: a biomedical text-mining system for gene information discovery. Bioinformatics 2004, 20(1):120.
5. Clarkson RWE, Watson CJ: Microarray analysis of the involution switch. Journal of Mammary Gland Biology and Neoplasia 2003, 8(3):309-319.
6. Cohen AM, Hersh WR: A survey of current work in biomedical text mining. Briefings in Bioinformatics 2005, 6(1):57-71.
7. David PAC, Bernard FB, William BL, David TJ: BioRAT: extracting biological information from full-length papers. Bioinformatics 2004, 20(17):3206.
8. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K et al: PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 2003, 4:11.
9. Dybkaer K, Iqbal J, Zhou G, Geng H, Xiao L, Schmitz A, d'Amore F, Chan WC: Genome wide transcriptional analysis of resting and IL2 activated human natural killer cells: gene expression signatures indicative of novel molecular signaling pathways. BMC Genomics 2007, 8:230.
10. Egea PF, Shan SO, Napetschnig J, Savage DF, Walter P, Stroud RM: Substrate twinning activates the signal recognition particle and its receptor. Nature 2004, 427(6971):215-221.
11. Feral CC, Nishiya N, Fenczik CA, Stuhlmann H, Slepak M, Ginsberg MH: CD98hc (SLC3A2) mediates integrin signaling. Proceedings of the National Academy of Sciences U S A 2005, 102(2):355-360.
12. Frankel LB, Christoffersen NR, Jacobsen A, Lindow M, Krogh A, Lund AH: Programmed cell death 4 (PDCD4) is an important functional target of the microRNA miR-21 in breast cancer cells. Journal of Biological Chemistry 2008, 283(2):1026-1033.
13. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A: GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001, 17(Suppl. 1):S74-S82.
14. Gajendran VK, Lin JR, Fyhrie DP: An application of bioinformatics and text mining to the discovery of novel genes related to bone biology. Bone 2007, 40(5):1378-1388.
15. Guehenneux F, Duret L, Callanan MB, Bouhas R, Hayette S, Berthet C, Samarut C, Rimokh R, Birot AM, Wang Q et al: Cloning of the mouse BTG3 gene and definition of a new gene family (the BTG family) involved in the negative control of the cell cycle. Leukemia 1997, 11(3):370-375.
16. Hsu CN, Lai JM, Liu CH, Tseng HH, Lin CY, Lin KT, Yeh HH, Sung TY, Hsu WL, Su LJ et al: Detection of the inferred interaction network in hepatocellular carcinoma from EHCO (Encyclopedia of Hepatocellular Carcinoma genes Online). BMC Bioinformatics 2007, 8:66.
17. Hunter L, Cohen KB: Biomedical language processing: what's beyond PubMed? Molecular Cell 2006, 21(5):589-594.
18. Huseby NE, Asare N, Wetting S, Mikkelsen IM, Mortensen B, Sveinbjornsson B, Wellman M: Nitric oxide exposure of CC531 rat colon carcinoma cells induces gamma-glutamyltransferase which may counteract glutathione depletion and cell death. Free Radical Research 2003, 37(1):99-107.
19. Igarashi H, Kuwata N, Kiyota K, Sumita K, Suda T, Ono S, Bauer SR, Sakaguchi N: Localization of recombination activating gene 1/green fluorescent protein (RAG1/GFP) expression in secondary lymphoid organs after immunization with T-dependent antigens in rag1/gfp knockin mice. Blood 2001, 97(9):2680-2687.
20. Ishii T, Ishimori H, Mori K, Uto T, Fukuda K, Urashima T, Nishimura M: Bovine lactoferrin stimulates anchorage-independent cell growth via membrane-associated chondroitin sulfate and heparan sulfate proteoglycans in PC12 cells. Journal of Pharmacological Science 2007, 104(4):366-373.
21. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nature Review Genetics 2006, 7(2):119-129.
22. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 2001, 28(1):21-28.
23. Li X, Chen H, Huang Z, Su H, Martinez JD: Global mapping of gene/protein interactions in PubMed abstracts: A framework and an experiment with P53 interactions. Journal of Biomedical Informatics 2007.
24. Ling MH, Lefevre C, Nicholas KR, Lin F: Re-construction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. In: Second IAPR Workshop on Pattern Recognition in Bioinformatics (PRIB 2007). Singapore: Springer-Verlag; 2007.
25. Lu LF, Gavin MA, Rasmussen JP, Rudensky AY: G protein-coupled receptor 83 is dispensable for the development and function of regulatory T cells. Molecular Cell Biology 2007, 27(23):8065-8072.
26. Malik R, Franke L, Siebes A: Combination of text-mining algorithms increases the performance. Bioinformatics 2006, 22(17):2151-2157.
27. Master SR, Hartman JL, D'Cruz CM, Moody SE, Keiper EA, Ha SI, Cox JD, Belka GK, Chodosh LA: Functional microarray analysis of mammary organogenesis reveals a developmental role in adaptive thermogenesis. Molecular Endocrinology 2002, 16(6):1185-1203.
28. Maraziotis IA, Dragomir A, Bezerianos A: Gene networks reconstruction and time-series prediction from microarray data using recurrent neural fuzzy networks. IET Systems Biology 2007, 1(1):41-50.
29. Mori R, Kondo T, Nishie T, Ohshima T, Asano M: Impairment of skin wound healing in beta-1,4-galactosyltransferase-deficient mice with
reduced leukocyte recruitment. American Journal of Pathology 2004, 164(4):1303-1314.
30. Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2004, 2(11):e309.
31. Novichkova S, Egorov S, Daraselia N: MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics 2003, 19:1699-1706.
32. O'Driscoll L, McMorrow J, Doolan P, McKiernan E, Mehta JP, Ryan E, Gammell P, Joyce H, O'Donovan N, Walsh N et al: Investigation of the molecular profile of basal cell carcinoma using whole genome microarrays. Molecular Cancer 2006, 5:74.
33. Rawool SB, Venkatesh KV: Steady state approach to model gene regulatory networks - Simulation of microarray experiments. Biosystems 2007.
34. Reverter A, Barris W, Moreno-Sanchez N, McWilliam S, Wang YH, Harper GS, Lehnert SA, Dalrymple BP: Construction of gene interaction and regulatory networks in bovine skeletal muscle from expression data. Australian Journal of Experimental Agriculture 2005, 45:821-829.
35. Rudolph MC, McManaman JL, Phang T, Russell T, Kominsky DJ, Serkova NJ, Stein T, Anderson SM, Neville MC: Metabolic regulation in the lactating mammary gland: a lipid synthesizing machine. Physiological Genomics 2007, 28:323-336.
36. Stein T, Morris J, Davies C, Weber-Hall S, Duffy M-A, Heath V, Bell A, Ferrier R, Sandilands G, Gusterson B: Involution of the mouse mammary gland is associated with an immune cascade and an acute-phase response, involving LBP, CD14 and STAT3. Breast Cancer Research 2004, 6(2):R75-R91.
37. Swanson DR: Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association 1990, 78(1):29-37.
38. Vazquez-Martinez R, Cruz-Garcia D, Duran-Prado M, Peinado JR, Castano JP, Malagon MM: Rab18 inhibits secretory activity in neuroendocrine cells by interacting with secretory granules. Traffic 2007, 8(7):867-882.
39. Wren JD, Garner HR: Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 2004, 20(2):191-198.
Appendix A – Use of Python in this work

Python was used throughout this study, and the code was incorporated into Muscorian (Ling et al., 2007). The following code snippets demonstrate the calculation of the Poisson distribution and the intersection of the Master et al. (2002) and 1-Mention PubGene results, as shown in Figure 1 and Table 2. Given that muscopedia.dbcursor is the database cursor and pmc_abstract is the table containing the abstracts, the Poisson distribution model for each pair of entity (gene or protein) names is constructed by the function commandJobCloneOccurrencePoisson:

import math

class Poisson:
    mean = 0.0

    def __init__(self, lamb = 0.0):
        self.mean = lamb

    def factorial(self, m):
        value = 1
        if m != 0:
            while m != 1:
                value = value * m
                m = m - 1
        return value

    def PDF(self, x):
        # Poisson probability mass function: exp(-mean) * mean^x / x!
        return math.exp(-self.mean) * pow(self.mean, x) / self.factorial(x)

    def inverseCDF(self, prob):
        # Return the smallest x whose cumulative probability exceeds prob,
        # together with that cumulative probability.
        cprob = 0.0
        x = 0
        while cprob < prob:
            cprob = cprob + self.PDF(x)
            x = x + 1
        return (x, cprob)

def commandJobCloneOccurrencePoisson(self):
    poisson = Poisson()
    muscopedia.dbcursor.execute('select count(pmid) from pmc_abstract')
    abstractcount = float(muscopedia.dbcursor.fetchall()[0][0])
    muscopedia.dbcursor.execute('select jclone, occurrence from jclone_occurrence')
    dataset = [[clone[0].strip(), clone[1]]
               for clone in muscopedia.dbcursor.fetchall()]
    muscopedia.dbcursor.execute("delete from jclone_occur_stat")
    count = 0
    for subj in dataset:
        for obj in dataset:
            # Expected per-abstract co-mention rate, assuming the two
            # entities occur independently of each other.
            mean = (float(subj[1]) / abstractcount) * \
                   (float(obj[1]) / abstractcount)
            poisson.mean = mean
            (poi95, prob) = poisson.inverseCDF(0.95)
            (poi99, prob) = poisson.inverseCDF(0.99)
            count = count + 1
            sqlstmt = ("insert into jclone_occur_stat (clone1, clone2, "
                       "randomoccur, poisson95, poisson99) values "
                       "('%s','%s','%.6f','%s','%s')" %
                       (str(subj[0]), str(obj[0]), mean,
                        str(poi95), str(poi99)))
            try:
                muscopedia.dbcursor.execute(sqlstmt)
            except IOError:
                pass
            if (count % 1000) == 0:
                muscopedia.dbconnect.commit()

Each pair of entities was searched in each abstract using SQL statements, such as "select count(pmid) from pmc_abstract where text containing 'insulin' and 'MAPK'", and the number of abstracts found was matched against the jclone_occur_stat table for statistical significance based on the calculated Poisson distribution. The results were exported from muscopedia (Muscorian's database) as a tab-delimited file and analysed using the following code to generate Table 2:

import sets

lc = open('lc_cor.csv', 'r').readlines()
lc = [x[:-1] for x in lc]
lc = [x.split('\t') for x in lc]    # tab-delimited export
d = {}
for x in lc:
    # Keep each gene pair only once, regardless of ordering.
    try:
        t = d[(x[1], x[0])]
    except KeyError:
        d[(x[0], x[1])] = float(x[2])
lc = [(x[0], x[1], d[x]) for x in d]
l = [(x[0], x[1]) for x in d]
l = sets.Set(l)

def process_sif(file):
    # Parse a SIF file with lines of the form "gene1<TAB>pp<TAB>gene2".
    a = open(file, 'r').readlines()
    a = [x[:-1] for x in a]
    a = [x.split('\tpp\t') for x in a]
    return [(x[0], x[1]) for x in a]

a = sets.Set(process_sif('pubgene1.sif'))
print "# intersect of pubgene1.sif and LC data: " + str(len(l.intersection(a)))
print "# LC data not in pubgene1.sif: " + str(len(l.difference(a)))
print "# pubgene1.sif not in LC data: " + str(len(a.difference(l)))
print ""

cor = 0.74
while (cor < 1.0):
    t = [(x[0], x[1]) for x in lc if x[2] > cor]
    l = sets.Set(t)
    cor = cor + 0.01
    print "LC correlation: " + str(cor)
    print "# intersect of pubgene1.sif and LC data: " + str(len(l.intersection(a)))
    print "# LC data not in pubgene1.sif: " + str(len(l.difference(a)))
    print "# pubgene1.sif not in LC data: " + str(len(a.difference(l)))
    print ""
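As a brief, self-contained illustration of the statistical model (the counts below are invented for demonstration and are not taken from the actual corpus), the Poisson class above can be exercised directly:

# Invented counts, for illustration only.
abstractcount = 20000.0   # total abstracts in the corpus
subjcount = 150.0         # abstracts mentioning the first entity
objcount = 90.0           # abstracts mentioning the second entity

# Per-abstract co-mention rate under independence, mirroring the
# computation in commandJobCloneOccurrencePoisson above.
mean = (subjcount / abstractcount) * (objcount / abstractcount)
poisson = Poisson(mean)
(poi99, cprob) = poisson.inverseCDF(0.99)
print "expected random co-occurrence rate: %.6f" % mean
print "99th percentile threshold: %d" % poi99

A pair of entities whose observed number of co-mentioning abstracts exceeds this 99th percentile threshold would then be flagged as statistically significant under the model.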
Appendix B – PubGene algorithm and its main results

The PubGene (Jenssen et al., 2001) algorithm is a count-based algorithm which simply counts the number of abstracts containing both entity names. Using "insulin" and "MAPK" as the pair of entities, the PubGene algorithm can be implemented with the following SQL: "select count(pmid), 'insulin', 'MAPK' from pmc_abstract where text containing 'insulin' and text containing 'MAPK'". 1-Mention PubGene and 5-Mention PubGene networks can be isolated by filtering for count(pmid) greater than zero and greater than four respectively. PubGene (Jenssen et al., 2001) demonstrated that the precision of 1-Mention is 60%, while the precision of 5-Mention is 72%.
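For readers without access to the Muscorian database, the counting idea can also be sketched in plain Python. The following is an illustrative re-implementation on toy data, not the code used in this study:

def pubgene_counts(abstracts, names):
    # For each pair of entity names, count the abstracts in which
    # both names occur (case-insensitive substring match).
    counts = {}
    for text in abstracts:
        lowered = text.lower()
        present = sorted([n for n in names if n.lower() in lowered])
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pair = (present[i], present[j])
                counts[pair] = counts.get(pair, 0) + 1
    return counts

# Toy corpus of two "abstracts".
abstracts = ["Insulin signalling activates MAPK in these cells.",
             "Insulin secretion was measured in islet cells."]
counts = pubgene_counts(abstracts, ['insulin', 'MAPK'])
one_mention = [p for p in counts if counts[p] >= 1]    # 1-Mention network
five_mention = [p for p in counts if counts[p] >= 5]   # 5-Mention network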
The Python Papers, Vol. 3, No. 3 (2008)
Available online at http://ojs.pythonpapers.org/index.php/tpp/issue/view/10

Automatic C Library Wrapping - Ctypes from the Trenches

Guy K. Kloss
Computer Science
Institute of Information & Mathematical Sciences
Massey University at Albany, Auckland, New Zealand
Email: G.Kloss@massey.ac.nz

At some point in time many Python developers, at least in computational science, will face the situation of wanting to interface some natively compiled library from Python. A large variety of tools and technologies for binding native code to Python is available by now. This paper focuses on wrapping shared C libraries, using Python's default Ctypes. Particular attention is given to tools that ease the process (through code generation) and to some best practices. The paper tells a step-by-step story of the wrapping and development process that should be transferable to similar problems.

Keywords: Python, Ctypes, wrapping, automation, code generation.

1 Introduction

One of the grand fundamentals of software engineering is to use the tools that are best suited for a job, and not to decide prematurely on an implementation. That is often easier said than done, in the light of competing requirements (e.g. rapid/easy implementation vs. required speed of execution or low-level access to hardware). The traditional way [1] of binding native code to Python through extending or embedding is quite tedious and requires lots of manual coding in C. This paper presents an approach using the Ctypes package [2], which has been part of Python by default since version 2.5. As an example, the creation of a wrapper for the Little CMS colour management library [3] is outlined. The library offers excellent features and ships with official Python bindings (using SWIG [4]), which unfortunately have several shortcomings (incompleteness, an un-Pythonic API, complexity of use, etc.). So out of need and frustration, the initial steps towards alternative Python bindings were undertaken.

An alternative would have been to fix or improve the bindings using SWIG, or to use one of a variety of binding tools. The field has been limited to tools that are widely in use within the community today, and that promise to be future-proof as
well as not overly complicated to use. These are the contestants, with (very brief) notes on use cases that suit their particular strengths:

• Use Ctypes [2], if you want to wrap pure C code very easily.
• Use Boost.Python [5, 6], if you want to create a more complete API for C++ that also reflects the object-oriented nature of your native code, including inheritance into Python, etc.
• Use Cython [7], if you want to easily speed up and migrate code from Python to speedier native code (mixing is possible!).
• Use SWIG [4], if you want to wrap your code against several dynamic languages.

Of course, wrapper code can be written manually, in this case directly using Ctypes. This paper does not provide a tutorial on how Ctypes is used. The reader should be familiar with this package when attempting to undertake serious library wrapping. The Ctypes tutorial and Ctypes reference on the project web site [2] are an excellent starting point for this.

For extensive libraries and robustness towards an evolving API, code generation has proved to be a better approach than manual editing. Code generators exist for Boost.Python as well as for Ctypes to ease the process of wrapping: Py++ [8] (for Boost.Python) and CtypesLib's h2xml.py and xml2py.py [2]. Three main reasons have influenced the decision to approach this project using Ctypes:

• Ubiquity of the binding approach, as Ctypes is part of the default distribution.
• No compilation of native code to libraries is necessary. Additionally, this relieves one from installing a number of development tools, and the library wrapper can be approached in a platform-independent way.
• The availability of a code generator to automate large portions of the wrapper implementation process, for ease and robustness against changes.

The next section of this paper first introduces a simple C example. This example is later migrated to Python code through the various incarnations of the Python wrapper throughout the paper. Sect. 3 introduces how to access the C library from Python, in this case through code generation. Sect. 4 explains how to refine the generated code to meet the desired functionality of the wrapper. The library is anything but Pythonic, so Sect. 5 explains an object-oriented Façade API for the library that features the qualities we love. This paper only outlines some interesting fundamentals of the wrapper building process. Please refer to the source code for more precise details [9].
2 The Example

The sample code (listing in Fig. 1) aims to convert image data from device-dependent colour information to a standardised colour space. The input profile results from a device-specific characterisation of a Hewlett Packard ScanJet (in the ICC profile HPSJTW.ICM). The output is in the standard-conformant sRGB output colour space, as used by the majority of computer displays; for this, a built-in profile from LittleCMS is used. Input and output are characterised through so-called ICC profiles. For the input, the characterisation is read from a file (line 8), and a built-in output profile is used (line 9). The transformation object is set up using the profiles (lines 11-13), specifying the colour encoding of the in- and output as well as some further parameters not worth discussing here. In the for loop (lines 15-21), the image data is transformed line by line, operating on the number of pixels used per line (necessary, as array rows are often padded).

The goal is to provide a suitable and easy to use API to perform the same task in Python.

3 Code Generation

Wrapping C data types, functions, constants, etc. with Ctypes is not particularly difficult. The tutorial, project web site and documentation on the wiki introduce this concept quite well. But in the presence of an existing larger library, manual wrapping can be tedious and error-prone, as well as hard to keep consistent with the library in case of changes. This is especially true when the library is maintained by someone else. Therefore, it is advisable to generate the wrapper code. Thomas Heller, the author of Ctypes, has implemented a corresponding project, CtypesLib, that includes tools for code generation. The tool chain consists of two parts: the parser (for header files) and the code generator.

3.1 Parsing the Header File

The C header files are parsed by the tool h2xml. In the background it uses GCCXML, a GCC-based compiler that parses the code and generates an XML tree representation. Therefore, usually the same compiler that builds the binary of the library can be used to analyse the sources for the code generation. Alternative parsers often have problems determining a 100% proper interpretation of the code. This is particularly true in the case of C code containing pre-processor macros, which can do massively complex things.
1  #include "lcms.h"
3  int correctColour(void) {
4      cmsHPROFILE inProfile, outProfile;
5      cmsHTRANSFORM myTransform;
6      int i;
8      inProfile = cmsOpenProfileFromFile("HPSJTW.ICM", "r");
9      outProfile = cmsCreate_sRGBProfile();
11     myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
12                                      outProfile, TYPE_RGB_8,
13                                      INTENT_PERCEPTUAL, 0);
15     for (i = 0; i < scanLines; i++) {
16         /* Skipped pointer handling of buffers. */
17         cmsDoTransform(myTransform,
18                        pointerToYourInBuffer,
19                        pointerToYourOutBuffer,
20                        numberOfPixelsPerScanLine);
21     }
23     cmsDeleteTransform(myTransform);
24     cmsCloseProfile(inProfile);
25     cmsCloseProfile(outProfile);
27     return 0;
28 }

Figure 1: Example in C using the LittleCMS library directly.

3.2 Generating the Wrapper

In the next stage, the parse tree in XML format is taken to generate the binding code in Python using Ctypes. This task is performed by the xml2py tool. The generator can be configured in its actions by means of switches passed to it. Of particular interest here are the -k and the -r switches. The former defines the kinds of types to include in the output. In this case the #defines, functions, structure and union definitions are of interest, yielding -kdfs. Note: dependencies are resolved automatically. The -r switch takes a regular expression the generator uses to identify symbols to generate code for. The full argument list is shown in the listing in Fig. 2 (lines 11-15). The generated code is written to a Python module, in this case _lcms. It is made private by convention (leading underscore) to indicate that it is not to be used or modified directly.
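To give a feel for the output, the generated module consists of flat Ctypes declarations. The following is a hand-written approximation for two of the symbols used in Fig. 1, not verbatim xml2py output; the opaque handle types and the hard-coded library path are simplifying assumptions:

from ctypes import CDLL, c_char_p, c_void_p

_lib = CDLL('/usr/lib/liblcms.so.1')   # library loading is revisited in Sect. 4

cmsHPROFILE = c_void_p     # opaque handle, simplified for this sketch
cmsHTRANSFORM = c_void_p

cmsOpenProfileFromFile = _lib.cmsOpenProfileFromFile
cmsOpenProfileFromFile.restype = cmsHPROFILE
cmsOpenProfileFromFile.argtypes = [c_char_p, c_char_p]

cmsCreate_sRGBProfile = _lib.cmsCreate_sRGBProfile
cmsCreate_sRGBProfile.restype = cmsHPROFILE
cmsCreate_sRGBProfile.argtypes = []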
3.3 Automating the Generator

Both h2xml and xml2py are Python scripts. Therefore, the generation process can be automated in a simple generator script. This makes all steps reproducible, documents the settings used, and makes the process robust towards evolutionary (smaller) changes in the C API. A largely simplified version is shown in the listing of Fig. 2.

1  # Skipped declaration of paths.
2  HEADER_FILE = 'lcms.h'
3  header_basename = os.path.splitext(HEADER_FILE)[0]
5  h2xml.main(['h2xml.py', header_path,
6              '-c',
7              '-o',
8              '%s.xml' % header_basename])
10 SYMBOLS = ['cms.*', 'TYPE_.*', 'PT_.*', 'ic.*', 'LPcms.*', ...]
11 xml2py.main(['xml2py.py', '-kdfs',
12              '-l%s' % library_path,
13              '-o', module_path,
14              '-r%s' % '|'.join(SYMBOLS),
15              '%s.xml' % header_basename])

Figure 2: Essential parts of the code generator script.

Generated code should never be edited manually. As some modification is necessary to achieve the desired functionality (see Sect. 4), automation becomes essential to yield reproducible results. Due to some shortcomings of the generated code (see Sect. 4), however, some editing was necessary. This modification has also been integrated into the generator script to fully remove the need for manual editing.

4 Refining the C API

In the current version of Ctypes in Python 2.5 it is not possible to add e.g. __repr__() or __str__() methods to data types. Also, code for loading the shared library in a platform-independent way needs to be patched into the generated code. A function in the code generator reads the whole generated module _lcms and writes it back to the file system, in the course replacing three lines at the beginning of the file with the code snippet from the listing in Fig. 3.

_setup (listing in Fig. 4) monkey patches the class ctypes.Structure to include a __repr__() method (lines 4-10) for ease of use when representing wrapped objects for output; a brief usage illustration follows Fig. 4. (A monkey patch is a way to extend or modify the runtime code of dynamic languages without altering the original source code: http://en.wikipedia.org/wiki/Monkey_patch.) Furthermore, the loading of the shared library (a DLL, in Windows lingo)
  • Automatic C Library Wrapping  Ctypes from the Trenches 6 1 from _setup import * 2 import _setup 4 _libraries = {} 5 _libraries[’/usr/lib/liblcms.so.1’] = _setup._init() Figure 3: Lines to be patched into the generated module _lcms. is abstracted to work in a platform independent way using the system's default search mechanism (lines 1213). 1 import ctypes 2 from ctypes.util import find_library 4 class Structure(ctypes.Structure): 5 def __repr__(self): 6 """Print fields of the object.""" 7 res = [] 8 for field in self._fields_: 9 res.append(’%s=%s’ % (field[0], repr(getattr(self, field[0])))) 10 return ’%s(%s)’ % (self.__class__.__name__, ’, ’.join(res)) 12 def _init(): 13 return ctypes.cdll.LoadLibrary(find_library(’lcms’)) Figure 4: Extract from module _setup.py. 4.1 Creating the Basic Wrapper Further modications are less invasive. For this, the C API is rened into a module c_lcms. This module imports everything from the generated._lcms and overrides or adds certain functionality individually (again through monkey patching). These are intended to make the C API a little bit easier to use through some helper functions, but mainly to make the new bindings more compatible with and similar to the ocial SWIG bindings (packaged together with LittleCMS ). The wrapped C API can be used from Python (see Sect. 4.2). Although, it still requires closing, freeing or deleting from the code after use, and c_lcms objects/structures do not feature methods for operations. This shortcoming will be solved later. 4.2 c lcms Example The wrapped raw C API in Python behaves in exactly the same way, it is just implemented in Python syntax (listing in Fig. 5).
4.1 Creating the Basic Wrapper

Further modifications are less invasive. For these, the C API is refined into a module c_lcms. This module imports everything from the generated _lcms and overrides or adds certain functionality individually (again through monkey patching). These changes are intended to make the C API a little easier to use through some helper functions, but mainly to make the new bindings more compatible with and similar to the official SWIG bindings (packaged together with LittleCMS). The wrapped C API can be used from Python (see Sect. 4.2). However, it still requires explicit closing, freeing or deleting after use, and c_lcms objects/structures do not feature methods for operations. This shortcoming will be solved later.

4.2 c_lcms Example

The wrapped raw C API in Python behaves in exactly the same way; it is just implemented in Python syntax (listing in Fig. 5).

1  from c_lcms import *
3  def correctColour():
4      inProfile = cmsOpenProfileFromFile('HPSJTW.ICM', 'r')
5      outProfile = cmsCreate_sRGBProfile()
7      myTransform = cmsCreateTransform(inProfile, TYPE_RGB_8,
8                                       outProfile, TYPE_RGB_8,
9                                       INTENT_PERCEPTUAL, 0)
11     for line in scanLines:
12         # Skipped handling of buffers.
13         cmsDoTransform(myTransform,
14                        yourInBuffer,
15                        yourOutBuffer,
16                        numberOfPixelsPerScanLine)
18     cmsDeleteTransform(myTransform)
19     cmsCloseProfile(inProfile)
20     cmsCloseProfile(outProfile)

Figure 5: Example using the basic API of the c_lcms module.

5 A Pythonic API

To create the usual pleasant "batteries included" feeling when working with code in Python, another module, littlecms, was manually created, implementing the Façade design pattern. From here on we are moving away from the original C-like API. This high-level object-oriented Façade takes care of the internal handling of tedious and error-prone operations. It also performs sanity checking and automatic detection of certain crucial parameters passed to the C API. This has drastically reduced problems with the low-level nature of the underlying C library.

5.1 littlecms Example

Using littlecms, the API is now object-oriented (listing in Fig. 6), with a doTransform() method on the myTransform object. But there are a few more interesting benefits of this API (a sketch of how such Façade classes might be structured follows Fig. 6):

• Automatic disposal of C API instances hidden inside the Profile and Transform classes.
• Largely reduced code size with an easily comprehensible structure.
• Redundant passing of information (e.g. the in- and output colour spaces) is avoided; it is determined within the Transform constructor from information available in the Profile objects.
• Uses NumPy [10] arrays for convenience in the buffers, rather than introducing further custom types. These data array types and shapes can be matched up automatically.
• The number of pixels for each scan line placed in yourInBuffer can usually be detected automatically.
• Compatibility with the often-used PIL [11] library.
• Several sanity checks prevent clashes of erroneously passed buffer sizes, shapes, types, etc. that would otherwise result in a crashed or hanging process.

1  from littlecms import Profile, PT_RGB, Transform
3  def correctColour():
4      inProfile = Profile('HPSJTW.ICM')
5      outProfile = Profile(colourSpace=PT_RGB)
6      myTransform = Transform(inProfile, outProfile)
8      for line in scanLines:
9          # Skipped handling of buffers.
10         myTransform.doTransform(yourNumpyInBuffer, yourNumpyOutBuffer)

Figure 6: Example using the object oriented API of the littlecms module.
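The Façade classes themselves are not listed in the paper. Purely as an illustrative sketch of the pattern, and not the actual littlecms implementation, Profile and Transform wrappers might look roughly like this (only the c_lcms calls shown in Fig. 5 are taken from the paper; everything else, including the NumPy buffer handling, is an assumption):

import c_lcms

class Profile(object):
    """Facade over a cmsHPROFILE handle, disposed of automatically."""
    def __init__(self, fileName=None, colourSpace=None):
        if fileName is not None:
            self._handle = c_lcms.cmsOpenProfileFromFile(fileName, 'r')
        else:
            # The real module would pick a built-in profile matching
            # colourSpace; only the sRGB case is sketched here.
            self._handle = c_lcms.cmsCreate_sRGBProfile()

    def __del__(self):
        c_lcms.cmsCloseProfile(self._handle)

class Transform(object):
    """Facade over a cmsHTRANSFORM, deriving parameters from profiles."""
    def __init__(self, inProfile, outProfile,
                 intent=c_lcms.INTENT_PERCEPTUAL):
        # Keep references so the profiles outlive the transform.
        self._profiles = (inProfile, outProfile)
        # Buffer formats would be derived from the profiles; TYPE_RGB_8
        # is hard-wired here for brevity.
        self._handle = c_lcms.cmsCreateTransform(
            inProfile._handle, c_lcms.TYPE_RGB_8,
            outProfile._handle, c_lcms.TYPE_RGB_8, intent, 0)

    def doTransform(self, inBuffer, outBuffer):
        # The real method sanity-checks NumPy dtypes and shapes; here the
        # pixel count is simply taken from the input array's first axis.
        c_lcms.cmsDoTransform(self._handle,
                              inBuffer.ctypes.data, outBuffer.ctypes.data,
                              inBuffer.shape[0])

    def __del__(self):
        c_lcms.cmsDeleteTransform(self._handle)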
6 Conclusion

Binding pure C libraries to Python is not very difficult, and the skills can be mastered in a rather short time frame. If done right, these bindings can be quite robust even towards certain changes in the evolving C API, without the need for very time-consuming manual tracking of all changes. As with many projects of this kind, it is vital to be able to automate the mechanical processes. Beyond the code generation outlined in this paper, an important role falls to automated code integrity testing (here: using PyUnit [12]) as well as API documentation (here: using Epydoc [13]).

Unfortunately, as CtypesLib is still work in progress, the whole process did not go as smoothly as described here. It was particularly important to match up working versions properly between GCCXML (which is itself still in development) and CtypesLib. In this case a current GCCXML in version 0.9.0 (as available in Ubuntu Intrepid Ibex, 8.10) required a branch of CtypesLib that needed to be checked out from the developer's Subversion repository. Furthermore, it was necessary to develop a fix for the code generator, as it failed to generate code for #defined floating point constants. The patch has been reported to the author and is now in the source code repository. Also, patching the generated source code to override some features and to manipulate the library loading code can be considered less than elegant.

Library wrapping as described in this paper was performed on version 1.16 of the LittleCMS library. While writing this paper the author moved to the now stable version 1.17. Adapting the Python wrapper to this code base was a matter of about 15 minutes of work. The main task was fixing some unit tests, due to rounding differences resulting from an improved numerical model within the library. The author of LittleCMS recently made a first preview of the upcoming version 2.0 (an almost complete rewrite) available. Adapting to that version took only about a good day of modifications, even though some substantial changes were made to the API. But even in this case only very small amounts of new code had to be written.

Overall, it is foreseeable that this type of library wrapping will become more and more ubiquitous in the Python world as the tools for it mature. But already at the present time one does not have to fear the process. The time spent initially setting up the environment will easily be saved over all project phases and iterations. It will be interesting to see Ctypes evolve to be able to interface to C++ libraries as well. Currently the developers of Ctypes and Py++ (Thomas Heller and Roman Yakovenko) are evaluating potential extensions.

References

[1] Official Python Documentation: Extending and Embedding the Python Interpreter, Python Software Foundation.
[2] T. Heller, "Python Ctypes Project", http://starship.python.net/crew/theller/ctypes/, last accessed December 2008.
[3] M. Maria, "LittleCMS Project", http://littlecms.com/, last accessed December 2008.
[4] D. M. Beazley and W. S. Fulton, "SWIG Project", http://www.swig.org/, last accessed December 2008.
[5] D. Abrahams and R. W. Grosse-Kunstleve, "Building Hybrid Systems with Boost.Python", http://www.boostpro.com/writing/bpl.html, March 2003, last accessed December 2008.
[6] D. Abrahams, "Boost.Python Project", http://www.boost.org/libs/python/, last accessed December 2008.
[7] S. Behnel, R. Bradshaw, and G. Ewing, "Cython Project", http://cython.org/, last accessed December 2008.
[8] R. Yakovenko, "Py++ Project", http://www.language-binding.net/pyplusplus/pyplusplus.html, last accessed December 2008.
[9] G. K. Kloss, "Source Code: Automatic C Library Wrapping - Ctypes from the Trenches", The Python Papers Source Codes [in review], vol. n/a, p. n/a, 2009, [Online available] http://ojs.pythonpapers.org/index.php/tppsc/issue/.
[10] T. Oliphant, "NumPy Project", http://numpy.scipy.org/, last accessed December 2008.
[11] F. Lundh, "Python Imaging Library (PIL) Project", http://www.pythonware.com/products/pil/, last accessed December 2008.
[12] S. Purcell, "PyUnit Project", http://pyunit.sourceforge.net/, last accessed December 2008.
[13] E. Loper, "Epydoc Project", http://epydoc.sourceforge.net/, last accessed December 2008.
Open Source Developers' Conference 2008
by Tennessee Leeuwenburg

My presentation slides:
• Natural language generation for weather forecasting
• Google App Engine + ExtJS

During the first week of December 2008, the Open Source Developers' Conference was held in the Sydney Masonic Centre. I also took the opportunity to speak at the Sydney Python User's Group meeting. The Open Source Developers' Conference was first run in 2004. The Open Source movement is a not-for-profit exercise undertaken by thousands of software developers who are united by a common passion for information technology. The underlying principle of open source development is that we have more to gain from sharing our technology innovation than from guarding it. In his central piece on this idea, "The Cathedral and the Bazaar", Eric S. Raymond describes how Open Source software delivers great value to individuals and organisations. Where software is used not as the basis for competitive advantage, but as enabling infrastructure, we gain far more from the contributions of the open source community than the opportunity cost of being unable to sell our code to others. That is, we benefit far more from the work of others on technologies such as email, filesystems and so forth than we could gain by attempting to build and sell a competitor technology.

My experience as a software developer has tended to reinforce this idea, both in government and in private industry. Often organisations will choose to innovate in a few key areas, where they hold a sustainable competitive advantage through their knowledge and expertise. Outside of those areas, most organisations are more dependent on the technologies of others than they are in control of them. When it comes to the reliability of web servers, email servers, file formats and other software, the organisation is essentially 'bobbing on the tide' of the development efforts of others. However, by using open source software, they gain the ability to make key changes to the source code if they need to, and also benefit from rapid ongoing development of that software occurring outside of the organisation.

The OSDC conference is a regional opportunity for software developers who participate in this process to meet, discuss ideas and learn from one another. Due to its niche interest, it often attracts a very passionate audience and leads to some very fruitful interactions. I attended a number of sessions, the notes for which are publicly available from my blog (myownhat.blogspot.com). I have included a one-paragraph description of each here, linking to each blog post.

1. Google Hackathon: This event was run on location by Google, and presented their toolkits for writing web applications, including OpenSocial, OpenLayers and Google App Engine. These technologies enable developers to freely write applications for social networking and mapping, while App Engine provides a great application hosting environment.

2. Keynote by Chris DiBona, head of licensing at Google: This presentation talked about how Google uses Open Source software. Google has a great deal of commercially valuable code and hardware infrastructure, but makes heavy use of open source software outside of its core business in order to make the business far more efficient. Chris talked about how this is structured, and also why Google releases its code under the Apache license (it has good patent-related clauses which are very useful in the US).

3.
Google App Engine + ExtJS: This was my presentation on using Google App Engine and the ExtJS JavaScript library to develop rich internet applications.
4. Unittest: An under-appreciated gem: This presentation discussed Python's unittest module. Unit testing is an approach to software testing which is popular in the Open Source world, especially within groups following Agile methodologies. This presentation covered how to use unittest, and where to go for some useful extensions. In the GFE we use automated testing, but not unit testing. It would probably be worthwhile looking at how to include unit testing in our test suites.

5. Openspatial software overview: This presentation was a very wide-ranging survey of the latest open source software which can be used to manipulate, render or publish spatial data. It highlighted a conference dedicated to this topic, FOSS4G, which is due to be held next year in Sydney, attracting around 500-700 delegates. This software review would be a good list of research items for anyone wanting to publish spatial data.

6. Bazaar version control: Bazaar is a distributed version control system. Traditionally, software development projects would include a single, authoritative point where all software code would be placed by developers. This can be very clumsy in many situations. Bazaar allows developers to share entire code branches using file archives, so that they can more easily share code changes and suggestions directly with one another. This would allow, for example, two developers to more easily 'compare notes' on a new feature without having to check the code into the main repository. Without any rush, it is probably worth using this technology in the future.

7. The state of Python: A new version of Python has just been released. This presentation described version 2.5.X (the most recent version of the old language), 2.6.X (a version which supports both the new and old language structures) and 3.X, which is a backward-incompatible language revision, supporting many new code structures and syntax elements which should make new programs easier to write and more elegant.

8. Legal issues in Open Source: This presentation really covered legal issues faced by any group software development endeavour, particularly focusing on a startup environment. The advantages of having an incorporated vehicle to hold copyright on the code produced were highlighted. Without a single copyright-owning entity, it is difficult to change a software license. Further, it is important to keep track of all code checkins and commits. If a legal case is brought which has any of the lines of code in contention, it can be very useful to have attribution of each line of code. Fortunately, good version control software will allow this.

9. Natural language generation for weather forecasting: This presentation covered a description of the GFE software, the uses to which it is put within the Bureau of Meteorology, and a description of the automatic text generation logic.

10. Your code sucks and I hate you (code review for human beings): This excellent presentation covered some of the dos and don'ts of code review. While code review is good practice in any software project, it is bread and butter for all open source projects, where access to the code is usually controlled by a core group of developers. It outlined good communication practices for both reviewer and reviewee, as well as some of the social issues that can come up during code review.

11. OLPC in Australia: The One Laptop Per Child program aims to provide inexpensive laptops to very disadvantaged children for the purposes of improving education in these areas.
This is an area where software developers can make a direct contribution using their professional expertise. It is also said that education is the only long-term solution to poverty. Pia Waugh outlined what the OLPC project is doing in Australia, the technology of the OLPC laptop, and how to get involved.

These presentations are just a few of those on offer. In addition, I was able to meet a great many Australian developers, including some who will be lasting contacts. It is clear that there is a reasonable number of open source developers working in various government groups
around the country. Next year, I hope to do more to foster collaboration between us.

IT Collaboration

One of the points raised again and again at OSDC was that code reviews, a staple of Open Source development, are one of the most effective ways for software developers to learn. By having others see your well-written code, and by providing feedback on how to improve less well-written code, it is possible to take part in continual learning. Many organisations do not have a significant culture of code reviews and group learning. There is also a great opportunity here for non-programmers who take part in software development (perhaps as domain experts) to learn about current programming best practice, and for IT staff to gain a greater appreciation of the domain from other experts. The advantage of the code review process is that it is fairly lightweight and effective. I think it would be great for any project leaders or supervisors to encourage their employees to take part in collaborative processes to enable better cross-fertilisation of ideas.

Introduction of Unit Testing and other practices

In most organisations where I have worked, there have been few standard development guidelines or practices, either formal or cultural. The downside of strict formal guidelines is that development may be slow and/or restricted. The upside of having some cultural guidelines is that developers can learn a lot. Rather than having to find out for themselves what the best practices are, they have an opportunity to learn from their colleagues. While I am sure most IT staff are skilled individuals, there is always something more for everyone to learn. Unit testing is one particular practice which is gaining wide adoption in the Open Source community for the purpose of improving software reliability. Others include code reviews (as mentioned), and a variety of bug tracking and project management systems. I make use of some of these tools, but not all of them, as a result of the particular experiences and work cultures which have led me to where I am today. Introducing just a few standard processes and tools, as well as a culture of awareness about new trends, should help developers to get to grips with these processes and tools without having to experience the (overwhelming) feeling of "going it alone".

Connecting with other groups

One principle of open source development is that there is no point in everyone fixing the same problems again and again. Rather than twenty people fixing the same problem twenty times, it should only be fixed once. The larger the community, the greater the efficiency: the cost of resolving a bug can go from 1:1 for an individual developer, to 20:1 when an open project is shared among 20 people, to hundreds or thousands to one. One way to do this within an organisational ecosystem would be to find shared communities of practice where common software may be used. Issues solved by one person are fixed for everyone else's benefit, and issues encountered by one person can be overcome with the help of everyone else. Connecting with groups outside of the organisation would afford the opportunity to increase the size of the group and thereby increase the efficiency of resolving issues within these shared systems. While there are some mailing lists which exist to service this need, they may be little-used (it really depends on the organisation, the software being used and the people involved).
There is an opportunity for developers to encourage the use of these communication channels throughout their organisation, but this will need to be the result of an organic process by which open source tools become part of the everyday arsenal.
Feedback to: tleeuwenburg@gmail.com