Data linking with kblog
Published in Technology
  • So, today I am going to talk about data linking with knowledge blog. Normally, talks start at the beginning. I thought to buck this trend and instead...
  • Start at the end... The long tail was mentioned yesterday. Much research data comes from individual research labs, from individual researchers, each producing relatively small amounts of data, but collectively producing a lot. So, long tail or big science? My field, bioinformatics, does both.
  • But the data from the long tail and big science is different. Big science generally produces sequence data, which is generally all of the same type. The long tail doesn’t. For example, we start with microarray expression data. Then we have MIAME-compliant metadata, an RNA degradation plot and finally a paper, in this case a random one that I found on PLoS yesterday. We have data standards for many of these parts. The second part, often called “metadata” even though it isn’t, uses MIAME, which is one of the older information content standards in bioinformatics. To me, all of this is data. Without the latter three, the “raw data” is just junk.
  • The paper is the richest form in terms of expressivity: it carries the most complex ideas and uses the largest vocabulary. It is also the least open to reuse, although in general it gives meaning to all the rest. And it is the form of scientific data storage which has changed the least.
  • So, what is the problem? Well, first, the process of publishing is very time-consuming. Secondly, it’s very expensive. And finally, to misquote Douglas Adams, it’s a process which is so amazingly primitive that we still think PDFs are a pretty neat idea. But in general, this form of data capture only happens for the most cherry-picked data: the positive data, the significant data, the data where the experiment worked. What about the negative data, the insignificant data? What about the standard operating procedure, the tutorial information and so on? This is not a small issue: the massive publication bias in biology hampers our understanding of the way that organisms function. In medicine, people die not through lack of knowledge, but because we cannot collate information that exists.
  • So, why is this the case? Well, scientific publishing is basically still at the stage of coach building. Consider these stats: the second biggest STM publisher in the world looks like this, and costs 1.5 billion euros per annum. This is Elsevier. The biggest looks like this. It only costs 10 million dollars per annum. This is Wikipedia. Is this comparison fair? Are the two equivalent? No, probably not, but they are not two orders of magnitude different either.
  • Consider, for example, this process from one of the major publishers that I have published with. I wrote my article in LaTeX. I converted it to PDF. The website converted it to another PDF (which I had to check). The publishers then (and this is true) converted it to a Word doc. From there, they turned it into XML, which was finally converted to HTML and, yes, you guessed it, another PDF. Now, not only is this a waste of time, but it’s inaccurate. Errors happen. And as for trying to get structured or data-linked publications through this process: you might as well give up.
  • My solution: WordPress. Actually, more importantly, commodity software. And by commodity, I mean commodity, and not research. There are some excellent, widely used tools from academia. Open Journal Systems, for example, powers 6000 journals. WordPress is behind 10% of ALL websites.
  • Why WordPress? Well, it has an edit dialog. But it’s not very good. But you can blog from Word; I don’t think that is very good either. But it is the way that it is: it’s what people use. So WordPress fits in with people’s workflows. It supports everything. Nothing would ever convince me to add this level of support to a tool.
  • What other features are suitable for academic publishing? Well, here we borrowed, stole and occasionally wrote our own. Reviewing comes courtesy of Edit Flow. Metadata and crawlability features we added. Multiple authors we borrowed. These allow archiving, which comes from the UK Web Archive, and also searchability (Google Scholar).
  • Bi-directional links. As well as permalinks, it also supports legacy identifiers in the shape of DOIs, thanks to DataCite. And it’s extensible. So I added nice-looking maths (scalable, thanks to MathJax) and syntax highlighting. Bibliographic support exists. We can do typed linking with CiTO (thanks to David Shotton), although clunkily at the moment. This will be improved. We also want to add client-side rendering: the user should choose the citation format. And finally, EPUB and even PDF export.
  • We also want to extend bi-directional linking. Blogs do this out of the box, but support is required at both ends. And finally, we want to be able to embed the data directly into the paper.
  • So, why are people not doing this already? I’ve now spent a fair bit of time learning PHP and JavaScript. And while poking around in the innards of WordPress I have discovered something that I now reveal to you.
  • Short articles, single author, example based articles.


  • 1. Data linking with kblog
    Phillip Lord
    Newcastle University
  • 2. The Long Tail
  • 3. Example Data
    1007_s_at 2.867330709
    1053_at 10.50302152
    117_at 2.702517066
    121_at 3.052316166
    1255_g_at 2.278998026
    1294_at 5.360226024
    1316_at 5.496447322
    1320_at 4.475412175
    1405_i_at 2.301359647
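
To make the shape of this example data concrete, here is a minimal Python sketch that reads the rows above into a dictionary. The slide does not label its columns, so the reading used here (a probe-set identifier followed by an expression value) is an assumption, and the names `EXAMPLE_ROWS` and `parse_expression` are hypothetical.

```python
# Rows copied from the "Example Data" slide.
# Assumption: column 1 is a probe-set ID, column 2 an expression value.
EXAMPLE_ROWS = """\
1007_s_at 2.867330709
1053_at 10.50302152
117_at 2.702517066
121_at 3.052316166
1255_g_at 2.278998026
1294_at 5.360226024
1316_at 5.496447322
1320_at 4.475412175
1405_i_at 2.301359647
"""


def parse_expression(text):
    """Return {probe_id: value} from whitespace-separated lines."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        probe_id, value = line.split()
        values[probe_id] = float(value)
    return values


expression = parse_expression(EXAMPLE_ROWS)
highest = max(expression, key=expression.get)
print(len(expression), "probes; highest value:", highest, expression[highest])
```

On their own these numbers say nothing about the experiment, which is the point made in the talk: the metadata, the plot and the paper are what give them meaning.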
  • 4. The paper
  • 5. The problem?
  • 6. Coach Building
    250,000 articles per year
    240 million Downloads
    Cost: 1.5 Billion Euro
    17 million articles
    > 20 languages
    365 million readers
    Total Cost: 10 million dollars
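
The figures on this slide can be compared with some rough arithmetic. The sketch below is only indicative: it mixes euros and dollars without conversion, and it sets a yearly article count for the publisher against a total article count for Wikipedia, so the per-article numbers are not like-for-like. All variable names are hypothetical.

```python
import math

# Figures as given on the slide.
elsevier_cost_eur = 1.5e9            # cost, in euros
elsevier_articles_per_year = 250_000
wikipedia_cost_usd = 10e6            # total cost, in dollars
wikipedia_articles_total = 17e6

# Cost per article (publisher: per year; Wikipedia: over all articles).
cost_per_article_eur = elsevier_cost_eur / elsevier_articles_per_year
cost_per_article_usd = wikipedia_cost_usd / wikipedia_articles_total

# Ratio of the two budgets, ignoring the currency difference.
budget_ratio = elsevier_cost_eur / wikipedia_cost_usd
orders_of_magnitude = math.log10(budget_ratio)

print("Publisher cost per article (EUR):", cost_per_article_eur)
print("Wikipedia cost per article (USD):", round(cost_per_article_usd, 2))
print("Budget ratio:", budget_ratio)
print("Orders of magnitude:", round(orders_of_magnitude, 2))
```

Under these assumptions the budgets differ by a little over two orders of magnitude, which is the gap the talk sets against how different the two services really are.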
  • 7. The process
  • 8. Our Solution
  • 9. Wordpress
    Has one critical feature
    It has an edit dialog
    Open Office
    By email
  • 10. Features
    Metadata – coins, metatags *
    Crawlability *
    Multiple authors
    Archiving (UKWA)
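
As an illustration of the "coins" metadata mentioned on this slide, the following Python sketch builds a COinS element: an empty HTML span with class `Z3988` whose title attribute carries an OpenURL key/value description of the article. This is not the kblog plugin code, which is written for WordPress; the function name `coins_span` and all of the example values (title, author, date, journal, DOI) are hypothetical.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, authors, date, journal, doi=None):
    """Build a COinS span describing a journal-style article."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
        ("rft.date", date),
    ]
    # One rft.au entry per author.
    fields += [("rft.au", author) for author in authors]
    if doi:
        fields.append(("rft_id", "info:doi/" + doi))
    # URL-encode the pairs, then escape "&" and quotes for the HTML attribute.
    context = escape(urlencode(fields), quote=True)
    return '<span class="Z3988" title="{}"></span>'.format(context)


# Hypothetical values, for illustration only.
example = coins_span(
    title="An example article",
    authors=["A. N. Author"],
    date="2011-01-01",
    journal="Example Knowledge Blog",
    doi="10.1234/example",
)
print(example)
```

Reference managers that understand COinS can pick up a span like this from the page, which is one route to the crawlability listed above.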
  • 11. Features
    Bi-directional links
    Permalinks (purls to follow)
    DOIs (datacite!)
    Nice maths * (and mathjax)
    Syntax Highlighting
    Bibliographic Support (with DOIs, and incompletely CiTO) *
    ePUB and PDF (!?) export
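
One possible way to express a typed, DOI-based link of the kind this slide lists is an HTML anchor whose `rel` attribute names a CiTO property. The sketch below is an assumption about how such markup could look, not a description of how kblog does it. It assumes the `cito:` prefix is bound elsewhere in the page to the CiTO namespace, and the function name `typed_citation`, the small set of allowed relations, and the example DOI are all hypothetical choices for illustration.

```python
from html import escape

# CiTO namespace; the page is assumed to bind the "cito:" prefix to it.
CITO_NS = "http://purl.org/spar/cito/"

# A small subset of CiTO properties, chosen for illustration.
ALLOWED_RELATIONS = {"cites", "extends", "usesDataFrom", "usesMethodIn"}


def typed_citation(doi, relation, label):
    """Return an HTML link to a DOI, typed with a CiTO relation."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError("unknown relation: " + relation)
    href = "https://doi.org/" + doi
    return '<a rel="cito:{}" href="{}">{}</a>'.format(
        relation, escape(href, quote=True), escape(label)
    )


# Hypothetical DOI, for illustration only.
link = typed_citation("10.1234/example-dataset", "usesDataFrom", "the microarray data")
print(link)
```

A link typed as `usesDataFrom` says more than a bare citation: it records that the article relies on the data set, which is the kind of data linking the talk is after.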
  • 12. Data Linking
    Bi-directional links require support at both ends
    Adding this generically
    Adding this for specific data sets (microarray)
    Data linking into papers
  • 13. Old technology
    Most of this technology pre-exists
    So why don’t people use it?
    There is a good reason...
  • 14. Content
    Now has 15k page views (not hits!)
    25 articles, multiple authors
    Seeking pubmed inclusion
    Advertising: two blog articles about ontogenesis appeared within 1 day of the first article.
    10 articles
    About scientific workflows
    Supplement to myExperiment
  • 15. Well...
    These stats are not going to scare either Elsevier or Wikipedia
    But, they are not bad either
    And it allows primary scientific content of many different forms
    We believe it can form part of the scientific landscape
  • 16. Acknowledgements
    Phillip Lord (me!)
    Dan Swan
    Simon Cockell
    Robert Stevens (Manchester)
    Georgina Moulton (Manchester)
    Thanks also to JISC, David Shotton, BL, Datacite, and wordpress.