So, today I am going to talk about data linking with Knowledgeblog. Normally, talks start at the beginning. I thought I would buck this trend and instead start at the end.
The long tail was mentioned yesterday. Much research data comes from individual research labs and individual researchers, each producing relatively small amounts of data but collectively producing a lot. So, long tail or big science? My field, bioinformatics, does both.
But the data from the long tail and from big science is different. Big science generally produces sequence data, which is generally all of the same type. The long tail doesn’t. For example, we start with microarray expression data. Then we have MIAME-compliant metadata, an RNA degradation plot and finally a paper, in this case a random one that I found on PLoS yesterday. Of these, we have data standards for many parts – the second part, often called “metadata” even though it isn’t, uses MIAME, which is one of the older information content standards in bioinformatics. To me, all of this is data. Without the latter three, the “raw data” is just junk.
The paper is the richest form in terms of expressivity – it carries the most complex ideas and uses the largest vocabulary. It is also the least open to reuse, although in general it gives meaning to all the rest. And it is the form of scientific data storage which has changed the least.
So, what is the problem? Well, first, the process of publishing is very time-consuming. Secondly, it’s very expensive. And finally, it’s a process which, to misquote Douglas Adams, is so amazingly primitive that we still think PDFs are a pretty neat idea. But in general, this form of data capture only happens for the most cherry-picked data: the positive data, the significant data, the data where the experiment worked. What about the negative data, the insignificant data? What about the standard operating procedure, the tutorial information and so on? This is not a small issue – the massive publication bias in biology hampers our understanding of the way that organisms function. In medicine, people die not through lack of knowledge, but because we cannot collate information that exists.
So, why is this the case? Well, scientific publishing is basically still at the stage of coach building. Consider these stats: the second biggest STM publisher in the world looks like this – and costs 1.5 billion euros per annum. This is Elsevier. The biggest looks like this. It only costs 10 million dollars per annum. This is Wikipedia. Is this comparison fair? Are the two equivalent? No, probably not, but they are not two orders of magnitude different either.
Consider, for example, this process from one of the major publishers that I have published with. I wrote my article in LaTeX. I converted it to PDF. The website converted it to another PDF (which I had to check). The publishers then (and this is true) converted it to a Word doc. From there, they turned it into XML, which was finally converted to HTML and, yes, you guessed it, another PDF. Now, not only is this a waste of time, but it’s inaccurate. Errors happen. And trying to get structured or data-linked publications through this process? You might as well give up.
My solution: WordPress. Actually, more importantly, commodity software. And by commodity, I mean commodity, and not research. There are some excellent tools from academia that are widely used. Open Journal Systems, for example, powers 6000 journals. WordPress is behind 10% of ALL websites.
Why WordPress? Well, it has an edit dialog. But it’s not very good. But you can blog from Word – I don’t think that is very good either. But it is the way that it is; it’s what people use. So WordPress fits in with people’s workflows. It supports everything. Nothing would ever convince me to add this level of support to a tool.
What other features are suitable for academic publishing? Well, here, we borrowed, stole and occasionally wrote our own. Reviewing – courtesy of EditFlow. Metadata and crawlability features we added. Multiple authors we borrowed. These allow archiving – this comes from the UK Web Archive. Also searchability (Google Scholar).
Bi-directional links. As well as permalinks, it also supports legacy identifiers in the shape of DOIs, thanks to DataCite. And it’s extensible. So I added nice-looking maths (scalable, thanks to MathJax) and syntax highlighting. Bibliographic support exists. We can do typed linking with CiTO (thanks to David Shotton), although clunkily at the moment. This will be improved – we also want to add client-side rendering, since the user should choose the citation format. And finally, EPUB and even PDF export.
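To make "typed linking" a little more concrete, here is a minimal sketch in Python of what a CiTO-typed citation could look like. It is an illustration only, not how the kblog plugins do it: the helper names `cito_triple` and `typed_link` are hypothetical, the DOI and article URL are made-up placeholders, and putting the relation in the `rel` attribute is an assumption. The CiTO namespace and the `usesDataFrom` relation are taken from the CiTO ontology itself.

```python
from html import escape

# Namespace of the CiTO ontology; relation names are appended to it.
CITO_NS = "http://purl.org/spar/cito/"


def cito_triple(citing_url, relation, doi):
    """Return a (subject, predicate, object) triple for one typed citation."""
    return (citing_url, CITO_NS + relation, "https://doi.org/" + doi)


def typed_link(doi, relation, text):
    """Render an HTML anchor that carries the CiTO relation in its rel attribute.

    Using rel="cito:..." is an assumption made for this sketch.
    """
    href = "https://doi.org/" + doi
    return '<a rel="cito:{}" href="{}">{}</a>'.format(
        escape(relation, quote=True), escape(href, quote=True), escape(text)
    )


# Hypothetical article URL and DOI, for illustration only.
triple = cito_triple("http://example.org/my-article", "usesDataFrom", "10.1234/example")
link = typed_link("10.1234/example", "usesDataFrom", "the expression data")
print(triple)
print(link)
```

The point of the type is that a reader, or a machine, can tell a link that says "I used this data" from one that merely cites or disagrees.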
We also want to extend bi-directional linking – blogs do this out of the box, but support is required at both ends. And finally, we want to be able to embed the data directly into the paper.
Short articles, single-author articles, example-based articles.
Data linking with kblog
  Phillip Lord
  Newcastle University
The Long Tail http://en.wikipedia.org/wiki/File:La_Palmyre_041-crop.jpg
Example Data
  ID_REF      VALUE
  1007_s_at   2.867330709
  1053_at     10.50302152
  117_at      2.702517066
  121_at      3.052316166
  1255_g_at   2.278998026
  1294_at     5.360226024
  1316_at     5.496447322
  1320_at     4.475412175
  1405_i_at   2.301359647
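As a small illustration of why even this "raw" table needs context, here is a sketch in Python that reads the ID_REF/VALUE rows from the slide into a dictionary. The function name `parse_expression_table` is hypothetical, and the assumption that columns are separated by whitespace is mine; a real file would be read from disk rather than from an inline string.

```python
# Rows copied from the example slide; whitespace-separated columns are assumed.
EXAMPLE = """ID_REF VALUE
1007_s_at 2.867330709
1053_at 10.50302152
117_at 2.702517066
121_at 3.052316166
1255_g_at 2.278998026
1294_at 5.360226024
1316_at 5.496447322
1320_at 4.475412175
1405_i_at 2.301359647
"""


def parse_expression_table(text):
    """Return {probe_id: value} from whitespace-separated ID_REF/VALUE rows."""
    lines = text.strip().splitlines()
    if lines[0].split() != ["ID_REF", "VALUE"]:
        raise ValueError("expected an 'ID_REF VALUE' header line")
    table = {}
    for line in lines[1:]:
        probe_id, value = line.split()
        table[probe_id] = float(value)
    return table


expression = parse_expression_table(EXAMPLE)
# Probe with the largest value in this small sample.
highest = max(expression, key=expression.get)
print(len(expression), highest)
```

The numbers parse easily enough, but nothing in the table says which array, which sample or which normalisation they came from, which is what the metadata, the plot and the paper supply.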
Features
  Bi-directional links
  Permalinks (purls to follow)
  DOIs (datacite!)
  Versioning
  Extensibility
  Nice maths (and mathjax)
  Syntax Highlighting
  Bibliographic Support (with DOIs, and incompletely CiTO)
  ePUB and PDF (!?) export
Data Linking
  Bi-directional links require support at both ends
  Adding this generically
  Adding this for specific data sets (microarray)
  Data linking into papers
Old technology
  Most of this technology pre-exists
  So why don’t people use it?
  There is a good reason...
  TECHNOLOGY IS BORING
Content
  http://ontogenesis.knowledgeblog.org
  Now has 15k page views (not hits!)
  25 articles, multiple authors
  Seeking PubMed inclusion
  Advertising: two blog articles about ontogenesis appeared within 1 day of the first article.
  http://taverna.knowledgeblog.org
  10 articles
  About scientific workflows
  Supplement to myExperiment
Well...
  These stats are not going to scare either Elsevier or Wikipedia
  But, they are not bad either
  And it allows primary scientific content of many different forms
  We believe it can form part of the scientific landscape
Acknowledgements
  Phillip Lord (me!)
  Dan Swan
  Simon Cockell
  Robert Stevens (Manchester)
  Georgina Moulton (Manchester)
  Thanks also to JISC, David Shotton, BL, DataCite, and WordPress.