1. Alex D. Wade
Senior Research Program Manager
External Research
Microsoft Research
Microsoft Corporation
2.
• Science @ Microsoft – and the role of Scholarly Communication
• Office 2007
  – File Format Overview
  – Bibliography Support
  – UI Extensibility
• A Sampling of Related Projects
3. Putting computing into science…
Applying Microsoft products and research technologies to advance the scientific research and engineering innovation process
Putting science into computing…
Ensuring that research community requirements are factored into future versions of Microsoft software
• Advancement of Science
• Global Collaboration
• Technology Excellence
• Interoperability
4.
• Science + computation are not the entire equation
  – Authoring, Analysis, Publishing, Discoverability, and Data Storage/Preservation are key components of scientists’ everyday work…and of Microsoft’s core businesses
• The scholarly community has made it clear to us:
  – Microsoft must improve its offerings throughout the scholarly communication lifecycle
• Our approach: Conduct prototyping projects and proofs-of-concept to evolve Microsoft’s scholarly communication offerings
5.
• Data Acquisition and Modeling
  – Data capture from source, cleaning, storage, etc.
  – SQL Server, SQL Integration Services, Windows Workflow Foundation
• Support Collaboration
  – Allow researchers to work together, share context, facilitate interactions
  – SharePoint Server, OneNote 2007 (shared)
• Data Analysis, Modeling, and Visualization
  – Mining techniques (OLAP, cubes) and visual analytics
  – SQL Analysis Services, BI, Excel, Optima, SILK (MSR-A)
• Disseminate and Share Research Outputs
  – Publish, Present, Blog, Review and Rate
  – Word, PowerPoint
• Archiving
  – Published literature, reference data, curated data, etc.
  – SQL Server
Microsoft is the only company that can offer end-to-end support
6.
• Optimize for data-driven research & science
  – To both data (scientific) and to information (scholarly publications)
  – Reproducible research + computational science
  – Properly document / annotate scholarly output
• Interoperability is paramount
  – Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement
  – Customers have told Microsoft that interoperability (and intellectual property) are OUR responsibility
• Data preservation (and provenance) should be baseline
  – Documentation of the data’s provenance
  – Reliable and secure long-term storage – at a massive scale
  – Preservation needs to be like “accessibility” features – i.e., assumed as required
• Social networking & semantic knowledge discovery
  – Harnessing collective intelligence must be a consideration – since accessing research is a core step in the life-cycle. Enable knowledge discovery
  – Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily
• Metadata conventions / taxonomies / ontologies
  – This is a crucial strength for libraries – and a critical component in enabling Web 2.0
7.
• New file format
  – New file extension (DOCX)
  – All content expressed in XML (Office Open XML)
  – Contained in a zip file (OPC)
• ECMA-376 specification & ISO Standard
  – OpenXML
  – Open Packaging Conventions
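As a minimal sketch of the packaging idea on this slide: a .docx is an ordinary zip archive whose parts are XML files. The Python below builds a bare package in memory and lists its parts. The content types and relationship type are written from the Open Packaging Conventions as best understood here; the helper names `build_minimal_docx` and `list_parts` are illustrative only, and Word may expect additional parts (styles, properties) beyond this skeleton.

```python
import io
import zipfile
from typing import List
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Declares the content type of every part in the package.
CONTENT_TYPES = (
    XML_DECL
    + '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Package-level relationship pointing at the main document part.
ROOT_RELS = (
    XML_DECL
    + '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_minimal_docx(text: str) -> bytes:
    """Illustrative helper: one paragraph of text in a three-part package."""
    document = (
        XML_DECL
        + f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f'<w:t>{escape(text)}</w:t>'
        '</w:r></w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document)
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> List[str]:
    """Return the part names stored in the zip container."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


if __name__ == "__main__":
    print(list_parts(build_minimal_docx("Hello, Open XML")))
```

Writing the bytes returned by `build_minimal_docx` to a file with a .docx extension gives something to open and inspect with any zip tool.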
8.
• Easy to access the different parts of the document
  – XML file
  – Images
  – Annotations
• Simpler to transform Word’s XML into other XML formats or extract relevant data
• Ability to build .docx files programmatically or through transformations
• Ability to extend Word UI (and content) to support additional or custom data
9.
• Compatibility pack
  – Open and save to docx from older Word versions
• Add-in to export to PDF or XPS
• ODF Converter
  – Open Source project on SourceForge
  – Provides two-way conversion between ODF and OpenXML (WordprocessingML, SpreadsheetML, and PresentationML)
  – ‘Save As ODF’ to be included in Office 2007 SP2
10.
• Manual Entry of Source Metadata
11.
• Sources saved as Bibliography XML
• Sources.XML contains all sources
• Sources can be imported into new documents for easy reuse
• Sources.XML can be shared between users
• Documentation Styles are XSLTs
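Because sources are stored as plain XML, they can be read outside Word. The sketch below parses a small sample and formats one entry. The namespace and element names (`b:Source`, `b:Tag`, `b:NameList`, and so on) reflect the Office Open XML bibliography schema as best understood here and should be checked against a real Sources.xml; the sample record is invented for illustration. Word applies an XSLT stylesheet for each documentation style, whereas this example substitutes two plain Python functions with an APA-like layout purely to show the idea.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

B_NS = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B_NS}

# Invented sample record; a real Sources.xml holds one b:Source per source.
SAMPLE_SOURCES = f"""<b:Sources xmlns:b="{B_NS}">
  <b:Source>
    <b:Tag>Exa08</b:Tag>
    <b:SourceType>JournalArticle</b:SourceType>
    <b:Author><b:Author><b:NameList>
      <b:Person><b:Last>Example</b:Last><b:First>Ada</b:First></b:Person>
    </b:NameList></b:Author></b:Author>
    <b:Title>A Placeholder Article</b:Title>
    <b:JournalName>Journal of Examples</b:JournalName>
    <b:Year>2008</b:Year>
  </b:Source>
</b:Sources>"""


def read_sources(xml_text: str) -> List[Dict]:
    """Parse bibliography XML into a list of simple dictionaries."""
    root = ET.fromstring(xml_text)
    sources = []
    for src in root.findall("b:Source", NS):
        people = src.findall("b:Author/b:Author/b:NameList/b:Person", NS)
        sources.append({
            "tag": src.findtext("b:Tag", default="", namespaces=NS),
            "title": src.findtext("b:Title", default="", namespaces=NS),
            "journal": src.findtext("b:JournalName", default="", namespaces=NS),
            "year": src.findtext("b:Year", default="", namespaces=NS),
            "authors": [
                (p.findtext("b:Last", default="", namespaces=NS),
                 p.findtext("b:First", default="", namespaces=NS))
                for p in people
            ],
        })
    return sources


def format_citation(source: Dict) -> str:
    """Inline citation, e.g. (Last, Year). Stand-in for a style's XSLT."""
    lead = source["authors"][0][0] if source["authors"] else source["title"]
    return f"({lead}, {source['year']})"


def format_bibliography_entry(source: Dict) -> str:
    """Bibliography line in an APA-like layout (illustrative only)."""
    names = "; ".join(
        f"{last}, {first[0]}." if first else last
        for last, first in source["authors"]
    )
    return f"{names} ({source['year']}). {source['title']}. {source['journal']}."


if __name__ == "__main__":
    for entry in read_sources(SAMPLE_SOURCES):
        print(format_citation(entry))
        print(format_bibliography_entry(entry))
```

Switching the documentation style then amounts to swapping the formatting step while the stored source data stays unchanged, which is the role the XSLT files play in Word.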
12.
• Citations and Bibliographies can be inserted inline with a single click
• Automatically formatted according to the active Documentation Style
13.
• Ribbon Control
• Research Pane
• Smart Tags
14.
• Tools for Authors
  – Search Commands in Office
  – Ribbon for Researchers
• Semantic Information
  – Ontology-based markup of scholarly papers
  – Authoring of chemical drawings + semantic information
  – NLM DTD (Pablo Fernicola)
• Data Preservation & Access
  – File format preservation + interoperability
  – Scientific datasets for research reproducibility
  – Publisher submission workflow for dataset archiving
15. Search Commands in Office
Office Labs
Goals
• Office 2007 Add-in that aids in finding commands, options, wizards and galleries in Word, Excel and PowerPoint
• Includes Guided Help, which acts as a tour guide for specific tasks
Project Status
• Available now via http://www.officelabs.com/projects/searchcommands/
16. Ribbon for Researchers
Concept
17.
• Search against the Live Search Academic service straight from within Word
• One-click insert to the bibliography
• Integration with various services
18. Semantic Markup in Word 2007
with UC San Diego
Goals
• Semantic markup using domain-specific ontologies and controlled vocabularies
• Facilitate/automate referencing to PDB (and other resources) from manuscript
• A domain-specific ontology is downloaded and made available from within Microsoft Word 2007
• Authors can record their intention, the meaning of the terms they use based on their community’s agreed vocabulary
Project Status
• Phase 1 complete
• Beta testing with PLoS later this year
19.
• Domain-specific ontology
• Annotations travel with the document
• Can be used to improve domain-specific discovery of information, cross-linking, etc.
• Support for annotations straight from within Word
20. Chemistry Drawing for Office
Preliminary investigation
Goals
• Support students/researchers in simple chemistry structure authoring/editing
• Storage and transportability of semantic chemical data, not just images, via Chemical Markup Language (CML)
• Enable automatic extraction/harvesting of chemical data
Project Status
• Early investigation stage
• Will be encouraging on-going publisher feedback
21. PLANETS
Long-term Preservation of Digital Objects
Organization
• EU Commission Project, €14M for 4 years
• Consortium of 5 national libraries, 4 national archives, 4 universities and 4 industry partners
Goals
• Tools and methods for sustainable long-term preservation of digital objects
• Preservation of Office Documents based on OpenXML
Project Status
• OpenXML conversion tools available now:
  – http://research.microsoft.com/research/rpp/projects/MSConversionTools/OpenXMLConversionTools.htm
22. GenePattern for Word 2007
with Broad Institute @ MIT
Goals
• Integrate data/images from GenePattern workflows into research papers
• Allow for research reproducibility by combining data with the text
• Highlight OpenXML and Office 2007 technologies and break new research ground with the integration of data & workflows with research papers
• Testing/linkage to other labs – moving beyond initial installation
Project Status
• Currently in final phase of testing
• Will move into production in June 2008
• Code to be published at http://www.codeplex.com
23. Data Archive Project
with Johns Hopkins University
Goals
• Mechanism for long-term preservation of data sets
• Authoring tool to support creation of relationship resource map
• Use of OAI-ORE resource maps for collection description
• Workflow for text & data linkage between publisher and data archive
24. [Workflow diagram: author → publisher → archive]
• Author: Word 2007 OPC format contains data set(s) as well as resource map of relationships.
• Publisher: Publisher retains article and replaces it with the article URL. Forwards data to Data Archive.
• Archive: Archive stores data set(s) and returns data set URL(s) to publisher as part of updated resource map.
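The hand-offs in this workflow can be sketched as successive updates to a resource map. The Python below is only a toy model of the idea: it is not an OAI-ORE serialization, and the class, function names, part names, and example.org URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ResourceMap:
    """Toy model: maps each aggregated resource to its current location."""
    locations: Dict[str, str] = field(default_factory=dict)


def author_submit() -> ResourceMap:
    # Step 1: inside the OPC package every resource is still a local part.
    return ResourceMap({
        "article": "/word/document.xml",
        "dataset-1": "/data/dataset1.csv",  # hypothetical part name
    })


def publisher_process(rmap: ResourceMap, article_url: str) -> ResourceMap:
    # Step 2: the publisher keeps the article and records its public URL.
    updated = dict(rmap.locations)
    updated["article"] = article_url
    return ResourceMap(updated)


def archive_store(rmap: ResourceMap, archive_base_url: str) -> ResourceMap:
    # Step 3: the archive stores each data set and returns its new URL.
    updated = {}
    for name, location in rmap.locations.items():
        if name.startswith("dataset"):
            updated[name] = f"{archive_base_url}/{name}"
        else:
            updated[name] = location
    return ResourceMap(updated)


if __name__ == "__main__":
    submitted = author_submit()
    at_publisher = publisher_process(
        submitted, "https://publisher.example.org/articles/1")
    final = archive_store(at_publisher, "https://archive.example.org/data")
    for name, location in final.locations.items():
        print(name, "->", location)
```

Each step returns a new map instead of mutating the previous one, so the state at every hand-off remains available for inspection.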
25.
• Direct publisher/repository submission via Word
• Research Output Repository Platform
• Conference Management Tool
• eJournal Service
• …
Alex D. Wade
alex.wade@microsoft.com
http://www.microsoft.com/science/
26.
Compatibility packs for older versions of Word
• http://www.microsoft.com/downloads/details.aspx?FamilyId=941B3470-3A
Add-in for saving to PDF or XPS
• http://www.microsoft.com/downloads/details.aspx?FamilyId=4D951911-3E
SDK for OpenXML formats
• http://msdn2.microsoft.com/en-us/library/bb448854.aspx
Developer community forum
• http://openxmldeveloper.org/
Open Source OpenXML/ODF converter (both ways)
• http://sourceforge.net/projects/odf-converter/
27. Microsoft ventures into open access chemistry
Royal Society of Chemistry
By Richard van Noorden
January 29th, 2008
http://www.rsc.org/chemistryworld/News/2008/January/29010803.asp
Computational chemists have secured funding from computing giant Microsoft to showcase how chemistry can benefit from open access data sharing on the internet.
The two-year eChemistry pilot project represents a major test case for proposed new protocols for sharing scholarly information over the web, said Lee Dirks, director of scholarly communications at Microsoft Research. Microsoft's support is also a boost for the small band of chemists keen to promote open access internet publishing.
The public-private collaboration is one of many Microsoft projects to probe the potential of computing to advance scientific research, and bring back what they learn to improve the company's product line, Dirks told Chemistry World. "But chemistry is a discipline we've not typically worked in," he said. "From everything I've heard, it's not as progressive a field as, say, astronomy in use of the web."
Most chemical information on the web is published in closed journals and databases which guarantee high quality but also require a subscription to view. Pre-print servers, collaborative documents, open databases, video sites, online lab notebooks and blogs provide other ways of communicating research. Combining the lot offers the enticing prospect of a vast, free-to-access repository. This could transform the sharing of scientific research if the disparate data sources were machine-readable, so that a search engine could automatically gather data about a particular molecule from a crystal structure, a movie, an online lab book, and an archived article, for example.
Radical change
The international standards required for this challenge are being developed by the Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE), based at Cornell University, Ithaca, US. Their model protocols will be officially launched on 3 March at Johns Hopkins University in Maryland.
The eChemistry project, Dirks explained, was chosen as an exemplar to show that the new standards are actually useful to scientists. Chemists and computer scientists at Cambridge and Southampton universities in the UK, and Indiana, Cornell, and Penn State in the US, will search and index existing online databases and print archives; and work out how best to record chemistry data captured in lab experiments. The results will be hosted by the US National Institutes of Health open access PubChem database and other repositories.
28. http://chronicle.com/daily/2008/02/1585n.htm
Monday, February 11, 2008
Researchers Develop Online Tools for Science Collaborations
By LILA GUTERMAN
Blogs, wikis, and social-networking sites such as Facebook may get media buzz these days, but for scientists, engineers, and doctors, they are not even on the radar.
The most effective tools of the Internet for such people tend to be efforts more narrowly targeted to their needs, such as software that helps geneticists replicate one another's experiments. That was the underlying message of many presentations at the annual conference of the Professional/Scholarly Publishing Division of the Association of American Publishers held here last week.
Philip E. Bourne, a professor of pharmacology at the University of California at San Diego, spoke about the Web site SciVee, where scientists can link videos to their research papers that appear in open-access biomedical journals (The Chronicle, August 21, 2007). Mr. Bourne, who created the site, calls the videos "pubcasts"; they are typically about 10 minutes long and go into more detail than an abstract but less than the full-length article.
The videos are coming in at a trickle, says Mr. Bourne. (He attributes the slow rate to the high quality: the graduate students and postdoctoral researchers who make the videos have been crafting polished presentations.) But some of the ones already online have been viewed more than 100,000 times. When the pubcasts are uploaded, Mr. Bourne has also witnessed a steep increase in downloads of the linked article.
Jill P. Mesirov described an application that she hopes will ultimately become mainstream for journals that publish computational science. Ms. Mesirov, director of computational biology and bioinformatics at the Broad Institute of Massachusetts Institute of Technology and Harvard University, has designed a way to make computational work repeatable by other scientists.
The software, called GenePattern, stores both data and analytical routines. As the researcher works to collect and analyze the data, GenePattern records the steps the scientist has taken, so that anyone else can follow the steps and check the result or expand on the method using new data. Ms. Mesirov said that more than 6,000 people from more than 100 countries use the software.
She is now working with Microsoft to link such information to manuscripts that could be published online by peer-reviewed journals, to give readers access to a researcher's computational methods. "One of the problems with publishing a paper that relies heavily on computational work," she said, "is that all of the methods that you would need to reproduce it never appear in the journal. If you're lucky, they're in the supplementary material [online]. How much better if the journal had a link to the paper which had the data and an instantiation of the method embedded right in that paper."
29. How can we be sure we’ll remember our digital past?
Christian Science Monitor
By Chris Gaylord
February 13th, 2008
http://www.csmonitor.com/2008/0214/p13s02-stct.html
Fading media, formats
The problem of digital preservation reaches across two standards. There's the media – floppies, CDs, hard drives – and the format of the files themselves – does it run in DOS, Hypercard, ClarisWorks 2.0?
Microsoft tackles this issue of "legacy" computing by running a kind of corporate museum. The company protects its multiplatform history by preserving old copies of "every major hardware and software change," says Lee Dirks, director of Scholarly Communications at Microsoft and a task force member.
"We've got computers stored on campus that go back to the Altair, the first computer [to run Microsoft software]," he says. "In fact, we bought multiple copies of the Altair just in case."
But maintaining antique computers is a costly way to keep the past alive.
A concept that is gaining momentum, Mr. Dirks says, is emulation, where programmers trick modern computers into thinking the way their classic cousins did. This lets them run old software without retro machines. Another problem arises when the emulator itself is written for last generation's operating systems. Do you write an emulator to handle the original emulator?
A more likely approach to long-term preservation is migration, says Berman. This calls for updating the file format every generation – without changing the contents, one hopes. This method has problems, as well. Some of the original context will be lost in translation, says Dirks. Also, the scale of the conversion will snowball as the number, size, and back-catalog of the files increases with each passing generation of technology.
30.
• ICSTI Annual 2007 – Jun07
• Nature Asia-Pacific Summit – Jun07
• CODATA Summer School – Jul07
• DCC Annual 2007 – Dec07
• iSchool Conference 2008 – Feb08
• OAI-ORE Launch – Mar08
• BioMed Central 2007 Research Awards – Mar08
• Open Repositories 2008 – Apr08
• JCDL Annual 2008 – Jun08
31.
• “Global Research Library 2020” with University of Washington (Oct07 and Mar08)
• Participating in two applications to the final round of the NSF “DataNet” solicitation (as an unfunded partner)
• Sponsoring BioMed Central’s 2007 Research Awards (Mar08)
• Aug07 Issue of CT Watch Quarterly (v. 3, no. 3): “The Coming Revolution in Scholarly Communications & Cyberinfrastructure”
  – http://www.ctwatch.org/quarterly/articles/2007/08/
• New Scholarly Publishing website at:
  – http://www.microsoft.com/mscorp/tc/scholarly-publishing.mspx
