A goal for the CLARION project is to make it easier for scientists to release their experimental results into the public domain as Open Data. We’ve been talking to some Principal Investigators in the Chemistry dept to hear their attitudes towards releasing data.
For all the PIs interviewed, the need to release data is not at the forefront of their minds. During the introductory preamble, they tend to look at you with a “Why are you asking me?” expression. The trick to making them think about open data is to find an angle that concerns them. Three things that are close to a PI’s heart are money, publications, and visibility. Good questions to ask are:
• Do any of your public-funding agencies require you to make your research results public?
• Have you needed to provide supporting experimental data for any of your papers?
• Would you like to increase the visibility and citations of your work?
Questions such as these help the researcher to realise that making their research data open could be advantageous to them – and that IT solutions could help them do it.
Almost without exception the PIs approve of the concept of Open Data. Several of them actively post data into public databases such as the Protein Data Bank. However, we find a range of opinions as to the timing of release. Some are happy to release data almost as soon as it’s collected; others after a paper has been published; and others would only do so after any intellectual property had been patented. As might be expected, the desire to protect intellectual property seems to inversely correlate with the “pureness” of the work. The more applied the science, the more patentable the work, and hence the greater the need to be sensitive to protecting IP.
A common concern from a PI is whether their group’s data would be useful to anyone else. It’s difficult to answer this with anything beyond “Well, you never know until you try it”. But again a good way to help them think is to ask them what data they’d like to see from other researchers in their area, or from papers that they’ve read. Just about always they will say that there’s something they’d like to see – commonly, the supporting data used for a graph.
A diversity of opinion is good; diversity is the seed from which the fittest will grow. However, it does tend to complicate any IT solution…
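To make that complication concrete, here is a minimal sketch of how the three release timings we heard (on collection, after publication, after patenting) might be modelled. It is purely illustrative and is not the CLARION embargo manager: the names `ReleasePolicy`, `Dataset` and `is_releasable` are hypothetical, and a real system would need more than dates to go on.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ReleasePolicy(Enum):
    """Hypothetical labels for the three attitudes described by the PIs."""
    ON_COLLECTION = "on_collection"          # release almost as soon as collected
    AFTER_PUBLICATION = "after_publication"  # release once a paper is published
    AFTER_PATENT = "after_patent"            # release once IP has been patented


@dataclass
class Dataset:
    name: str
    policy: ReleasePolicy
    collected_on: date
    published_on: Optional[date] = None  # None until the paper appears
    patented_on: Optional[date] = None   # None until the patent is secured


def is_releasable(ds: Dataset, today: date) -> bool:
    """Return True once the event chosen by the dataset's policy has happened."""
    trigger = {
        ReleasePolicy.ON_COLLECTION: ds.collected_on,
        ReleasePolicy.AFTER_PUBLICATION: ds.published_on,
        ReleasePolicy.AFTER_PATENT: ds.patented_on,
    }[ds.policy]
    return trigger is not None and trigger <= today
```

Even in this toy form, each dataset has to carry its own policy and its own trigger event, which is where the diversity of opinion starts to show up in the design.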
Sam Adams joined CLARION a few weeks ago as the project’s software developer. Sam has just submitted his PhD in chemistry informatics, and has lots to offer the project. We’re already starting to take Sam’s prototype of the embargo manager component around the chemistry department for feedback on the UI and on open data publication.
I’m hoping Sam will be able to make it to the JISC Dev8D event, so I can introduce him to many of you in person there.
At the JISCRI meeting the other week I was very interested to hear about the EP2DC project. It’s clear that finding, clearing and publishing data that supports already-published research is going to be an important aspect of CLARION (although ePrint publication will be a secondary concern) so it will be interesting to see what the EP2DC team come up with, and to share experiences.
The advert has finally come out. We’re looking for an experienced Java programmer, with skills in as many of the following as possible: XML, OO design, SPARQL, RDF, RESTful web development, Clojure (or Scheme), Triplestores. This is an exciting opportunity to apply interesting technologies to a challenging and worthwhile application: enabling the publication of Open Data to support science.
Closing date for applications is 24 Aug. See the university jobs website or next week’s Cambridge Evening News for details on how to apply.
Brian and I are attending the JISC programme meeting at Leicester Uni. We’ll be presenting CLARION very (like, 30 seconds) briefly tomorrow morning, but if anyone wants to chat about the project this evening then come and grab us!
CLARION is starting the process for recruiting a developer. We need someone who is smart, with experience with open software techniques and tools; good knowledge of XML; ideally someone who has experience within a scientific environment. And experience of RDF or CML would be the icing on the cake! Work is based in Cambridge. Do you know of anyone suitable…?
The data challenge: Chemistry laboratories produce many types of information and data – raw data, processed data, observations, chemical structures, reaction schemes, experimental write-ups, conclusions, graphs, images, crystallographic and spectroscopy data, papers, references, and so on. It is challenging to store this variety of information such that it is accessible and usable by a variety of users. The challenges include:
• Storing data in formats that allow its use by specialist data processing tools
• Using data formats that are suitable for publication and long-term preservation
• Allowing certain data to be used by people outside the department
• Motivating researchers to open their data
• Enhancing the meaning and context of the data to improve its usability
• Making the data searchable and easily navigable
• Ensuring that the system has minimal support overheads, yet continually evolves as required to meet changes in the IT environment.
Using an ELN: The Cambridge Chemistry Department has a basic repository which stores crystallographic data. Project CLARION (Cambridge Laboratory Repository In/Organic Notebooks) will create an enhanced repository that captures core types of chemistry data and ensures their access and preservation. The Chemistry Department is implementing a commercial Electronic Laboratory Notebook (ELN) system; CLARION will work closely with the ELN team to create a system for ingesting chemistry data directly into the repository with minimum effort by the researcher.
Enhancing and expanding data usage: CLARION will provide functionality to enable scientists to make selected data available as Open Data for use by people external to the department. The project will use techniques for adding semantic definition to chemical data, including RDF (Resource Description Framework) and CML (Chemical Markup Language). Many of these techniques will be extensible to other disciplines. CLARION will address general issues such as ownership of data, and it will publicise its results to the chemistry and repositories communities. Effort will be put into developing a sustainable business model for operating the repository that can be adopted by the department after project completion.
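As a rough illustration of what “adding semantic definition” with RDF can look like, the sketch below writes a few statements about a dataset as N-Triples lines. It assumes the Dublin Core terms vocabulary for the properties. The function `describe_dataset`, the `example.org` address and the choice of licence in the usage line are hypothetical, and none of this is meant to describe CLARION’s actual data model.

```python
from typing import List

# Dublin Core terms namespace (an assumed vocabulary for this sketch).
DCTERMS = "http://purl.org/dc/terms/"


def uri(value: str) -> str:
    """Wrap an address in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def describe_dataset(dataset_uri: str, title: str, creator: str,
                     license_uri: str) -> List[str]:
    """Return one N-Triples line per statement about the dataset."""
    subject = uri(dataset_uri)
    triples = [
        (subject, uri(DCTERMS + "title"), literal(title)),
        (subject, uri(DCTERMS + "creator"), literal(creator)),
        (subject, uri(DCTERMS + "license"), uri(license_uri)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


if __name__ == "__main__":
    # Hypothetical dataset address and an example public-domain licence.
    for line in describe_dataset(
        "http://example.org/data/spectrum-1",
        "NMR spectrum (example)",
        "A. Researcher",
        "http://creativecommons.org/publicdomain/zero/1.0/",
    ):
        print(line)
```

Each line is a subject, a property and a value, so a search tool or another repository can pick out the title, creator and licence without knowing anything about the file format of the data itself.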
Timelines: The project runs for two years from April 2009. The initial pilot deployment of the ELN is scheduled for late 2009, and we hope to be publishing open data from it in early 2010.
We’re happy to publicly announce the CLARION project, funded by the JISC to enhance the existing data repository at the Chemistry Department of the University of Cambridge, especially by integrating it with an Electronic Lab Notebook system.
You can read a little more about the project, tweet us @clarionproject or refer to us as #clarionproject.