CLARION project – Cambridge Chemistry Department
The data challenge
Chemistry laboratories produce many types of information and data – raw data, processed data, observations, chemical structures, reaction schemes, experimental write-ups, conclusions, graphs, images, crystallographic and spectroscopic data, papers, references, and so on. It is challenging to store this variety of information in a way that keeps it accessible and usable by a wide range of users. The challenges include:
• Storing data in formats that allow its use by specialist data processing tools
• Using data formats that are suitable for publication and long-term preservation
• Allowing certain data to be used by people outside the department
• Motivating researchers to make their data open
• Enhancing the meaning and context of the data to improve its usability
• Making the data searchable and easily navigable
• Ensuring that the system has minimal support overheads, yet continually evolves as required to meet changes in the IT environment.
Using an ELN
The Cambridge Chemistry Department has a basic repository that stores crystallographic data. Project CLARION (Cambridge Laboratory Repository In/Organic Notebooks) will create an enhanced repository that captures core types of chemistry data and ensures that they remain accessible and are preserved. The Chemistry Department is implementing a commercial Electronic Laboratory Notebook (ELN) system; CLARION will work closely with the ELN team to create a system for ingesting chemistry data directly into the repository with minimal effort from the researcher.
Enhancing and expanding data usage
CLARION will provide functionality to enable scientists to make selected data available as Open Data for use by people external to the department. The project will use techniques for adding semantic definition to chemical data, including RDF (Resource Description Framework) and CML (Chemical Markup Language). Many of these techniques will be extensible to other disciplines. CLARION will address general issues such as ownership of data, and it will publicise its results to the chemistry and repositories communities. The project will also work on developing a sustainable business model for operating the repository that the department can adopt after project completion.
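To give a rough idea of what adding semantic definition with CML and RDF can look like, the following is a minimal sketch rather than CLARION's actual implementation. It uses only the Python standard library to write a molecule as CML and to describe a dataset with a few RDF statements in N-Triples form. The function names, the example.org dataset URI, the sample title and creator, and the choice of Dublin Core terms for the metadata are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the CML schema; the "cml" prefix is our own choice.
CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)

# Dublin Core terms, assumed here as a plausible metadata vocabulary.
DCTERMS = "http://purl.org/dc/terms/"


def build_cml_molecule(mol_id, atoms, bonds):
    """Return a CML string for a molecule (hypothetical helper).

    atoms: list of (atom_id, element_symbol)
    bonds: list of (atom_id_1, atom_id_2, bond_order)
    """
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for atom_id, element in atoms:
        ET.SubElement(
            atom_array,
            f"{{{CML_NS}}}atom",
            {"id": atom_id, "elementType": element},
        )
    bond_array = ET.SubElement(molecule, f"{{{CML_NS}}}bondArray")
    for first, second, order in bonds:
        ET.SubElement(
            bond_array,
            f"{{{CML_NS}}}bond",
            {"atomRefs2": f"{first} {second}", "order": order},
        )
    return ET.tostring(molecule, encoding="unicode")


def to_ntriples(subject, properties):
    """Serialise (predicate, value) pairs about one subject as N-Triples.

    Values that look like http(s) URIs become resources; all others
    become plain string literals (hypothetical helper).
    """
    lines = []
    for predicate, value in properties:
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            escaped = (
                value.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return "\n".join(lines)


# Water as a small example molecule.
water_cml = build_cml_molecule(
    "water",
    atoms=[("a1", "O"), ("a2", "H"), ("a3", "H")],
    bonds=[("a1", "a2", "1"), ("a1", "a3", "1")],
)

# Illustrative metadata for an open dataset (URI and values are made up).
dataset_rdf = to_ntriples(
    "https://example.org/clarion/dataset/1",
    [
        (DCTERMS + "title", "Example spectrum of water"),
        (DCTERMS + "creator", "A. Researcher"),
        (DCTERMS + "license", "https://creativecommons.org/publicdomain/zero/1.0/"),
    ],
)

print(water_cml)
print(dataset_rdf)
```

The CML part records what the molecule is in a form that chemistry tools can parse, while the RDF statements record who produced the data and under what terms it may be reused. A production system would more likely use a dedicated RDF library and a controlled vocabulary agreed with the repository.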
The project runs for two years from April 2009. The initial pilot deployment of the ELN is scheduled for late 2009, and we hope to be publishing open data from it in early 2010.