CLARION Design Principles

May 13, 2010

“Principled Design” is a term used occasionally (especially in computer architecture) to describe an approach in which you list and explain your assumptions and constraints before describing the design itself. This has a number of benefits:

  1. It’s often more constructive to debate assumptions than details.
  2. A set of principles is useful when weighing alternatives during the design of an architecture.
  3. A design often flows from the assumptions and constraints, so explaining the principles is often the easiest way of explaining the design.

The architecture of CLARION is based on a number of guiding principles:

Subsystems exist on the web

We have designed the system as a set of applications with defined roles, rather than as a monolith. This is sometimes a practical necessity, and it allows us to keep the parts relatively simple. By using web standards for interfaces (RESTful HTTP-based APIs, Atom feeds for data transfer) and making sure each subsystem implements its own security, we increase the flexibility of the system. It should also lead to easier re-use and interoperability.
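As a sketch of the kind of data transfer this implies, one subsystem can consume another’s Atom feed with nothing more than an XML parser. The feed content and entry values below are invented for illustration; a real subsystem would fetch the feed over HTTP rather than hold it inline.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# An invented example feed; in practice this arrives via an HTTP GET
# from another subsystem's RESTful interface.
feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Deposited datasets</title>
  <entry>
    <title>Crystal structure 42</title>
    <id>http://example.org/data/42</id>
    <updated>2010-05-13T10:00:00Z</updated>
  </entry>
</feed>"""

def entries(xml_text):
    """Yield (title, id) pairs for each entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    for entry in root.findall(ATOM_NS + "entry"):
        yield (entry.findtext(ATOM_NS + "title"),
               entry.findtext(ATOM_NS + "id"))

print(list(entries(feed_xml)))
```

Because the interchange format is a web standard, the consuming subsystem needs no knowledge of the producer’s internals, which is what keeps the parts simple.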

Prefer to reference rather than duplicate

We avoid duplicating data unless there is a clear need. Our Embargo Manager component holds only metadata; if access control is required, it proxies through to the original resources. We do make duplicate copies for the repositories Chem1 and Chem0, since we assume that the originating systems probably won’t provide curation services.
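The reference-not-duplicate pattern can be sketched as follows. The class and field names are invented for illustration, and the origin store is stubbed with a dictionary where a real deployment would issue an HTTP request; the point is that the manager stores a pointer and a status, never the content.

```python
EMBARGOED = "embargoed"

class EmbargoManager:
    """Holds metadata only; content requests are delegated to the origin."""

    def __init__(self, fetch):
        self.fetch = fetch    # function that retrieves from the origin
        self.records = {}     # item id -> {"origin": url, "status": ...}

    def register(self, item_id, origin_url, status):
        self.records[item_id] = {"origin": origin_url, "status": status}

    def get(self, item_id):
        record = self.records[item_id]
        if record["status"] == EMBARGOED:
            raise PermissionError("item under embargo")
        # No local copy: proxy through to the originating system on demand.
        return self.fetch(record["origin"])

# Stubbed origin for the example; in practice an HTTP GET to the source.
store = {"http://example.org/data/42": b"raw spectrum bytes"}
manager = EmbargoManager(store.__getitem__)
manager.register("42", "http://example.org/data/42", "open")
print(manager.get("42"))
```

Since the manager never copies the data, there is nothing to keep synchronised when the original changes.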

Prefer to do things automatically

It should go without saying, but we try to automate anything that would otherwise create unnecessary work for users.

Manual semantification as early as possible

The SPECTRa project report (http://www.lib.cam.ac.uk/spectra/documents/SPECTRa_Final_Report_v10.doc) described a “Golden moment” at which the researcher is concentrating on the data and can easily add informative metadata without much effort or recollection. If we can’t calculate metadata automatically, it should be collected from the user as soon as possible.
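One way to act on this at deposit time is to pre-fill everything that can be derived automatically and ask the researcher only for the remainder. This is a minimal sketch with invented field names, not CLARION’s actual deposit form.

```python
import os

# Fields we can derive without asking the user (invented examples).
AUTO_FIELDS = {
    "filename": lambda path: os.path.basename(path),
    "format": lambda path: os.path.splitext(path)[1].lstrip("."),
}

# Everything a deposit needs; the last two require the researcher.
REQUIRED_FIELDS = ["filename", "format", "experiment", "compound"]

def deposit_form(path):
    """Pre-fill derivable metadata; return the fields the researcher
    must supply now, at the 'golden moment'."""
    metadata = {name: derive(path) for name, derive in AUTO_FIELDS.items()}
    missing = [f for f in REQUIRED_FIELDS if f not in metadata]
    return metadata, missing

metadata, missing = deposit_form("/data/run7/sample.cml")
print(missing)  # the fields to prompt for immediately
```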

Automatic semantification as late as possible

Automatically producing metadata early in the process means more data to deal with, and especially more provenance to keep track of. This effort is wasted, since we could produce the metadata whenever we choose; and since we are continually improving our metadata extraction routines, the best metadata will be produced last.
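The principle amounts to keeping the raw data and running extraction on demand with the newest routine, rather than storing the output of an older one. The extractor versions and the record format below are invented for illustration.

```python
def extract_v1(raw):
    """Early extraction routine: recovers only the formula."""
    return {"formula": raw.split()[0]}

def extract_v2(raw):
    """Improved routine: also recovers the molecular mass."""
    fields = raw.split()
    return {"formula": fields[0], "mass": float(fields[1])}

# Ordered history of extraction routines; the last is the best.
EXTRACTORS = [extract_v1, extract_v2]

def latest_metadata(raw):
    """Apply the most recent routine to the stored raw data. Earlier
    output is never persisted, so there is no stale metadata or extra
    provenance to track."""
    return EXTRACTORS[-1](raw)

print(latest_metadata("C6H6 78.11"))
```

When a better routine is added to the list, every item automatically benefits on its next access, with no reprocessing backlog.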