The title of this post sums up part of the discussion at the recent "Critical issues for the preservation of datasets" workshop I attended at SLU (the Swedish University of Agricultural Sciences) in Uppsala, Sweden last month. It's not the whole story, but it reflects the constant tension that arises when the cost and effort of maintaining long-term data availability are balanced against the cost of, and desire for, new research. This workshop is the second I've attended that brought scientists, archivists, librarians, and IT specialists from universities, government and non-government organizations, and industry together to explore issues and solutions raised by long-term preservation of, and access to, scientific datasets.* One recommendation from this workshop is that these kinds of interdisciplinary interactions need to continue.

A mind-map outline of my notes is here:


What follows is a very brief summary of some of the questions and issues discussed at the workshop. The program can be found at the conference link above. I will post an update when the presentations are available.

(1) What institutions are best suited to preserving long-term archives of scientific data? Universities seem like prime candidates, because many have longevity. On the other hand, university departments do not seem to last. What about national archives? In the social sciences the European model seems to favor national archives. In North America social science data is managed by consortia.

(2) What are the future roles for national archives? The Swedish National Archives (Riksarkivet) is responding to e-government requirements for access and transparency. Is the national archive a repository or a safe haven? Is it a service center? Is it a resource for cultural heritage?

(3) Where are the available datasets, and what is the quality of the data and their descriptions? How can data authenticity be confirmed? These questions were asked repeatedly by the scientists and data managers. Scientists and researchers had plenty of stories of poorly documented and lost datasets.

(4) What about those datasets anyway? What are good archival formats for datasets and databases? The SLU Digital Preservation Pilot Project (in Swedish) is testing the feasibility of using DSpace and the OAIS framework to manage a collection of scientific datasets. This project is documenting the human effort required both to understand existing datasets and to build tools that provide effective markup useful for people and machines.

(5) Can the concepts and practices of archiving and records management help with the preservation of scientific datasets? One goal of archival appraisal practice is to store enough information with the datasets so that future judgements can be made about their continuing value, reliability, authenticity, and accuracy. For a data collection to remain useful, appraisal judgements need to be made continually by people other than those who collected and documented the data. One issue is that, today, criteria for appraising scientific datasets are in their infancy. We are still exploring how to document datasets for appraisal and other long-term purposes. The good news from workshops like this one is that people are working on it and talking about it.

Some of the discussion highlights from the two days:

  • The unrelenting issues of long-term preservation: responsibility, resources, and appraisal.
  • Scientists need simple data documentation tools and practices; their primary focus is research.
  • The emphasis on open access to data collections requires supporting a wide variety, and potentially large numbers, of users.
  • Get data producers (researchers and others) involved with preservation issues early. (The Swedish projects are doing this.)
  • Science publishers and researchers have different relationships to the data; both need to be accommodated. [This raises the question: How can open access publication practices accommodate data?]
  • Preservation projects need to examine data preservation risks up front (develop a threat model).
  • Successful curation starts by developing an exit strategy: How will you ensure the collection exists when you can't or won't continue to care for it?
  • Digital data fragility needs re-examination; digital data might be longer lasting than we think.
  • Adopt a general systems perspective: avoid monocultures; data should be stored in many different kinds of systems; we need more experience with DSpace, Fedora, Greenstone, etc.

One thing I noticed and appreciated at this workshop was the free discussion during the formal sessions. The panel sessions were particularly participatory. Of course, the "famous Swedish coffee breaks" also helped, and were long enough to make new acquaintances, and to carry on discussions outside the formal sessions.

* The first was a joint CODATA/ERPANET workshop on "Selection, Appraisal, and Retention of Digital Scientific Data" held in Lisbon, Portugal, in December 2003. A summary of that meeting was published in the CODATA Data Science Journal (along with several of the papers).

