Thu, 04 Mar 2004 14:16:08 GMT

Accessing and preserving data. Tony Hey and Anne Trefethen, The Data Deluge: An e-Science Perspective, forthcoming in F. Berman et al. (eds.), Grid Computing, Wiley. The preprint is on Hey's web site. Abstract: “This paper previews the imminent flood of scientific data expected from the next generation of experiments, simulations, sensors and satellites. In order to be exploited by search engines and data mining software tools, such experimental data needs to be annotated with relevant metadata giving information as to provenance, content, conditions and so on. The case for automating the process of going from raw data to information to knowledge is briefly discussed. The paper argues the case for creating new types of digital libraries for scientific data with the same sort of management services as conventional digital libraries in addition to other data-specific services. Some likely implications of both the Open Archives Initiative and e-Science data for the future role for university libraries are briefly mentioned. A substantial subset of this e-Science data needs to [be] archived and curated for long-term preservation. Some of the issues involved in the digital preservation of both scientific data and of the programs needed to interpret the data are reviewed. Finally, the implications of this wealth of e-Science data for the Grid middleware infrastructure are highlighted.”

Also see Why engage in e-science? Library Information Update, March 2004, an anonymous commentary on the Hey-Trefethen article. Excerpt: “Librarians may not have noticed, but there is a revolution going on – the democratisation of science. This is a sub-agenda of the campaign to persuade researchers to deposit their research results in open access archives. It is not all about breaking commercial publishers' monopoly of copyright in scientific journals. It means that someone who didn't do the original research will be able to analyse someone else's data and even win the Nobel Prize, using that data. And it reflects the fact that, in science and engineering, at least, the data changes. This creates its own challenges, because research databases have complex metadata and you need to make sure that the metadata also changes appropriately – which is where librarians come in.” [Open Access News]


It would be good if we could develop and require standards for publishing, which would then let us build a document infrastructure on top of those standards. Right now, the difficulty is that very little of the content exists in a standards-compliant format, so parsing it is labor-intensive. It could be done, and it probably should be done, but it will be expensive work if it is performed in the academy rather than outsourced, as Time did for its archives.