Open Access to data
Because the organisation and use of data via data centres and data sharing are becoming increasingly important for research, it is essential that not only publications but also research data be openly accessible. And since every publication in the empirical sciences is based on data, the Berlin Declaration on Open Access (OA) applies just as much to data as it does to publications.
Research data can be integrated in publications, documented indirectly, for example via links in publications, or made available in the form of independent data sets. Data is mainly collected in academic and university research (small science). Because of the wide range of research conducted in this area, it offers the greatest potential for providing open access to data and permitting its (re-)use.
Since research data is becoming more and more extensive and complex, it is rarely presented in the publications themselves, for example in tabular form. Recent cases of data manipulation and forgery highlight the importance of Open Access to the original data as a means of ensuring the verifiability and reproducibility of research results.
Big science is particularly data-intensive. For example, work in disciplines such as bioinformatics, (empirical) geoscience and environmental sciences is based primarily on data which is collected, analysed and interpreted collaboratively. Indeed, big science is mainly organised collaboratively and furnishes prime examples of the current structural transition to e-Science. Collaborators are linked as users and suppliers via data sharing, and the data is stored in data centres or databases which are often linked or grouped together in clusters.
Because of the added value it brings, Open Access to data is especially worthwhile and gives research completely new opportunities. GenBank and the Protein Data Bank are two exceptionally successful examples: "The success of the genome project is in no small part due to the fact that the world's entire library of published DNA sequences has been an open access public source for the past 20 years. If sequences could be obtained only in the way that traditionally published work can be obtained – there would be no genome project" (Patrick Brown 2004). Another example: by combining historical DNA, environmental and other data, it was possible to find cholera distribution patterns which would not otherwise have been detectable.
Advantages of Open Access to data
In a nutshell, the main advantages of Open Access to data are:
- Research results based on data can be verified and critically examined.
- Unnecessary duplication of research work can be avoided.
- Data can be analysed comprehensively and made use of, for example in follow-up projects.
- The research process can be accelerated through data sharing.
- New findings can be achieved by merging data from different sources.
- The merging of data brings informational added value and yields higher-quality data products, for example indices and databases.
- Data sets which are collaboratively assembled and jointly used are more cost efficient.
- Open Access promotes re-use of data by the public and by industry.
Promotion of Open Access to data by scientific organisations
In disciplines such as astrophysics, high energy physics and molecular genetics it is customary for data either to be made accessible shortly after collection, or to incorporate links to data sources in publications, or to deposit the data on which the publications are based in a central database.
CODATA (Committee on Data for Science and Technology), a sub-organisation of the International Council for Science (ICSU), is the international organisation in the field of quality management and the exchange of scientific data. In its Principles for Dissemination of Scientific Data published in 2002, CODATA endorses Open Access to data.
In its Declaration on Access to Research Data from Public Funding, the OECD's Committee for Scientific and Technological Policy (CSTP) expresses its commitment in principle to Open Access to research data while giving due consideration to intellectual property and economic interests. For the National Institutes of Health (NIH), data sharing is a term and condition of the award of grants of $500,000 upwards. In its Policy on data management and sharing (January 2007), the Wellcome Trust requires that data generated by the research which it funds be shared, and - in line with its Position statement in support of open and unrestricted access to published research - makes grant approvals conditional upon the provision of the freest possible access to research results. The strategy and work programme of the Helmholtz Association provide for the storage of primary scientific data in the organisation's own data centres. The German Research Foundation (DFG) obliges grantees (only) to archive data for a minimum period of ten years.
Prerequisites for data publication
Some of the prerequisites for data publication such as integrity and long-term availability are the same as those which apply to scientific and scholarly publications. The following criteria are important:
- long-term findability by means of persistent identifiers. The most commonly used identifiers are Digital Object Identifiers (DOIs). The DOI agency for scientific raw data in Germany is the German National Library of Science and Technology in Hanover (TIB) which was the first institution in the world to assume this function.
- the recording of metadata for each data object or set. This is essential in the case of data which is not linked to a particular publication but which, for example, is stored in databases. Indexing is carried out in accordance with ISO standards and includes discipline-specific elements (for example the IUPAC International Chemical Identifier, InChI, in the case of chemistry). Simpler metadata sets such as Dublin Core are suitable for the self-archiving of data in small science.
- the incorporation of source references and licence conditions in the data files, for example via a uniformly coded identifier.
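The three criteria above (a persistent identifier, a metadata record and embedded licence information) can be illustrated with a minimal sketch of a simple Dublin Core record for a self-archived data set. This is an assumption-laden illustration, not a prescribed format: the DOI, titles, names, licence URL and the `metadata` wrapper element are hypothetical placeholders, and the helper names `build_dc_record` and `doi_to_url` are invented for this example. Only the Dublin Core element names and namespace are taken from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a simple Dublin Core record as an XML string.

    `fields` maps Dublin Core element names (e.g. "title") to a single
    value or a list of values. The <metadata> wrapper is a hypothetical
    container chosen for this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


def doi_to_url(doi):
    """Turn a bare DOI into a link via the doi.org resolver."""
    return "https://doi.org/" + doi


# All values below are hypothetical placeholders.
EXAMPLE_DOI = "10.1234/example.dataset.001"

record = build_dc_record({
    "title": "Example sediment core measurements",
    "creator": ["Doe, Jane", "Roe, Richard"],
    "identifier": "doi:" + EXAMPLE_DOI,   # persistent identifier
    "rights": "https://creativecommons.org/licenses/by/4.0/",  # licence
    "type": "Dataset",
    "date": "2009-06-01",
})

print(record)
print(doi_to_url(EXAMPLE_DOI))
```

A record of this kind could be stored alongside the data file or embedded in its header, so that the source reference and licence conditions travel with the data wherever it is copied.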
The German section of CODATA initiated a project entitled Publication and citation of scientific primary data which was funded by the German Research Foundation (DFG) from 2003 to 2005. The project was realised collaboratively by TIB and the four German World Data Centers in the field of geosciences.
Legal issues
There are specific legal issues associated with Open Access to research data. Until recently, authors of data were advised to use the Creative Commons licence system to safeguard their rights when making their data openly accessible. Work is currently in progress on a new licence which will comply with recommendations made by initiatives such as Science Commons with regard to the implementation of Open Access to data.
Science Commons was launched in 2005 under the auspices of Creative Commons in order to meet the complex demands in the area of Open Access to scientific and scholarly data, tools and materials. The goal of Science Commons is to facilitate access to and the use and re-use of research data, and to identify and dismantle unnecessary barriers to the exchange of such data. 2005 also saw the initiation of another project dedicated to open access to data – the Global Information Commons for Science. It was launched jointly by CODATA, World Data Centers (WDC), the OECD, Science Commons and other organisations with the aim of coordinating the various initiatives dedicated to Open Access to research data, and, especially, of facilitating the re-use of the results of publicly-funded research.
EU legislation represents a barrier to Open Access to data because it establishes a sui generis right for data products in EU member states, regardless of whether copyright exists. As a result, in EU member states at least, this data cannot be used by others without the permission of the rights holder. In the case of data produced within the jurisdiction of German federal ministries and agencies, collaborative use within the meaning of Open Access is hampered by the fact that data-producing institutions (for example land surveying offices, the German Remote Sensing Data Centre (DFD) and the German Weather Service) are partly self-financing and need the revenue from the sale of their data.
Infrastructure
The support and promotion of Open Access to data calls for a suitable infrastructure, especially for broad-based research (small science). The main organisations responsible for building data centres are research funders, universities and public research bodies. They are also responsible for formulating policies on the selection of, access to and use of the data accumulated within their area of responsibility.
Collaborative and discipline-specific initiatives devoted to Open Access to data
At present, the main activities and initiatives devoted to Open Access (OA) to data are discipline-specific. They can be classified as follows:
- OA data centres and archives (for example GenBank, Protein Data Bank, Digital Sky Survey)
- Virtual observatories (for example International Virtual Observatory for Astronomy, Digital Earth)
- Distributed OA data networks (for example World Data Centers [WDCs], Global Biodiversity Information Facility, NASA Distributed Active Archive Centers). Four of the 52 World Data Centers are located in Germany. They have formed a cluster to promote Earth System Science.
Open Access to data: prospects and barriers
There are many barriers to Open Access to data:
- While big-science organisations and programmes generally have suitable data repositories of their own, the infrastructure needed for widespread data sharing, for example suitable databases, is still lacking. One possibility might be to follow up on the DFG-funded project Publication and citation of scientific primary data.
- Authors of data fear that it might be used by other scientists without due attribution or that their exploitation rights might be unreasonably restricted (see Legal issues).
- At present, the work that goes into processing data and making it available online receives little recognition in the scientific system and the time expended tends to impact negatively on the author's scientific career. Therefore, to motivate authors it is essential that making data openly accessible be recognised as an independent citable publication and a scientific achievement.
Data sharing – especially open data sharing – opens up new synergetic potential for research in all areas in which data is used or collected. As a result, it is an important issue for research and scientific funding.
In June 2009, the Electronic Publishing Working Group of the German Initiative for Networked Information (DINI) published a position paper on the subject of research data in collaboration with the Helmholtz Open Access Coordination Bureau.
Further information on Open Access to data can be found on the Helmholtz Association web pages.