Although presented in text form, scientific knowledge is normally based on the analysis of research data. Open access (OA) to data means making these data available to users and other researchers for (re-)use with as few restrictions as possible. The provision of OA to research data is often backed up with the argument that it facilitates the verifiability and reproducibility of the scientific results formulated in the texts. Hence, the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities does not limit the objects for which it aspires to achieve OA to scholarly and scientific documents but also includes research data. In the view of the Alliance of Science Organisations in Germany, the advantage of OA to research data is not only the transparency and quality assurance it creates by rendering research reproducible in principle but also the fact that it increases efficiency and saves costs by making it possible to conduct secondary analyses.
The arguments presented in favour of OA to texts and data are similar in part: in both cases, OA is expected to accelerate research and increase its efficiency by making access to scientific and scholarly knowledge as barrier-free as possible. A further argument in favour of OA that applies equally to scientific and scholarly texts and to research data is that they have been generated with the help of public funding. It is interesting to note that the requirement that OA be provided to research data is not always linked to the provision of OA to text publications, as closed-access journals sometimes also require their authors to provide OA to the data on which their articles are based.
Positions of academia and research funders
The criteria that OA to research data should fulfil are formulated in the Panton Principles. Drafted by a group of academics, the principles call for OA to research data on the internet, “permitting any user to download, copy, analyse, re-process, pass them to software, or use them for any other purpose without financial, legal, or technical barriers.”
Research funders no longer regard texts as the only significant scientific output, either. In its Grant Proposal Guide published in 2013, the U.S. National Science Foundation (NSF) no longer requires applicants to provide details of relevant publications as evidence of their expertise but rather details of research products, which also include data and software. Horizon 2020, the EU framework programme for research and innovation, also features a Research Data Pilot in which selected disciplines and funding areas are required to provide OA to their research data at the earliest opportunity.
The strategy and work programme of the Helmholtz Association includes the storage of primary scientific data at the data centres of the association. The German Research Foundation (DFG) requires project participants to archive data for a period of (only) ten years. Since 2011, proposals submitted to the NSF must include a data management plan (DMP) that outlines how the applicant intends to comply with the foundation's data sharing policy. The requirements of the Wellcome Trust go even further: since 2007, its policy on data management and sharing has stipulated that data generated by Trust-funded research must be shared and made openly accessible immediately after publication of the research results. Since 2003, the U.S. National Institutes of Health (NIH) have required recipients of funding of 500,000 USD or more in any year of the proposed project period to share their final project data. The provision of OA to the data must take place after publication of the main research findings. CODATA (Committee on Data for Science and Technology), a committee of the International Council for Science (ICSU), is the international organisation concerned with the quality management and exchange of scientific data. In its Principles for Dissemination of Scientific Data, published in 2000, CODATA endorses OA to data.
The Austrian Science Fund (FWF) also requires research data generated within the framework of FWF-funded projects to be made openly accessible, if this is legally and ethically possible. The following conditions must be fulfilled: a suitable repository must be chosen; the research data must be in a citable format; and unlimited re-use must be ensured.
Data generated with funding from the Swiss National Science Foundation (SNSF) must also be made available to other researchers and be self-archived in recognized collections of scientific data in accordance with the rules of the SNSF.
The Swiss University Conference (SUC) funding programme 2013-2016 P-2 “Scientific information: access, processing and safeguarding” promotes, inter alia, research data management and the publication of research data, for example within the framework of the project “Pilot-ORD@CH”. Moreover, the Swiss Centre of Expertise in the Social Sciences (FORS) has published a policy statement endorsing open data. Together with the Swiss Academy of Humanities and Social Sciences it has drawn up a manifesto on Data Access and Research Transparency (DART).
The increased importance of OA to text documents and scientific data (and to scientific software) reflects the advent of transparent and open science. However, the increased relevance of OA to research data is reflected not only in the policies of research funders and science organisations. Rather, research data are becoming stand-alone scientific objects that are treated and cited in the same way as texts. Hence, new and independent opportunities for self-archiving or publishing research data are emerging. Following Dallmeier-Tiessen (2011), they include:
- Self-archiving research data as stand-alone objects in a research data repository. These repositories, for example DRYAD in the biosciences, are devoted solely to research data.
- Publishing research data with a textual documentation in a data journal. These data journals, for example Earth System Science Data (ESSD) in the geosciences, even have some form of quality assurance. However, it is usually limited to the description of the data.
- Publishing research data as a means of enriching an interpretative text publication (usually a journal article). In this third scenario, data can either be deposited on the journal website or, as mentioned above, they can be self-archived in a research data repository, where they can be linked to the journal article.
Big science and data-driven science
For the most part, data are generated within the framework of conventional academic research (small science). Because of the broad range and diversity of this research, it is here that OA offers the greatest potential for the (re-)use of data. In a very different way, cases of data manipulation and forgery underline how necessary OA is for the verification and reproducibility of research results. As research data become more and more extensive and complex, they are now rarely presented – for example in tabular form – in the actual research publications. This availability problem can be solved by specialised research data repositories where scientific data can be self-archived, thereby making them publicly available.
Big science is particularly data-intensive: disciplines such as bioinformatics and the (empirical) geosciences and environmental sciences are based primarily on data that are often collaboratively generated, analysed, and interpreted by numerous institutions. Indeed, the organisation of big science is mainly collaborative, and models for the structural shift towards e-science can be found here. Collaborators are linked via data sharing as users and suppliers of data, and the data are stored in data centres or databases that are often networked or clustered.
GenBank and the Protein Data Bank are two success stories that illustrate the advantages of OA to data. In the words of Patrick Brown, Professor of Biochemistry at Stanford University: “The success of the genome project is in no small part due to the fact that the world's entire library of published DNA sequences has been an open access public source for the past 20 years. If sequences could be obtained only in the way that traditionally published work can be obtained – there would be no genome project” (National Academies, 2004: 37). Another example of the application of research data is the identification of the patterns of the spread of cholera using historical DNA, environmental, and other data. These patterns would not otherwise have become visible.
OA to research data is also the basis of data-driven science (Gray, 2007; Hey, Tansley & Tolle, 2009), which is regarded as a completely new research paradigm that succeeds the paradigms of
- Purely empirical science based on observation
- Science based on theory and model building, and
- Science that investigates complex phenomena in simulation studies by means of information technology
and that produces scientific knowledge by exploring large volumes of accessible data (Büttner, Hobohm & Müller, 2011). The more data that are openly accessible – that is, free of most usage restrictions – the more successful data-driven science will be.
Advantages of Open Access to data
In sum, the main advantages of OA to data include:
- The critical verifiability of data-based research results
- The avoidance of unnecessary duplication of research work by enabling secondary analyses to be conducted
- The extensive scientific analysis of data and their use (e.g. in follow-up projects)
- The acceleration of the research process through data sharing
- The generation of new knowledge by pooling data from different sources
- Informational added value and the creation of higher-grade data products (e.g. indices, databases) by pooling data
- Improved cost efficiency of data collections that are collaboratively built and used
- The promotion of the public and commercial re-use of data
- Increased citation of texts when access is provided to their underlying data
- Receipt of a scientific reward for providing access to the data through citation of the data themselves
Requirements that the publication of data must fulfil
A number of requirements, such as data integrity and long-term availability, are the same as those that apply to scientific and scholarly text publications. It is important to
- Ensure long-term discoverability of the data by means of permanent addresses (persistent identifiers). The preferred identifiers are digital object identifiers (DOIs). The registration of research datasets and the assignment of DOIs take place in the DataCite network. Germany is represented in DataCite by the German National Library of Science and Technology (TIB), the German National Library of Medicine (ZBMED – Leibniz Information Centre for Life Sciences), GESIS – Leibniz Institute for the Social Sciences, and the German National Library of Economics (ZBW); Switzerland is represented by the Swiss Federal Institute of Technology (ETH).
- Record the metadata of the data and the data collections. This is essential in the case of data that are archived independently of the corresponding publications, for example at data centres. The metadata are created on the basis of ISO standards and include discipline-specific elements (e.g. the IUPAC chemical identifier in chemistry). When self-archiving data in small science, simpler schemas such as Dublin Core can be used.
- Anchor references and licence terms and conditions in the data files (e.g. via uniformly coded identifiers).
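For self-archiving in small science, a minimal Dublin Core record is often sufficient, as noted above. The sketch below shows what such a record might look like when serialized with Python's standard library; all field values (title, creator, DOI, licence) are hypothetical placeholders, not a real dataset.

```python
# Minimal sketch of a Dublin Core metadata record for a self-archived
# dataset. All field values are hypothetical placeholders.
from xml.etree import ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: dict) -> str:
    """Serialize a dict of Dublin Core elements to an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record = dublin_core_record({
    "title": "Example measurement series",  # hypothetical dataset
    "creator": "Doe, Jane",
    "date": "2014-01-01",
    "identifier": "doi:10.1234/example",    # hypothetical DOI
    "rights": "CC BY 4.0",
})
```

Discipline-specific schemas would add further elements (instrument, geolocation, chemical identifiers, etc.); Dublin Core merely guarantees a lowest common denominator for discovery.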
The assignment of DOIs (or other persistent identifiers that guarantee the citability of the data) is of great importance because they render the data citable, and because citations constitute a kind of reward for providing access to the data.
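A DOI makes such a citation mechanical to assemble. The sketch below follows the citation style recommended by DataCite (creator, publication year, title, publisher, identifier); the dataset shown is a hypothetical example, not a registered DOI.

```python
# Sketch of assembling a dataset citation in the style recommended
# by DataCite: Creator (PublicationYear): Title. Publisher. Identifier.
# The dataset and DOI below are hypothetical examples.

def format_data_citation(creator: str, year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Return a human-readable citation string for a dataset DOI."""
    return f"{creator} ({year}): {title}. {publisher}. https://doi.org/{doi}"


citation = format_data_citation(
    creator="Doe, Jane",
    year=2014,
    title="Example measurement series",
    publisher="Example Data Centre",
    doi="10.1234/example",  # hypothetical DOI
)
# e.g. "Doe, Jane (2014): Example measurement series.
#       Example Data Centre. https://doi.org/10.1234/example"
```

Because the DOI resolves permanently, such citations can be counted like text citations, which is precisely what turns data provision into a measurable scientific reward.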
In the area of OA to research data there are specific legal problems and requirements. Key issues here are data protection and personal data, questions relating to the level of creativity or originality of the data (are they protected by copyright in the first place?), and database rights. The technical requirements that apply to the sometimes very large volumes of data, some of which are in special formats, differ from the usual (low) requirements that text repositories must meet.
Especially with a view to data-intensive research, the support and promotion of OA to data requires a suitable infrastructure. The organisations responsible for building suitable data centres are funding agencies, universities, and public research institutions. These organisations are also responsible for formulating policies for the selection, accessibility, and use of the data or information generated in their respective areas.
Collaborative and discipline-specific initiatives
The currently dominant activities and forms of organisation in the area of OA to data tend to be discipline-specific. The following categories can be distinguished:
- Open access disciplinary data centres and archives (e.g. GenBank, the Protein Data Bank, and PANGAEA)
- Mainly multidisciplinary research data repositories maintained by individual institutions such as universities, for example Open Data LMU at the University of Munich
- Repositories such as Zenodo, figshare, and LabArchives, which are open to researchers from various fields and institutions
- Distributed OA data networks such as the International Council for Science's World Data System (WDS)
- Discipline-specific initiatives such as the German Data Forum (RatSWD), which has set itself the task of bundling infrastructural competence in the social and economic sciences to ensure that researchers in the quantitative social and economic sciences have access to decentralised databases for research data
Overviews of research data repositories can be found in the Open Access Directory's list of repositories and databases for research data, in DataCite's list of repositories, and especially at re3data – the Registry of Research Data Repositories. Compared to the first two, re3data offers a database-based search and provides more detailed information about the repositories, their data formats, and their licence terms and conditions.
Future prospects of, and barriers to, Open Access to data
The following barriers to providing OA to data are sometimes mentioned:
- While most of the institutions and programmes in the area of big science have suitable data repositories at their disposal, the necessary infrastructure for comprehensive data sharing is still lacking, particularly in smaller institutions. This problem is being addressed by projects such as RADAR – the Research Data Repository, which aims to create a research data infrastructure that will promote research data management.
- In the current science system, commitment to processing data and making them accessible earns little recognition. And because of the time it takes, it does not have a very positive effect on the researcher's academic career, either. Therefore, in order to motivate authors to make their data openly accessible, it is essential that the provision of access in this way be recognized as a form of stand-alone citable publication and as a scientific achievement in its own right.
Data sharing – in particular the provision of OA to research data – opens up new opportunities to researchers in all areas in which data are used or generated. It also facilitates transparency and reproducibility, cost efficiency through re-use and secondary analyses, and new research approaches in data-driven science.
Büttner, S., Hobohm, H.-C. & Müller, L. (2011). Research Data Management. In S. Büttner, H.-C. Hobohm & L. Müller (Eds.), Handbuch Forschungsdatenmanagement (pp. 13–24). Bad Honnef: Bock + Herchen. Online: http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:525-opus-2412.
Dallmeier-Tiessen, S. (2011). Strategien bei der Veröffentlichung von Forschungsdaten. In S. Büttner, H.-C. Hobohm & L. Müller (Eds.), Handbuch Forschungsdatenmanagement (pp. 157–168). Bad Honnef: Bock + Herchen. Online: http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:kobv:525-opus-2412.
Gray, J. (2007). eScience – A Transformed Scientific Method. Talk presented to the National Research Council's Computer Science and Telecommunications Board, Mountain View, CA. Online: http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt.
Hey, T., Tansley, S. & Tolle, K. (2009). Jim Gray on eScience: A Transformed Scientific Method. In T. Hey, S. Tansley & K. Tolle (Eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery (pp. xvii–xxxi). Redmond, WA: Microsoft Research. Online: http://research.microsoft.com/en-us/collaboration/fourthparadigm/.