Producing self-describing datasets that can be understood and properly re-used by colleagues or other scientists is a goal of data management (Strasser et al. 2012). Other, more technical goals include improving data availability, conserving data resources, preserving data integrity and using familiar structures and standards. Building on the article "The FAIR Guiding Principles for scientific data management and stewardship" by Wilkinson et al. (2016), a variety of reports and projects are developing specific guidelines and expectations for data management (e.g. the European Commission Expert Group on FAIR Data report "Turning FAIR into Reality" and the Enabling FAIR Data Commitment Statement).
“Research data are (digital) data that, depending on the scientific context, are related to, originate from, or are the result of a research process” (Kindling & Schirmbacher 2013). Research data are created by a variety of methods, depending on the research question. These include studying source material, experiments, measurements, descriptions, surveys, or polls. The data are the basis of scientific results. As a consequence, research data are discipline- and project-specific, with different requirements for processing and managing them.
Depending on the context, research data may include measurements, sensor data, laboratory results, audio-visual information, texts, survey data, software, simulations, images, objects from collections or samples that are the result of, were developed, or evaluated during scientific work.
Since research data are necessary to verify the results based on them, the preservation of such data is a recognised part of good scientific practice (e.g. Guidelines on the handling of research data of the German Research Foundation, in German: “DFG-Leitlinien zum Umgang mit Forschungsdaten”).
Journal articles and other traditional publications are far from being the only significant contributions in the advancement of knowledge in the 21st century (European Commission 2018). Research funding agencies and the scientific community increasingly demand open access to research data so that published research results can be verified and the data accessed for reuse. Publishing research data provides opportunities not just for researchers, but also for science in general:
Opportunities for researchers
Your research becomes more visible. Publications are cited significantly more often when the data are publicly available (Piwowar and Vision 2013).
The publication of research data is gaining more and more recognition as a scientific achievement.
You can increase the quality and credibility of your research by offering others a chance to verify your data.
You comply with the current requirements of the research funding agencies (see above).
You can secure your own research investment by setting blocking periods.
Opportunities for science
The publication of data opens up new potentials for research as data become available for re-analysis in the context of new research questions and methods or for combining data from different sources.
It also reduces the production of redundant scientific data, which saves time and money.
Some background information
As early as 2003, the “Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities” included “raw data and metadata, source materials, digital representations of pictorial and graphical materials and scholarly multimedia material” among its Open Access contributions. Ten years later, the science ministers of the G8 signed their commitment to Open Science: “…to the greatest extent and with the fewest constraints possible, publicly funded scientific research data should be open […] whilst acknowledging the legitimate concerns of private partners.” (G8 Science Ministers Open Data Charter, 2013).
Motivated by that, open data experts from governments, multilateral organizations, civil society and the private sector worked together to develop an International Open Data Charter (2015). It was adopted by more than 70 national and local governments and sets out six principles for the release of data: (1) open by default; (2) timely and comprehensive; (3) accessible and usable; (4) comparable and interoperable; (5) for improved governance and citizen engagement; and (6) for inclusive development and innovation.
The geoscientific community engages internationally as the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS). The COPDESS Statement of Commitment (2015) and the more recent “Commitment Statement in the Earth, Space and Environmental Sciences” (2018, developed within the Enabling FAIR Data Project) were widely adopted by publishers and research data repositories and are major initiatives for open research data and FAIR data practices.
An increasing number of funding agencies (e.g. the EU Pilot on open research data in the HORIZON2020 programme and the DFG “Leitlinien zum Umgang mit Forschungsdaten”, Guidelines on the Handling of Research Data) and research associations and institutes (e.g. Helmholtz Association, GFZ Data Policy) request that scientists publish their research data.
Publishing data and software is the act of making data available for re-use by others via research data repositories. This provides a means to gain visibility and credit for one’s scientific endeavors, and offers increased transparency and reproducibility of the scientific process. A data publication represents a stand-alone and ideally self-describing research product. Because they combine data and metadata and are supported by persistent identifiers (PIDs) and controlled disciplinary vocabularies, data publications can be regarded as a best practice for sharing data following the FAIR Principles for research data management. Data and software publications with Digital Object Identifiers (DOI) are fully citable in research articles and should be included in the reference lists.
Yes. GFZ Data Services offer the possibility to temporarily restrict the data access by defining an embargo period. Even though the data are not publicly accessible during the embargo, the DOI is registered and citable. The metadata are published and findable by search engines and in catalogues. Please contact our staff for further information.
For more than a decade, publishers have recognized that including the full data in scholarly literature enhances its value and is part of the integrity of the research. They therefore offered the option of adding electronic data supplements to scientific articles. A later investigation by the Coalition for Publishing Data in the Earth and Space Sciences revealed that “the vast majority of data submitted along with publications are in formats and forms of storage that makes discovery and reuse difficult or impossible” (COPDESS Statement of Commitment) and recommended data publications via research data repositories as best practice for data sharing.
They especially recommend “domain” repositories, i.e. research data repositories specialised in a scientific domain, because these often have domain scientists curating the data and can describe the data with more specific metadata than general repositories.
Many publishers, including Copernicus, Elsevier, Science, SpringerNature, Wiley, and societies such as the American Geophysical Union, the European Geosciences Union and the Geological Society of London have signed the Community Commitment Statements and have changed their data policies accordingly:
- actively ask for the availability of data underlying scientific results
- recommend publishing research data through dedicated (domain) repositories
- allow the citation of data in reference lists of scholarly literature
- no longer accept data supplements
A domain repository is a research data repository specialised for one or a few scientific domains. Domain repositories often have staff with domain expertise who usually provide a quality check on the metadata (and data). Domain repositories may also utilise standardised discipline-specific vocabularies. This greatly increases the discoverability and reusability of data.
You may visit the Registry of Research Data Repositories (re3data) for finding an appropriate repository. re3data is a global registry of data repositories and portals that covers all academic disciplines.
“Metadata, the information we create, store, and share to describe things, allows us to interact with these things to obtain the knowledge we need.” (NISO 2017).
There are different types of metadata: (1) descriptive metadata (for finding or understanding a resource); (2) administrative metadata with the subgroups technical metadata (for decoding and rendering files), preservation metadata (long-term management of files) and rights metadata (referring to intellectual property rights attached to the resource); and (3) structural metadata (defining the relationship of parts of a resource to one another) (NISO, 2017: Understanding Metadata).
For researchers and data curators aiming at publishing data, two aspects of descriptive metadata are particularly important: discovery metadata and contextual metadata.
- Discovery metadata include information on the authors and/or creators of the data, the title of the dataset, the year the data were published, the geographic location, a brief description of the dataset, keywords and cross references to related published articles, data, code or samples. For discovery metadata there are international standards across all disciplines. These may be complemented by controlled vocabularies used within specific domains.
- Contextual metadata are information required for reusing the data, such as an overview of the units of the variables in a table, information on data processing, instrument parameters, or processing steps. This type of metadata is often included in the header or made available in the form of README.txt files or other supplementary documents. In contrast to discovery metadata, contextual metadata are highly variable between the disciplines but key information for data reuse.
To make the automatic exchange of metadata possible, standardised, machine-readable metadata formats have been developed (XML, JSON). The GFZ Metadata Editor is one way to create metadata in XML format compliant with internationally agreed standards (i.e. DataCite, ISO 19115/19139, Dublin Core, NASA GCMD DIF). More recently, initiatives to make websites machine-readable have been exploring schema.org and JSON.
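As an illustration of what "machine-readable" means here, the following minimal sketch serialises one small discovery metadata record as both JSON and XML using only the Python standard library. The field names, the DOI and all values are hypothetical placeholders chosen for this example; they do not follow DataCite, ISO 19115 or any other schema, and the sketch is not how the GFZ Metadata Editor works.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical discovery metadata; every value is a placeholder.
record = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Doe, Jane", "Roe, Richard"],
    "title": "Example borehole temperature measurements",
    "publication_year": 2020,
    "keywords": ["temperature", "borehole"],
}


def to_json(rec: dict) -> str:
    """Serialise the record as JSON text."""
    return json.dumps(rec, indent=2, sort_keys=True)


def to_xml(rec: dict) -> str:
    """Serialise the same record as a flat XML document."""
    root = ET.Element("record")
    for key, value in rec.items():
        # Lists become repeated child elements, scalars a single element.
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, key).text = str(item)
    return ET.tostring(root, encoding="unicode")


json_text = to_json(record)
xml_text = to_xml(record)
```

Both outputs carry the same information; a harvester can parse either one without human help, which is the point of agreeing on a standard format.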
A DOI is an online reference assigned to a digital resource (e.g. an article in a journal or research data) to give it a unique and permanent reference on the Internet. The DOIs are permanently connected to the digital resource – regardless of changes on websites or servers being shut down (in this case a DOI is simply rerouted to a new URL). The use of DOIs, for example, prevents the occurrence of dead links when publishers change the web address of a server. Among all the different ways to reference digital objects on the Internet permanently, DOIs have become the leading system when publishing text and data.
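The link between a DOI and a web address can be sketched in a few lines. The example below assumes the public resolver at https://doi.org/ and turns a DOI name, in several common spellings, into a resolvable URL. The helper names and the DOI used are hypothetical, and the syntax check is deliberately loose (it only looks for the "10." prefix and a slash).

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"

# Spellings that are commonly prepended to a DOI name.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi: str) -> str:
    """Strip whitespace and a leading 'doi:' or resolver prefix."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check only: DOI names start with "10." and contain "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    return doi


def doi_to_url(doi: str) -> str:
    """Build the resolver URL; characters unsafe in URLs are percent-encoded."""
    return RESOLVER + quote(normalise_doi(doi), safe="/")
```

For instance, `doi_to_url("doi:10.1234/example.data")` yields a URL under the resolver that stays valid even if the landing page it redirects to moves.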
An ORCID iD is a persistent identifier that uniquely identifies researchers and supports automatic links among all your professional activities when integrated into grant, data or manuscript submission workflows. In contrast to other author identifiers, such as ResearcherID or Scopus ID, ORCID is run by an international not-for-profit organisation that is institutionally independent and globally accepted.
The Crossref Funder Registry provides a means to precisely reference a funding organisation. It offers a common taxonomy of over 13,000 international funding organisation names together with DOIs for each entity. The registry is donated by Elsevier and is updated and reviewed approximately monthly.
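When ORCID iDs are collected in a submission workflow, typing errors can be caught before the iD is stored. The sketch below assumes the published ORCID iD structure: 16 characters in four hyphenated groups, the last character being a check character computed with ISO 7064 MOD 11-2. The function names are hypothetical, and passing this check only shows that an iD is well-formed, not that it is registered.

```python
import re

# Four groups of four; the final character may be a digit or "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD is well-formed and its checksum matches."""
    if not ORCID_PATTERN.match(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]
```

The iD 0000-0002-1825-0097, which ORCID uses for its fictional example researcher, passes this check, while changing its last digit makes it fail.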
GFZ Data Services has no special data format requirements for submissions. Wherever possible, community data standards and open data formats are preferred. Nevertheless, files submitted for publication may be converted by the editors (e.g. tab separated text files for tables). Please visit our data file instructions. We will be happy to advise you.
In general, the following applies: data should be exchangeable without barriers and readable by others. Ideal formats are non-proprietary, unencrypted and commonly known across your research community and are based on open, documented standards. If the problem of proprietary formats occurs, in particular in the case of commercial software, you may be able to convert the data into open, standardised formats. Open and common formats are always preferable to proprietary formats if they achieve the same results or can be used accordingly without much effort. However, some data loss may occur when converting to an open format. In this case, you could provide the data in both open and proprietary formats.
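A minimal sketch of such a conversion: assuming a table has already been read from its original format into memory, it can be written as tab-separated text with the units given in leading comment lines. The column names, values and the "#" comment convention are assumptions made for this example, not a format required by GFZ Data Services.

```python
import csv
import io

# Hypothetical table and unit descriptions.
rows = [
    {"depth_m": 10.0, "temperature_degC": 8.4},
    {"depth_m": 20.0, "temperature_degC": 8.9},
]
context = [
    "depth_m: depth below surface in metres",
    "temperature_degC: water temperature in degrees Celsius",
]


def to_tab_separated(table, context_lines):
    """Return the table as tab-separated text with a commented header."""
    buffer = io.StringIO()
    for line in context_lines:
        buffer.write(f"# {line}\n")
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(table[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


text = to_tab_separated(rows, context)
```

The resulting text can be opened with any editor or spreadsheet program, and the comment lines keep the units next to the values they describe.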
The Registry of Research Data Repositories (re3data) is a global registry of data repositories and portals that covers all academic disciplines. It presents repositories and portals for the permanent storage and access of research datasets to researchers, funding bodies, publishers and scholarly institutions and promotes a culture of sharing, increased access and better visibility of research data.
Datasets that have been assigned a DOI must not be changed. However, a new version of a DOI can be assigned when data and metadata are updated. Constantly growing dynamic datasets are a special case, such as the time series from a climatological station or a database to which new entries are added. Here, new data can be added to the existing dataset without changing the DOI, provided the already published data have not been changed. As soon as already published data are changed (e.g. after the removal of outliers or a recalibration), a new version of the DOI must be created. When a DOI gets a new version, this is indicated in both the original and the new version.
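The rule for growing datasets can be expressed as a simple comparison. The sketch below assumes a dataset is an ordered list of rows and that new data are only ever appended at the end; the function name and the example values are hypothetical.

```python
def needs_new_version(published: list, updated: list) -> bool:
    """Return True if the update alters or removes already published rows.

    Appending rows keeps the published part intact, so the DOI can stay.
    Any change to the published part calls for a new version.
    """
    if len(updated) < len(published):
        return True  # rows were removed
    return updated[:len(published)] != published


# Hypothetical time series of (date, value) rows.
published = [("2020-01-01", 1.2), ("2020-01-02", 1.4)]
appended = published + [("2020-01-03", 1.1)]
recalibrated = [("2020-01-01", 1.3), ("2020-01-02", 1.5)]
```

Here `needs_new_version(published, appended)` is false, because only a new row was added, while `needs_new_version(published, recalibrated)` is true, because published values changed.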
The IGSN (International Generic Sample Number) is a globally unique and resolvable persistent identifier (PID) for physical specimens that provides discovery functionality of digital sample descriptions via the internet. IGSNs can be assigned to e.g. rock, water and vegetation samples, sediment cores, as well as related sampling features (sites, stations, stratigraphic sections, etc.).
IGSN is governed by the IGSN e. V. and has been developed to (1) address requirements for reproducibility of sample-based data, (2) ensure discovery, access, and re-usability of samples and data derived from them, (3) recognize sample collection and curation as a scholarly contribution to the scientific community, and (4) improve data integration. IGSNs are citable, should be included in the data tables in scholarly literature and thereby close one of the last gaps for the full provenance of research results.
A persistent identifier (PID) is a long-lasting reference to a resource that is resolvable on the internet. PIDs are essential to uniquely identify data, software and literature, physical samples, people and organizations in the digital world. You may be familiar with Digital Object Identifiers (DOIs) or Open Researcher and Contributor IDs (ORCID iDs). A fairly new PID is the International Generic Sample Number (IGSN) for physical samples.
To be FAIR, digital data has to be Findable, Accessible, Interoperable, and Reusable. These principles emphasize machine-actionable data and metadata and aid new discoveries through the harnessing and analysis of multiple datasets. At the same time, the principles support knowledge discovery and innovation, data and knowledge integration, and foster sharing and reuse of data (ANDS).
The FAIR Principles
To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier (e.g. DOI)
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
To be Reusable:
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
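Some of these principles translate into simple checks on a metadata record. The sketch below is a rough illustration, not a FAIR assessment tool: it only tests whether a few fields are present, the mapping of fields to principles is an assumption made for this example, and the field names are hypothetical.

```python
# Each check maps a label to a test on a metadata record (a dict).
CHECKS = {
    "F1 persistent identifier": lambda r: bool(r.get("identifier")),
    "F2 title and description": lambda r: bool(r.get("title"))
    and bool(r.get("description")),
    "R1.1 usage license": lambda r: bool(r.get("license")),
    "R1.2 provenance": lambda r: bool(r.get("provenance")),
}


def missing_fair_fields(record: dict) -> list:
    """Return the labels of all checks the record does not pass."""
    return [name for name, check in CHECKS.items() if not check(record)]


# Hypothetical record without provenance information.
example = {
    "identifier": "10.1234/example-dataset",
    "title": "Example borehole temperature measurements",
    "description": "Hourly temperatures from a hypothetical borehole.",
    "license": "CC BY 4.0",
}

missing = missing_fair_fields(example)
```

For this record the only failed check is the one for provenance. Principles such as I1 to I3 or R1.3 depend on vocabularies and community standards and cannot be judged by the presence of a field alone.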