Many journals require that data supporting publications be made available. For example, Science is particularly precise in requiring that "All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science" and that "Citations to unpublished data and personal communications cannot be used to support claims in a published paper." Is this a policy that deserves broader adoption?
- Should data that are necessary to assess or replicate a peer-reviewed publication be cited?
- When should data collections be assigned their own citation, and when should they be cited by proxy through a related article?
- When is it appropriate for data collections to cite other data or articles?
- How persistent must cited data be? What evidence of persistence, if any, should be provided?
Citations to data often differ in location and format even within a single journal. Within a particular journal, should citations to data be systematized?
- Within a particular journal, should citations to data appear in a single place in every article?
- Should these citations be included with citations to articles and other works? Or should these be separated in the text, acknowledgements, substantive footnotes, or notes to tables and figures?
- Within a particular journal, should citations to data follow a consistent format?
- Citations to data sometimes contain only a single element, such as the title of the database used, and sometimes contain many elements, such as the file format of the object (e.g. SPSS), file type (e.g. "computer file"/"online database"), variables used, and retrieval date. What are the minimal elements necessary in a data citation to support location, assessment, credit/attribution, etc.?
- Is a persistent, globally unique identifier required? If so, must the identifier be of a particular form, such as a DOI?
- How durable must citations to data be? Should they be more or less durable than citations to other works?
- Are datasets required to have titles and authors?
- Must formatting/file type information be included?
- What additional information is not mandatory, but is most valuable for locating or assessing the data?
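To make the question of minimal elements concrete, the following Python sketch models a citation record with a few mandatory fields and a few optional ones. The class name, the field names, the choice of which elements are mandatory, and the output layout are all assumptions made for illustration. None of them is a standard or a position taken by the workshop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    # Hypothetical "minimal" elements; which ones are truly required is an open question.
    authors: List[str]
    title: str
    year: int
    identifier: str  # e.g. a DOI or another persistent identifier
    # Optional elements of the kind listed in the questions above.
    version: Optional[str] = None
    file_format: Optional[str] = None
    retrieved: Optional[str] = None  # retrieval date as an ISO 8601 string

    def format(self) -> str:
        """Render the citation as one line, skipping optional elements that are absent."""
        parts = [f"{'; '.join(self.authors)} ({self.year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        if self.file_format:
            parts.append(f"[{self.file_format}].")
        parts.append(f"{self.identifier}.")
        if self.retrieved:
            parts.append(f"Retrieved {self.retrieved}.")
        return " ".join(parts)
```

Making the optional fields default to `None` keeps a citation valid when only the mandatory elements are known, which is one possible way to separate "required" from "valuable but not mandatory".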
Attribution stacking has been raised as a concern for data citation in large integrated and federated databases. For integrated/federated data, what attribution is required for data citation?
- Must data sources included in the universe/population of analysis used in an integrated/federated database be cited, or only the subsample used?
- Can the aggregation be cited alone, or must attribution be given to each source as well?
- If attribution is given to each source, can this be done by reference? Is it sufficient to cite an aggregator if the aggregator cites all sources?
- What other provenance information is required in a citation?
- Who (e.g. citing author, cited work author, journal) should decide what attribution is necessary?
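One way to picture attribution "by reference" is as a walk over the sources an aggregator itself cites. The Python sketch below assumes a hypothetical mapping from each dataset to the datasets it aggregates and collects every original source reachable from a single cited aggregation. The function name, the mapping, and the rule that a dataset with no listed sources counts as original are assumptions for illustration only.

```python
from typing import Dict, List, Set


def resolve_sources(cited: str, sources_of: Dict[str, List[str]]) -> Set[str]:
    """Return every original source reachable from a cited dataset.

    `sources_of` maps a (hypothetical) dataset name to the datasets it aggregates.
    A dataset with no entry is treated as an original source.
    """
    resolved: Set[str] = set()
    seen: Set[str] = set()
    stack = [cited]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and sources reached more than once
        seen.add(current)
        children = sources_of.get(current, [])
        if children:
            stack.extend(children)
        else:
            resolved.add(current)
    return resolved
```

Under these assumptions, citing the aggregation alone would be sufficient only if every link in the chain is recorded and remains resolvable, which is itself one of the questions above.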
Current indexing services and tracking tools often do not support queries or analysis of the data citations currently included in articles (even if these include DOIs), and many strip out identifiers to non-traditional works. What is necessary to extend tools designed for citing and tracking use of other works for data citation?
- Should data citations be indexable?
- Would it be useful for journal indexing and citation services to track data sets as well? Or is a separate service preferable?
- What are the technical and institutional barriers to including data citations in existing tools and services?
- What additional tools, not currently in existence, are needed for, or would add substantial value to, data citations?
Different disciplines may have disparate needs for the granularity at which digital “objects” are identified. What are the differences among disciplines that need to be addressed distinctly?
- Is there a feasible, non-discipline-specific way of providing deep/granular citation to datasets? If not, how is interdisciplinary research data cited?
- Is there a minimum granularity required for any data citation?
- Who (e.g. citing author, cited work author, journal) decides what minimum granularity is required for citation?
Datasets are more dynamic than traditional documents. When and how should a specific version be cited?
- Must a citation have a date? If so, is a date of retrieval sufficient?
- When should a contribution or change to data require a version update? How should this be controlled or labeled?
- Must a citation to later versions of the data include version acknowledgement/information in addition to a date?
- When should a citation to a dataset contain fixity information that can be used to validate the version used?
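As an illustration of what fixity information could look like, the Python sketch below computes a cryptographic digest of a file's bytes and checks a local copy against a digest recorded in a citation. The function names and the `algorithm:hexdigest` string layout are assumptions for this example. A byte-level digest also changes whenever the file format changes, even if the data values do not, so it validates one particular file rather than the dataset's content.

```python
import hashlib


def file_fixity(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return '<algorithm>:<hex digest>' for the file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"{algorithm}:{digest.hexdigest()}"


def matches_citation(path: str, cited_fixity: str) -> bool:
    """Check a local copy against the fixity string recorded in a citation."""
    algorithm, _, _ = cited_fixity.partition(":")
    return file_fixity(path, algorithm) == cited_fixity
```

Recording the algorithm name alongside the digest lets a reader recompute the value without guessing how it was produced.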
Data citation and replicable research are more often stated in policy than followed in practice. What incentives and interventions would be most useful to align practice with policy, and encourage appropriate data citation?
- What are possible opportunities for professional societies?
- What are possible opportunities for journal editors, reviewers and publishers?
- What are possible opportunities for authors?
- What are possible opportunities for third-party services?
What research is required to understand the impact, incentives, requirements, etc. related to data citation?
Expected Workshop Report Themes
- identify consensus areas for data citation
- identify a set of challenge areas for research
- identify gaps and opportunities for infrastructure development
- How to cite curated databases and how to make them citable
- A Proposed Standard for the Scholarly Citation of Quantitative Data
- We Need Publishing Standards for Datasets and Data Tables
- ISO Draft on Language resource management -- Persistent identification and sustainable access
- DataCite Metadata Kernel Draft
- PERSID Project Final Report on State of the Art in Identifier Systems
- Data Publication (manuscript submitted to DCC)
- Dryad's Data Citation Resource Page
- Presentations from the IZA Workshop on Persistent Identifiers for Social Science Data
- Claddier Project Recommendations for Linking and Citations