Data curation template

Suggested Citation

Suggested Citation: Data Curation Network (2022). "CURATE(D) Steps and Checklist for Data Curation, version 2" http://z.umn.edu/curate.

CHECK Step

Check data files/code and read documentation

In this step we secure the dataset by inventorying and reviewing the contents, applying local appraisal and selection criteria. Common CHECK steps include:

Review to ensure data is in scope for the repository
Inventory the contents of the data files (e.g., open and sample the files or code)
Verify all metadata provided by the researcher; check available documentation

Key Ethical Considerations

Review participant agreement and data use agreements; examine potential impacts of sharing this data. Consider:
- Individuals and communities represented
- Representativeness of diverse human populations Protection or endangerment status of species
- Geographic locations (e.g., contested boundaries, historical and current political situations)
- Animal research ethics and approval
Is it possible that the dataset may impact a specific group?
Does this dataset follow compliance & institutional policy?

Essential Tasks

UNDERSTAND Step

Understand the data (or try to)

In this step, examine the dataset closely to understand what it is, how the files interrelate, and what information is needed to reuse. Common UNDERSTAND steps include:

Check for quality assurance and usability issues such as missing data, ambiguous headings, code execution failures, and data presentation concerns
Try to detect and extract any "hidden documentation" inherent to the data files that may facilitate reuse or expose unintended information
Determine if the documentation of the data is sufficient for a user with similar qualifications to the researchers to understand and reuse the data. If not, recommend or create additional documentation (e.g., a README.txt template)

Key Ethical Considerations

If working with human data, is this research done with and not on communities and populations involved? (You may wish to review data sources, researchers, and their connections to the communities and subjects they are serving to facilitate further conversation with researcher(s).)
- Are there authoritative group representatives who should be contacted in the next (request) step?
Are there labels or other descriptive indicators that could be applied to better represent or protect an identified group of people impacted by this dataset? (Example: TK labels)

Essential Tasks

Examine files, organization, and documentation more thoroughly.
- Are there changes that could enhance the dataset?
- Are there missing data?
- Could a user with similar qualifications to the author's understand and reuse these data and reproduce the results?
- Are the data, documentation and/or metadata presented in a way that aids in interpretation? (e.g., README Example)
Record all questions and concerns in Curation Log.

Tasks vary based on file formats and subject domain. Sample tasks based on format:

Tabular Data (e.g. Microsoft Excel) Questions:

Check the organization of the data–is it well-structured?
Are headers/codes clearly defined?
Is quality control clearly defined?
Is methodology clear and sufficient?

Database(s) Questions:

Is there documentation on tables, relationships, queries, etc?
Can the data be exported (to CSV(s), TXT or other) easily?
Which tables or queries are the relevant ones used in a publication?

Code Questions:

Does the provided code execute without errors?
Is the code commented, i.e., did the author provide descriptive information on sections of code?
Is data for input missing? Are environmental conditions and parameters noted? Is it clear which language(s) and version(s) are used?
Does the code use absolute paths or relative paths? If absolute paths, is this documented in the README?
Are packages or additional libraries used? If so, is this noted with clear use instructions?
Are any data organized consistently for access by the code?
Is there an indication of whether the depositor intends users to be able to run the code and reproduce results, or just see the process used?

To view additional UNDERSTAND steps based on format, view the following primers:

Acrobat PDF Primer
ATLAS.ti Primer
Confocal Microscopy Image Primer
Geodatabase Primer
GeoJSON Primer
Jupyter Notebook Primer
Microsoft Access Primer
Microsoft Excel Primer
netCDF Primer and
Tutorial using NCAR dataset
SPSS Primer
STL Primer
R Primer
Tableau Primer (Twitter primer?)
Wordpress.com Primer

REQUEST Step

Request missing information or changes

In this step, generate a list of questions to help the researcher fix any errors or issues and enrich the usability of the data. Common REQUEST steps include:

Triage and prioritize issues.
Identify and highlight those with the highest data reuse implications
Convey a sense of urgency, as there it becomes more difficult to get responses from researchers as time passes.
Collaborate with the researcher(s) to make necessary changes
Communicate any changes you, the curator, will make on their behalf
Pause and consider how best to frame and communicate requests. This should be the start of a conversation.

Key Ethical Considerations

Consider asking researchers if their participants will be notified that their data (in addition to published results) are being shared.
If you feel uncomfortable about sharing the data in its current state and/or it does not meet your institution's requirements, reserve the right not to publish.
Consider asking researcher(s) if there are limitations to how data could/should be used to include in documentation. (Based on, e.g., representativeness of sample).

Essential Tasks

Ask about additional data contributors, beyond publication authors. Consider using the Contributor Roles Taxonomy to communicate this: https://casrai.org/credit/
Summarize conversations / outreach efforts in Curator Log

Sample email to researcher:

Dear [*name of the data set author or contact*],  

Thank you for depositing your data set, [*title of the data set*], to [*name of repository*].  

After we receive a data set, we review it to ensure that the data we host is as complete and understandable as possible. We have reviewed your data set and have the following recommendations for you:  

- Recommendation #1 
- Recommendation #2 
- Recommendation #3 
- Recommendation #4   

We look forward to hearing your response. Please let us know if you have any questions about our recommendations. We would be happy to talk with you or meet in person to discuss our review of your data, if you would like 

Sincerely,

[*Name of Curator*]

AUGMENT Step

Augment the dataset

In this step we ensure metadata conforms to repository and/or appropriate discipline standards; adjust metadata to improve findability and accessibility; and improve documentation to make data more understandable, interoperable and reusable. Common AUGMENT steps include:

Enhance metadata to best facilitate discoverability, such as by ensuring datasets have a persistent identifier.
Create and apply metadata for the data record, including descriptive keywords
When appropriate, structure and present metadata in domain-specific schemas to facilitate interoperability with other systems
Implement any other agreed-on enhancements to metadata or documentation following discussion with researcher

Key Ethical Considerations

Make sure bibliographic information reflects the correct author attribution.
Ensure any augmentation by the depositor to resolve ethical questions from previous steps is completed.

Essential Tasks

Transform file formats

In this step, consider the file formats in the dataset to make them more interoperable, reusable, preservation friendly, and non-proprietary when possible.¹ Common TRANSFORM steps include:

Identify specialized file formats and their restrictions (e.g., Is the software freely available? If so, link to it or archive it alongside the data)
Propose open-source or more reusable formats when appropriate
Retain original file formats

Footnote 1

¹See Cornell's list of preservation format recommendations: http://guides.library.cornell.edu/ecommons/formats

Key Ethical Considerations

Consider how best to navigate researcher bandwidth limitations and ownership of data with repository commitments to reducing barriers to reuse.
Decide how to balance the potential benefits of transformation with the risks of mistakes and loss of content/context, especially if the curator or repository will be performing transformation. Document the decision.

Essential Tasks

Check whether preferred file formats are in use
- If not, recommend conversion
- Retain original formats
Check whether software needed is readily available
- Suggest open-source options, if applicable and appropriate
- Ensure software and software version is documented
Convert any data visualization(s) that are not accessible (e.g., R visualizations, which need to be converted for screen reader use, or visualizations that do not meet color contrast guidelines) Reorganize files as appropriate
Standardize file names
Record any transformations in Curator Log

EVALUATE Step

Evaluate and rate the dataset

In this step, review the dataset and companion data record against international standards, including FAIR,² CARE,³ and FATE.⁴ Common EVALUATE steps:

Score the dataset and recommend ways to increase the FAIRness of the data
Review data for ethical concerns in line with CARE and FATE

Footnote 2

² Rubric evaluating the FAIR principles are based on the scoring matrix by Dunning, de Smaele, & Böhmer (2017).
³ CARE principles: https://www.gida-global.org/care ⁴ FATE in AI: https://www.microsoft.com/en-us/research/theme/fate/

Key Ethical Considerations

Final review--remember it is not too late to surface any ethical concerns.
Verify you are not using racist/harmful words/language.
Remind the submitter of their responsibility if they choose to ignore requests for de-identification or similar concerns.

Essential Tasks

Test that files successfully download
Check that any transformations didn't introduce problems
Review final state of data and record with researcher before publication
Add any final changes to Curator Log

This is a sample checklist for evaluating datasets against a set of principles.

FAIR evaluation

Findable:

Metadata exceeds researcher/ title/ date.
There is a unique Persistent ID (DOI, Handle, PURL, etc.).
Data/record is discoverable via web search engines.

Accessible:

Data/ record are retrievable via a standard protocol (e.g., HTTP).
Data/ record are free, open (e.g., via a download link).

Interoperable:

Metadata is formatted in a standard schema (e.g., Dublin Core). Metadata is provided in machine-readable format (OAI feed).

Reusable:

Data include sufficient metadata and supporting documentation about the data characteristics for reuse.
A way to contact the researcher directly for further questions is provided
There are clear indicators of who created, owns, and stewards the data.
Data are released with clear data usage terms (e.g., a CC License).

DOCUMENT Step

Document curation activities throughout

In the Curator Log mentioned throughout this guide, record the significant treatments or actions applied to the dataset. This is for your archival record keeping (distinct from documentation the researcher(s) created to accompany their own datasets). DOCUMENT requires:

Recording all information relevant to the tracking and administration of the deposit, about who did what to the dataset and when
Tracking communication with the researcher(s)

Key Ethical Considerations

Document that disclosure risk review has taken place. State if changes from original data have been made, but do not give enough detail on changes to reverse-engineer any anonymization.
Include consent (or waiver) and/or IRB approval of sharing with administrative documentation.
Consider collecting contributor demographics.

Essential Tasks

Ensure the following information is captured in the Curator Log:
- Activities taken during the CURATE process
- Accessioning & deposit records (Names, dates, contact information, submission agreements, etc.)
- Repository collection metadata
- Provenance logs (changes by curators in the Transform step)
- Service workflow
- Correspondences and other interactions
- Preservation packaging
- Any additional requirements at your institution

CHECK Step​

Key Ethical Considerations​

Essential Tasks​

UNDERSTAND Step​

Key Ethical Considerations​

Essential Tasks​

REQUEST Step​

Key Ethical Considerations​

Essential Tasks​

AUGMENT Step​

Key Ethical Considerations​

Essential Tasks​

TRANSFORM Step​

Footnote 1​

Key Ethical Considerations​

Essential Tasks​

EVALUATE Step​

Footnote 2​

Key Ethical Considerations​

Essential Tasks​

FAIR evaluation​

DOCUMENT Step​

Key Ethical Considerations​

Essential Tasks​

CHECK Step

Key Ethical Considerations

Essential Tasks

UNDERSTAND Step

Key Ethical Considerations

Essential Tasks

REQUEST Step

Key Ethical Considerations

Essential Tasks

AUGMENT Step

Key Ethical Considerations

Essential Tasks

TRANSFORM Step

Footnote 1

Key Ethical Considerations

Essential Tasks

EVALUATE Step

Footnote 2

Key Ethical Considerations

Essential Tasks

FAIR evaluation

DOCUMENT Step

Key Ethical Considerations

Essential Tasks