CURATE(D): Checklist for Data Curation
Suggested Citation: Data Curation Network (2022). "CURATE(D) Steps and Checklist for Data Curation, version 2" http://z.umn.edu/curate.
About
The following information pertains to the curation of this dataset: it describes the internal processes and procedures used to prepare the data and documentation for import into a public repository. This document is intended only for readers who want to understand the curation process; it is not particularly useful for those who simply wish to analyze the data in this repository.
ToDo items
- add N values and file size to the dataset files list of the README (just under the file name)
- add a blurb about the total JoCoHS participation and why this subset is smaller
- add a method to check for categorical values listed in the metadata that are not actually used in the data, so we can remove them and conserve space
- change the empty values to -99 and include in the metadata
- update the user notebook and instructions.md
- include the curation log in the User Guide for the dataset
- in the 'Are there missing data?' step, we should add extra questions to consider, such as: which questions did you ask to help determine whether there was missing data? Include a way to add questions and answers to those questions within the log, similar to a ToDo list
- resolve the image count discrepancy (NOTE: there was at least one participant who had US images but no survey data and incomplete clinic data)
- write a script that generates a README file for the JoCoKnow User Guide from the public notebook
- IMPORTANT: name the public README notebook file so that a script can convert the notebook to a Markdown file and load it into the User Guide, and add a metadata reference to the Markdown from the dataset (as one of the first items users see with the dataset); it will be easier for users to read the dataset introduction from the User Guide than from the clunky Dataverse GUI (a conversion sketch follows this list)
- include notes from private curation CHANGELOG
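The conversion mentioned above can be scripted with nbconvert's Python API. This is a minimal sketch, assuming the public notebook is named README.ipynb (the actual file name is still to be decided):

```python
# Minimal sketch: convert the public README notebook to Markdown so it can
# be loaded into the User Guide. "README.ipynb" is an assumed file name.
from nbconvert import MarkdownExporter

body, _resources = MarkdownExporter().from_filename("README.ipynb")
with open("README.md", "w", encoding="utf-8") as f:
    f.write(body)
```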
CHECK Step
- Begin Curator Log to track curation decisions
- This curation log is not in the format I would prefer, but my curation-log automation program is still being designed.
- Since we are using the Dataverse API to upload and update files in our dataset via a Jupyter notebook, I need to write and test the Python scripts to ensure the processes I will use are reproducible. I created a reusable Python package (https://github.com/kuhlaid/DvApiMod5.13) and testing notebook (https://github.com/kuhlaid/dv-api-test) to ensure I could test all API calls. I then integrated the DvApiMod5.13 package into my existing curation notebook from a year ago which required some rewriting of code.
- The process for publishing the dataset is somewhat convoluted, since we have one process to upload files to the dataset and another to verify that the upload was successful. We do it this way because reproducible automation scripts update our dataset more consistently than manual updates to the repository would. A 'private' notebook processes identifying data, anonymizes it, and uploads the data files to the repository. A secondary 'public' notebook, which is provided with the dataset, generates a README file that describes the dataset files and data-table properties using the data uploaded to the repository, checking that our data exist in the repository as we expect. Finally, we return to the 'private' notebook to remove the dataset draft and upload the final copy, which contains the latest generated README notebook and our final curation log. We then push copies of the README, curation log, and metadata for the dataset to a Docusaurus documentation site. At this point we review and publish the dataset, or rerun the whole process as needed. (A sketch of the underlying API calls follows.)
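For reference, the repository side of this workflow boils down to a few Dataverse native API calls. This is a hedged sketch using plain requests rather than the DvApiMod5.13 package; the base URL, DOI, token, and file names are placeholders:

```python
# Hedged sketch of the Dataverse native API calls behind the workflow above.
import json
import requests

BASE = "https://dataverse.example.edu"   # placeholder Dataverse instance
DOI = "doi:10.0000/FK2/EXAMPLE"          # placeholder dataset persistent ID
HEADERS = {"X-Dataverse-key": "YOUR-API-TOKEN"}

# 1. Upload a file to the draft, with a directoryLabel 'hierarchy tag'.
file_meta = {"description": "Example data table", "directoryLabel": "data/tables"}
with open("df_example.csv", "rb") as fh:
    r = requests.post(
        f"{BASE}/api/datasets/:persistentId/add",
        params={"persistentId": DOI},
        headers=HEADERS,
        files={"file": fh},
        data={"jsonData": json.dumps(file_meta)},
    )
r.raise_for_status()

# 2. To discard a bad draft and start over:
# requests.delete(f"{BASE}/api/datasets/:persistentId/versions/:draft",
#                 params={"persistentId": DOI}, headers=HEADERS)

# 3. After verification, publish the draft as a major version.
r = requests.post(
    f"{BASE}/api/datasets/:persistentId/actions/:publish",
    params={"persistentId": DOI, "type": "major"},
    headers=HEADERS,
)
r.raise_for_status()
```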
- Open the related article and supporting information if available
- Inventory the dataset
- Identify file formats
Images were originally captured in JPEG format, and the image copies were processed through automated Photoshop scripts that corrected the image metadata, anonymized the images, and saved the final images in PNG format. Data tables are saved in CSV format so they can be easily parsed by scripts.
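A quick way to verify this format inventory on a working copy (the directory name is a placeholder):

```python
# Count files by extension in the working copy to confirm the expected
# formats (PNG images, CSV data tables). "working_copy" is a placeholder.
from collections import Counter
from pathlib import Path

counts = Counter(p.suffix.lower() for p in Path("working_copy").rglob("*") if p.is_file())
print(counts)
```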
- Review file organization, hierarchy, and naming convention(s)
See the next step on image organizing. All files have a hierarchy 'tag' defined for them, so while there is no explicit hierarchy within a Dataverse dataset, the tags mimic a hierarchy within a 'tree view'. Images were named using the categorical numbering assigned to the image types, in part to direct users to the README file for details. The main README file is a Jupyter notebook.
- Extract zip files when possible
Since the Dataverse lumps dataset files into a flat structure by default, it would have been difficult for users to find the README file in a dataset file list with thousands of images. For this reason I double-zipped image files into archives based on image position (single-zipped files would automatically extract all images into the root dataset directory, which I did not want). This meant only ~20 files were listed in the dataset instead of 11,000. Archiving the images this way also enables users to download only the types of images they are interested in, without downloading the entire dataset (which is rather large). A minimal sketch follows.
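A minimal sketch of the double-zip approach, assuming one folder per image position (folder and archive names are placeholders):

```python
# Zip each image-position folder, then wrap the inner zips in one outer
# archive; since the repository auto-extracts a single zip on ingest, the
# extra layer keeps the images out of the root of the dataset.
import zipfile
from pathlib import Path

inner_zips = []
for position_dir in sorted(Path("images").iterdir()):  # one folder per position
    if position_dir.is_dir():
        inner = Path(f"{position_dir.name}.zip")
        with zipfile.ZipFile(inner, "w", zipfile.ZIP_DEFLATED) as zf:
            for img in position_dir.glob("*.png"):
                zf.write(img, arcname=img.name)
        inner_zips.append(inner)

with zipfile.ZipFile("us_images.zip", "w") as outer:
    for inner in inner_zips:
        outer.write(inner, arcname=inner.name)
```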
- Create working copy of files for formal inventory and testing
- See the Jupyter notebook for steps used to copy images and process the copies.
- Text-based data were copied directly from the sources where they were collected (database or survey tool), and the processing notebook was designed to retrieve the latest copies of the data automatically.
- Examine code for obvious errors/missing components, etc.
Any code for this dataset exists only for the curation and documentation process, via Jupyter notebooks.
- Check that metadata quality is rich, accurate, and complete to institutional requirements.
- Check documentation type
README, Codebook, Metadata, and Data Dictionary
- Complete
- Check whether human subject data (data about humans regardless of IRB determination) is present. If so,
- Request consent form / participation agreement if not present
- If the data are not de-identified, document for the "Request" step.
Raw data are not de-identified. The data were de-identified during data processing.
- Check the accessibility of all files
- Ensure there are robust descriptions in plain text of data files and any images.
- Check whether any visualization(s) of data are easily accessible
NA
UNDERSTAND Step
Essential Tasks
- Examine files, organization, and documentation more thoroughly.
This was an iterative process that cycled between writing code to generate dataset documentation and the files and metadata needed to create it. The main driver was the metadata JSON file, which contains the data to initialize a dataset, determines the files to be imported into the repository, defines the data-table variables and categorical values, and holds other descriptive information about the dataset. As scripts were modified to parse the metadata and generate new documentation files, the generated files were imported into the repository and documentation website to ensure optimal accessibility and organization. We discussed empty data-table cells with the team and decided on a value of -99 instead of NULL, since that keeps numeric columns treated as numeric rather than string (as they would be if NULL were used). A sketch of this convention follows.
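A minimal illustration of the -99 convention (frame and column names are hypothetical):

```python
# Replace missing values with -99 before export so numeric columns contain
# only numbers (no empty cells) in the published CSV files.
import pandas as pd

df = pd.DataFrame({"score": [1.0, None, 3.0]})
df = df.fillna(-99)
df.to_csv("df_example.csv", index=False)
```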
- Are there changes that could enhance the dataset? NA
- Are there missing data?
No. This dataset is a subset of the individual data collection instruments and image archives it draws from, which may themselves contain additional records not specific to this subset.
- In this task we asked our own sets of questions to check for missing data.
- Do the participants who have US images appear in the final data tables? Answer: Yes.
- Does the number of participants with KL grades (`df_PaKneeKlGrades`) match the number of participants with images? Answer: Yes. Scripts compared the subjects with images and KL grades.
- Are any scripts truncating the data?
Not exactly. One of the records in the Qualtrics pain, aching, and stiffness survey was flagged as 'ignore' and needed to be manually corrected. Also, the `dftemp2` dataFrame is missing some participants. This is likely due to mismatches between `df_qPrivateSubjectEligibility` response IDs and the subject ID reference data (which I corrected using the `combineDemographicData` method). There was one participant who had US images but did not participate in any of the online surveys and had only partial clinic data, and another participant who had everything else except US images (so our total should be 881 - see SubjectExclusions.xlsx). A comparison sketch follows this task list.
- Could a user with similar qualifications to the author's understand and reuse these data and reproduce the results?
NA (no analysis data provided)
- Are the data, documentation and/or metadata presented in a way that aids in interpretation? (e.g., README Example)
Yes; we provide both machine-readable data and human-readable documentation.
- Record all questions and concerns in Curation Log.
We were concerned about data-encoding checks, but using VSCode with UTF-8 as the default encoding, plus Jupyter notebook processing via Python, ensured UTF-8 encoding for text files. Researchers were concerned about dataset citation versions in the Dataverse, but this is out of scope for us to handle since the Dataverse controls the versioning.
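The missing-data questions above were answered with simple ID set comparisons; a hedged sketch (file and column names are hypothetical):

```python
# Compare participant ID sets between two sources, e.g. image records vs.
# KL grades, and print any subjects that appear in one but not the other.
import pandas as pd

images = pd.read_csv("df_images.csv")
grades = pd.read_csv("df_PaKneeKlGrades.csv")

image_ids = set(images["subject_id"])
grade_ids = set(grades["subject_id"])
print("images without KL grades:", sorted(image_ids - grade_ids))
print("KL grades without images:", sorted(grade_ids - image_ids))
```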
Tasks vary based on file formats and subject domain. Sample tasks based on format:
Tabular Data (e.g., Microsoft Excel) Questions:
- Check the organization of the data–is it well-structured?
- Are headers/codes clearly defined?
The metadata file fully describes the data and variables
- Is quality control clearly defined?
- Is methodology clear and sufficient?
Database(s) Questions:
- Is there documentation on tables, relationships, queries, etc?
NA
- Can the data be exported (to CSV(s), TXT or other) easily?
Yes via API or web interface
- Which tables or queries are the relevant ones used in a publication?
NA
Code Questions:
- Does the provided code execute without errors?
The code was tested for errors and none were thrown at the time of publication. (Code can become deprecated over time and could throw errors in the future, but the code is only important for generating the documentation.)
- Is the code commented, i.e., did the author provide descriptive information on sections of code?
Yes
- Is data for input missing? Are environmental conditions and parameters noted? Is it clear which language(s) and version(s) are used?
NA
- Does the code use absolute paths or relative paths? If absolute paths, is this documented in the README?
Relative
- Are packages or additional libraries used? If so, is this noted with clear use instructions?
Code is provided only to better describe the data, but it contains clear instructions.
- Are any data organized consistently for access by the code?
NA
- Is there an indication of whether the depositor intends users to be able to run the code and reproduce results, or just see the process used?
Code is provided only for advanced users wishing to use it in their own projects and is not essential to the use of the dataset.
REQUEST Step
Essential Tasks
- Ask about additional data contributors, beyond publication authors. Consider using the Contributor Roles Taxonomy to communicate this: https://casrai.org/credit/
- Summarize conversations / outreach efforts in Curator Log
Reached out to PIs to review the dataset before final publication. Received PI and researcher suggestions for metadata updates, so I am updating the metadata file, removing the dataset draft from the repository, and uploading the revisions. I am also adding scripts to include the variable labels in the statistics for the README. Once the new dataset draft is uploaded, I regenerate the README file based on the latest metadata pushed to the repository.
AUGMENT Step
Essential Tasks
- Review information received from the researcher from initial deposit and all subsequent conversations
- Update, as appropriate:
- Metadata
Robust metadata with embedded schemas
- Documentation (README, Codebook, Data Dictionary, Other)
- Replacement files
- Organization and Arrangement of files
Added directory labels to files so they can be organized in a hierarchy ('tree') view or a table view
- Documentation of file organization, hierarchy, and naming convention(s)
Defined in Metadata file
- Facilitate dataset discovery:
- Add links to related publications, grants, reports, source data, etc.
- Provide additional description of files as appropriate for external indexing or other purposes.
- Add subject terms (a keyword metadata sketch follows this list)
- Ensure keywords are sufficient and representative
- Record all changes in the Curation Log
- Provide suggestions to improve accessibility of content (e.g., alt-text or additional descriptions; color contrast; etc.)
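For reference, a subject term (keyword) entry in the Dataverse citation metadata block takes roughly the following shape; the keyword value is illustrative only:

```python
# Illustrative structure of a keyword field in Dataverse dataset metadata
# (metadataBlocks.citation.fields); the value is a made-up example.
keyword_field = {
    "typeName": "keyword",
    "multiple": True,
    "typeClass": "compound",
    "value": [
        {
            "keywordValue": {
                "typeName": "keywordValue",
                "multiple": False,
                "typeClass": "primitive",
                "value": "osteoarthritis",
            }
        }
    ],
}
```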
TRANSFORM Step
Essential Tasks
- Check whether preferred file formats are in use
- Yes, using FAIR file formats.
- Retain original formats
- The original data this dataset is derived from are preserved as part of the project's data curation
- Check whether software needed is readily available
- Dataset files may be analyzed using open Python or other popular tools
- Suggest open source options, if applicable and appropriate
- Jupyter notebooks and Python are useful for analyzing the data, as Python has a strong user community and Jupyter provides an environment to both run code and document your steps
- Ensure software and software version is documented
NA
- Convert any data visualization(s) that are not accessible (e.g., R visualizations, which need to be converted for screen reader use, or visualizations that do not meet color contrast guidelines)
- Reorganize files as appropriate
- Standardize file names
- Record any transformations in Curator Log
EVALUATE Step
Essential Tasks
- Test that files successfully download (a download sketch follows this list)
- Check that any transformations didn't introduce problems
- Review final state of data and record with researcher before publication
- Add any final changes to Curator Log
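Download checks can be automated against the Dataverse access API; a minimal sketch (base URL and file ID are placeholders):

```python
# Verify that a dataset file downloads successfully via the access API.
import requests

BASE = "https://dataverse.example.edu"  # placeholder
FILE_ID = 12345                         # placeholder database ID of a file

r = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}", timeout=60)
r.raise_for_status()
print(len(r.content), "bytes downloaded")
```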
FAIR evaluation
Findable:
- Metadata exceeds researcher/title/date.
- There is a unique Persistent ID (DOI, Handle, PURL, etc.).
- Data/record is discoverable via web search engines.
Accessible:
- Data/record are retrievable via a standard protocol (e.g., HTTP).
- Data/record are free, open (e.g., via a download link).
Interoperable:
- Metadata is formatted in a standard schema (e.g., Dublin Core).
- Metadata is provided in machine-readable format (OAI feed).
Metadata requested from the Dataverse repository is available in a variety of formats. The main dataset metadata file also includes an embedded schema that describes the JSON metadata.
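For example, the published metadata can be requested in a standard schema through the export API (the DOI is a placeholder; other exporters include ddi, dcterms, and dataverse_json):

```python
# Fetch the dataset metadata as schema.org JSON-LD from the export API
# (works on published datasets).
import requests

BASE = "https://dataverse.example.edu"  # placeholder
DOI = "doi:10.0000/FK2/EXAMPLE"         # placeholder

r = requests.get(
    f"{BASE}/api/datasets/export",
    params={"exporter": "schema.org", "persistentId": DOI},
)
r.raise_for_status()
print(r.json())
```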
Reusable:
- Data include sufficient metadata and supporting documentation about the data characteristics for reuse.
JSON schemas are embedded in the JSON metadata files.
- A way to contact the researcher directly for further questions is provided
Contact form provided in the Dataverse dataset as well as the User Guide
- There are clear indicators of who created, owns, and stewards the data.
- Data are released with clear data usage terms (e.g., a CC License).
DOCUMENT Step
Essential Tasks
- Ensure the following information is captured in the Curator Log:
- Activities taken during the CURATE process
- Accessioning & deposit records (Names, dates, contact information, submission agreements, etc.)
- The initial deposit to the draft state tests the import and checks that the expected files are imported
- The Jupyter notebook is then tested and adjusted to work with the data in the draft state
- Repository collection metadata
- Provenance logs (changes by curators in the Transform step)
- Service workflow
- Correspondences and other interactions
- Preservation packaging
- Any additional requirements at your institution