Criteria for evaluating or creating data repositories
Background
Successful data sharing requires putting data where it can be discovered and reused. These places are data repositories; persistent, findable, searchable entities that provides infrastructure for long-term storage and access to data. A repository should provide for data publication by data holders, as well as access for using/reusing data. It should not be a place into which data is contributed without any way to retrieve it in the future. Intertidal has developed this list of repository evaluation criteria that synthesizes multiple sources (see Resources and Guidelines), and indicates “should have” criteria (no asterisk) and “nice to have” criteria (asterisk).
Individual researchers and data holders may already have preferred subject matter repositories where they regularly publish and access data (e.g. the Global Biodiversity Information Forum, aka GBIF). However, for projects that span a wide range of fields and data types (e.g. eDNA, underwater video, satellite images, socioeconomic data) it’s important to make sure all data are discoverable, accessible, well-documented, and preserved across all partners. Creating a recommended repository list can support project partners by:
- Simplifying choices for where to publish and/or search for data
- Identifying repositories that meet specific project needs, such as supporting the CARE principles for Indigenous Data Sovereignty or providing custom metadata tags that feed into a project data catalog
- Avoiding confusion between data repositories that provide long-term data access and storage, and other data tools and platforms (e.g. ERDDAP) that do not
Since repositories often have data structure requirements, it’s valuable to know where data will eventually be published before data collection starts, so processes can be set up accordingly (see DMSP guide).
This list is designed for discussion with a project team or organization, ideally one including data managers, data users, and researchers, so that each project or organization can make transparent decisions about which criteria are important to them and where best to share their data. There may not be a perfect solution for every data type, and in choosing a repository list the team will have to decide what options are “good enough.” Documenting the repository selection process and guiding partners to long-term storage and access solutions contributes to ocean data stewardship and making more ocean data FAIR, useful, and (re)used.
Technical Criteria
- Unique Persistent Identifiers (PIDs)
- Assigns PIDs to datasets
- Each PID points to a persistent landing page
- Security and Integrity
- Have documented criteria for preventing unauthorized access, modification, or release of data
- Have security levels appropriate to the sensitivity of data
- Implement data anonymization as appropriate
- Confidentiality as necessary
- Ensure administrative, technical, and physical safeguards
Data Criteria
- Common formats
- Enable easier dataset and metadata to be download, access, or export
- Support widely used and non-proprietary formats for data and metadata
- Data types
- Clear communication of the types of data that can be ingested or stored
- Data Connections
- Clearn communication of the API standards or other requirements for connected data sources
Management / Curation Criteria
- Long-term Sustainability
- Provide long-term management of data
- Maintain availability of datasets
- Have stable technical infrastructure
- Have stable funding
- Have contingency plan for natural disasters and other events
- Curation and Quality Assurance
- Ensures datasets have metadata
- Standardized quality of record or dataset title
- Informative, short, descriptive abstracts or summaries about a dataset or record
- *Provide or enable others to provide expert curation
- Retention Policy
- Have a policy for data retention
Documentation Criteria
- Metadata Support
- Use schemas appropriate to the community
- Include search keywords to make datasets discoverable
- *Include Local Contexts or other Indigenous Knowledge labels as appropriate
- Provenance Support
- Record the origins, chain of custody, and modifications of the data
- *Link datasets to provide context and aid in discoverability
Data Use / Reuse / Sharing Criteria
- Discoverability
- Use PIDs
- *Repository is federated with other repositories
- *Repository is listed in a federated catalog
- Free and Easy Access
- Provide no cost to access datasets and metadata
- Support broad, equitable, open access
- Provide timely access after submission
- Broad and Measured Reuse
- Have broadest possible terms of reuse
- Measure attribution, citation, and reuse
- Clear Use Guidance
- Clearly documented terms for access and reuse
Management & Governance Criteria
- Supporting restricted access
- Have a roles-based access management process
- Describe parameters that can be defined for each role
- Keep access logs
- Maintain privacy, confidentiality, tribal sovereignty, and protection of sensitive data
Pricing Criteria
- Account Pricing Information
- Storage size / data amount
- Number of users
- Specific features or services (e.g. visualization)
- Availability of discounts
- *Volume discounts
- *Nonprofit organization discounts
Certification Criteria
- *Repository is certified by CoreTrustSeal
Resources and Guidelines
FAIR Criteria
CARE Criteria
White House Guidelines (aka NSTC guidelines)
NIH guidelines
Center for Open Science re: guidelines
Belmont Forum
OpenAIRE PLOS
Taylor and Francis
Swiss NSF
FAIRsharing and DataCite
COAR
O’Brien et al. 2024 Earth Science Data Repositories: Implementing the CARE Principles. Data Science Journal