Data Management and Sharing Plan | UAMS Winthrop P. Rockefeller Cancer Institute

Element 1: Data Type

A. Types and amount of scientific data expected to be generated in the project:

A follow-up survey will be sent to all 26,375 ARCH participants (updated contact information, cancer diagnosis, and tobacco use); 1,836 participants will be contacted for in-depth exposure surveys (diet, work history); and 900 will be contacted for additional biospecimen collection [individual exposure and mediators (methylome, immune assays)]. Clinical data will be obtained from the site electronic health records at the UAMS TRI and RRN recruitment sites.

Clinical and survey data will be captured by each site in the REDCap secure electronic data capture (EDC) system. Each user will be given role-specific access to the EDC, with permissions granted on a per-user basis.

Cancer Registry data linkage has been approved and performed by the Arkansas Department of Health, and the linked data are stored in a secured environment with limited access. Additional linkage covering all cancer outcomes will be requested.

All-Payer Claims Database linkage will be conducted by the Arkansas Center for Health Improvement for consented participants, and the linked data will be stored in a secured environment with limited access.

Epigenetic data will be obtained from saliva samples at the baseline and follow-up visits, following informed consent. DNA methylation arrays (EPIC) will be used, producing a total of 7.5 terabytes (TB) of data.

Proteomics data will be obtained from blood samples, and will be collected using Thermo Orbitrap class mass spectrometers. The total amount of proteomics data produced will be approximately 1.0 terabyte (TB).

*For both genomic and proteomic data, raw files will be transferred to an on-site, centralized storage system for management and analysis and will be uploaded to a relevant public repository.

Metabolomics data will be obtained from urine and blood samples and assayed using Thermo Orbitrap class mass spectrometers. The total amount of data produced will be approximately 1.0 terabyte (TB).

Heavy metal data will be obtained from saliva and urine samples and assayed using Inductively Coupled Plasma Mass Spectrometry (ICP-MS) coupled with UHPLC for speciation. The total amount of data produced will be approximately 0.5 terabyte (TB).

Social environmental data will be obtained from publicly available datasets including census-tract data from the EPA’s National Air Toxics Assessment and AirToxScreen for ambient air heavy metals and carcinogenic chemicals, as well as historical data to determine long-term exposure.
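As a rough planning aid, the assay volumes stated in this section can be tallied as in the sketch below. It uses only the figures given in the individual assay paragraphs above (including the 1.0 TB figure from the proteomics paragraph); survey, clinical, linkage, and social environmental data are left out because no sizes are stated for them. The dictionary and function names are illustrative assumptions, not part of the plan.

```python
# Approximate data volumes in terabytes, as stated in Section A.
# Names are illustrative; only the numbers come from the plan.
EXPECTED_TB = {
    "epigenetic (EPIC methylation arrays)": 7.5,
    "proteomics (Orbitrap)": 1.0,
    "metabolomics (Orbitrap)": 1.0,
    "heavy metals (ICP-MS)": 0.5,
}


def total_storage_tb(volumes):
    """Sum the per-assay volumes (TB) into one storage estimate."""
    return sum(volumes.values())


if __name__ == "__main__":
    for assay, tb in EXPECTED_TB.items():
        print(f"{assay}: {tb} TB")
    print(f"estimated total: {total_storage_tb(EXPECTED_TB)} TB")
```

Under these figures the assay data come to roughly 10 TB, before any backups or processed derivatives.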

B. Scientific data that will be preserved and shared, and the rationale for doing so:

Demographic data: The clinical data that will be preserved and shared include demographic data and anthropometric measurements, among other data pertinent to the study.

Survey & social environmental data: Except as noted in Element 5 below, de-identified individual-level and aggregate survey data, both raw and recoded, will be shared. The de-identification process will remove direct and indirect respondent identifiers. Once the data are confirmed final, respondent identifiers will be deleted.
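The identifier-removal step described above could look something like the following minimal sketch. The field names (name, address, zip_code, and so on) and the coded study_id are hypothetical; the plan does not list the actual REDCap variables, and a real de-identification pass would follow the institution's approved procedure.

```python
# Hypothetical identifier fields; the plan does not enumerate them.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}
INDIRECT_IDENTIFIERS = {"date_of_birth", "zip_code"}


def deidentify(records, drop_fields=DIRECT_IDENTIFIERS | INDIRECT_IDENTIFIERS):
    """Return copies of the records with identifier fields removed."""
    return [
        {key: value for key, value in record.items() if key not in drop_fields}
        for record in records
    ]


if __name__ == "__main__":
    survey = [
        {"study_id": "A001", "name": "Jane Doe", "zip_code": "72205", "smoker": "no"},
    ]
    print(deidentify(survey))
```

The original records are left unchanged, so the identified copy can be deleted separately once the data are confirmed final.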

Epigenetic, metabolomics, and proteomics data described in Section A will be preserved and shared through public data repositories and journals. Project-related data that will be preserved and shared include raw BeadChip datasets, raw mass spectrometry files, and metadata files that link each raw file name to its study group, sample identification, coding schema, and the open-source software necessary for third-party data processing. These raw data, metadata files, and associated software lists will be preserved and shared to disseminate findings and allow results to be reproduced. Processed data will be shared as they become available over the course of the study. See Element 4.

C. Metadata, other relevant data, and associated documentation:

To facilitate management, interpretation, and sharing of data, all sample lists, experimental groups, file labels, and coding schema will be provided in a spreadsheet format (.csv or .xls) when uploaded to the data repository and will be associated with the appropriate epigenetic and proteomic datasets. We will also upload a text file denoting which software is to be used for each raw epigenetic or proteomic dataset.
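One possible shape for the sample spreadsheet is sketched below using Python's standard csv module. The column names, file names, and group codes are hypothetical placeholders chosen for illustration; the actual layout would follow the requirements of the receiving repository.

```python
import csv
import io

# Hypothetical column layout for the sample manifest.
MANIFEST_FIELDS = ["raw_file", "sample_id", "experimental_group", "group_code"]


def build_manifest(rows):
    """Render manifest rows (a list of dicts) as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    example_rows = [
        {"raw_file": "S001_Grn.idat", "sample_id": "S001",
         "experimental_group": "baseline", "group_code": "0"},
        {"raw_file": "S001.raw", "sample_id": "S001",
         "experimental_group": "baseline", "group_code": "0"},
    ]
    print(build_manifest(example_rows))
```

Keeping one row per raw file lets a third party match any downloaded file to its sample and group without further lookup.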

The protocol, sample informed consent, case report forms, data dictionary, and code book will be made accessible in data repositories where data are shared.

Element 2: Related Tools, Software and/or Code

For epigenetic datasets, raw .idat files can be accessed and aligned using free, open-source packages in R, with appropriate reference genomes freely available through NCBI, GenBank, dbSNP, RefSeq, and other genome databases. For proteomic datasets, raw files (.raw) can be processed using freely available, open-source tools, including ProteoWizard for file conversion and EncyclopeDIA for library searching against reference libraries derived from UniProt. For metabolomics datasets, raw files (.raw) can be processed using mzCloud. All software needed to reproduce the analysis, including the search parameters used to derive the reference library, will be listed in the metadata text file, which will also associate the appropriate raw datasets and sample identification lists with the software necessary for processing. Clinical and survey data will be collected in REDCap and analyzed using statistical packages in R and SAS.
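The association between raw files and processing software could be written out along the lines of this sketch. The tool names come from the paragraph above; the tab-separated layout, the data-type labels, and the example file names are assumptions made for illustration only.

```python
# File extension and tools per data type, taken from the Element 2 text.
# The data-type keys and the output layout are illustrative assumptions.
PROCESSING_TOOLS = {
    "epigenetic": (".idat", ["R (open-source packages)"]),
    "proteomic": (".raw", ["ProteoWizard", "EncyclopeDIA"]),
    "metabolomic": (".raw", ["mzCloud"]),
}


def software_note(raw_files):
    """Build one tab-separated line per raw file: name, data type, tools.

    raw_files maps a file name to its data type.
    """
    lines = []
    for file_name, data_type in sorted(raw_files.items()):
        extension, tools = PROCESSING_TOOLS[data_type]
        if not file_name.endswith(extension):
            raise ValueError(f"{file_name} is not a {extension} file")
        lines.append(f"{file_name}\t{data_type}\t{', '.join(tools)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(software_note({"S001_Grn.idat": "epigenetic", "S001.raw": "proteomic"}))
```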

Element 3: Standards

Data will be managed and shared in accordance with the FAIR data principles. Researchers will make every effort to ensure the data shared is findable, accessible, interoperable, and reusable. Shared data will be deidentified, and original data will be maintained at the investigator’s institution. Information on key data processing and quality control protocols will be made available along with the data.

Element 4: Data Preservation, Access, and Associated Timelines

A. Repository where scientific data and metadata will be archived:

For epigenetics, raw data (.idat files) and associated metadata files will be uploaded to the Gene Expression Omnibus (GEO), a public repository that archives and freely distributes comprehensive sets of microarray, next-generation sequencing, and other forms of high-throughput functional genomic datasets submitted by the scientific community. Epigenetics datasets on GEO are curated and readily searchable for no fee, and the repository provides tools to help users query and download experimental data.

For proteomics and metabolomics, raw data (.raw files) and associated metadata files will be uploaded to the PRIDE Archive, a centralized, standards-compliant, public data repository for mass spectrometry proteomics data. PRIDE is a core member of the ProteomeXchange (PX) consortium, which provides a standardized way of submitting mass spectrometry-based proteomics data to public-domain repositories.

B. How scientific data will be findable and identifiable:

GEO is an NIH-supported data repository that hosts open-access array- and sequence-based datasets. GEO requires submission of raw data files, processed data files, and a metadata spreadsheet. Submissions are linked to a Principal Investigator’s My NCBI account. All submitted data is reviewed by a curator before being assigned a unique accession number. Once data is approved and an accession number is assigned to the dataset, the accession number can be cited in manuscripts to find and identify the appropriate dataset.

PRIDE is a member of ProteomeXchange (PX) consortium that provides open access to proteomics datasets. It requires raw and processed data files to be submitted. Upon submission of an initial dataset, a data curator will perform an initial assessment. If data is validated successfully, the dataset is assigned a unique identifier that can be cited in manuscripts to find and identify the appropriate dataset.

C. When and how long the scientific data will be made available:

Data generated from this project will be made available as soon as possible and no later than the time of manuscript publication or the end of the project period, whichever comes first. Data will remain readily available and easily accessible for as long as they remain beneficial to the larger scientific community or the broader public.
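The "whichever comes first" rule amounts to taking the earlier of two dates, as the short sketch below shows. The function name and the example dates are hypothetical; no actual publication or project end dates are given in this plan.

```python
from datetime import date


def release_deadline(project_end, publication=None):
    """Return the latest allowed release date: the earlier of the
    publication date (if known) and the end of the project period."""
    if publication is None:
        return project_end
    return min(publication, project_end)


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    print(release_deadline(date(2029, 6, 30), date(2027, 3, 1)))
```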

Element 5: Access, Distribution, or Reuse Considerations

A. Factors affecting subsequent access, distribution, or reuse of scientific data

Although human subjects data will be collected during the course of this project, the study meets the criteria for Exemption 4. No identifiable data will be associated with the human subjects' samples.

B. Whether access to scientific data will be controlled:

Both GEO and PRIDE datasets can remain private until either the PI elects to share or the manuscript associated with the dataset is published, whichever comes first.

C. Protections for privacy, rights, and confidentiality of human research participants:

Informed consent documents used for the proposed study will include explicit language informing the participant or legally authorized representative that residual biological specimens, including DNA, may be stored in a biorepository for other scientific investigations. To protect research participants’ privacy and confidentiality, data submitted to the repository will not include personally identifiable information such as names or addresses. Additional protections, such as the approach for managing Health Insurance Portability and Accountability Act identifiers, will be used for de-identification or to provide a limited data set to minimize the risk of participant re-identification.

Element 6: Oversight of Data Management and Sharing

The Principal Investigators for this project, Drs. Hsu, Su, and Koss, will ensure that this Data Management and Sharing (DMS) Plan is followed. Data management and informatics support provided by the UAMS Comprehensive Informatics Resource Center (CIRC) will manage the data and oversee compliance with the accepted DMS Plan. Compliance will be evaluated annually during the award period, and progress on the plan's DMS activities will be included in the annual Research Performance Progress Report submitted to the NCI Project Officer. At the conclusion of the project, the final progress report will summarize how the DMS objectives were fulfilled and provide links to the shared dataset(s).