Title of VM Grant: Omics Data Gathering
Start and end date: 30/09/2024 to 21/10/2024
Candidate: Marco Barreca, University of Milano
Description of the work carried out during the VM
The Virtual Mobility project focused on creating a Systematic Search Strategy to collect publicly available omics datasets from immuno-oncology experiments. The main goal was to optimize the dataset search and collection process by providing easy-to-use guidelines. All the material produced is available at the link on the last page.
The work was organized into different phases, focusing on the establishment of a taskforce and the technical development of the search strategy.
1. Data Gathering Taskforce involvement and first meeting: At the beginning of the VM period, interested researchers were recruited and a Data Gathering Taskforce was established. During the first meeting, the taskforce discussed how to proceed in this pilot phase and assigned the tasks. All documents and useful files were organized in a shared Google Drive folder.
2. Pilot Query generation and Data Collection: Query generation was handled primarily by the project lead. Two pilot search queries were developed to find transcriptomic datasets related to immunotherapy in breast cancer and adenocarcinoma in the Gene Expression Omnibus (GEO) database. The query results contained a summary of each retrieved dataset, including dataset ID, platform type, and number of samples, and were downloaded as txt files.
3. Conversion to Excel Format: To make the query results easier to use, a custom script was developed to convert the txt files into Excel spreadsheets. In these files, each row corresponds to a dataset and each column to one of its features. Additional columns were added for manual curation, such as whether the dataset was derived from bulk RNA-seq or single-cell RNA-seq, and whether the study was clinical or preclinical.
4. Pilot query results curation: The taskforce played a crucial role in the pilot manual curation of the query results. Some taskforce members downloaded a copy of the Excel file and independently reviewed the datasets found. Based on each researcher's expertise, datasets were selected or excluded using a checkbox.
5. Concordance and Similarity Analysis: After the curation of the query results was completed, an analysis was carried out to evaluate the concordance between the researchers' dataset selections, in order to help define the selection criteria. The similarity of the dataset annotations entered by the researchers was also evaluated. The goal was to assess how consistently datasets were evaluated and to identify any discrepancies or challenges in the manual curation process. This analysis was critical in highlighting potential issues with manual data selection and will inform the development of guidelines to enhance curation accuracy.
6. Keywords and research boundaries: The taskforce members compiled a further Excel file that defines the research boundaries, highlighting new fields and terms of interest and helping to define the selection criteria.
7. Second meeting and report writing: During the second meeting, the taskforce reflected on the query results curation experience, discussing the reported concordance and similarity as well as the research boundaries. The feedback was collected in a report that will be key to establishing future guidelines on the query results curation process.
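The report does not specify how the concordance and similarity analysis in step 5 was computed. As one possible illustration, the sketch below compares each pair of curators using the Jaccard index of their selected datasets and Cohen's kappa on their include/exclude decisions. The choice of metrics, the function names, and the input structure are assumptions made for illustration, not the analysis actually performed.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections of dataset IDs (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cohen_kappa(x, y):
    """Cohen's kappa for two equal-length lists of boolean decisions."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("decision lists must be non-empty and of equal length")
    observed = sum(p == q for p, q in zip(x, y)) / n
    px, py = sum(x) / n, sum(y) / n
    # Agreement expected by chance, given each curator's selection rate.
    expected = px * py + (1 - px) * (1 - py)
    if expected == 1.0:
        return 1.0  # both curators made one identical, constant decision
    return (observed - expected) / (1 - expected)


def pairwise_concordance(decisions):
    """decisions maps curator name -> {dataset_id: True if selected}.

    Only datasets reviewed by both curators in a pair are compared.
    """
    results = {}
    for c1, c2 in combinations(sorted(decisions), 2):
        shared = sorted(set(decisions[c1]) & set(decisions[c2]))
        x = [decisions[c1][d] for d in shared]
        y = [decisions[c2][d] for d in shared]
        selected_1 = [d for d in shared if decisions[c1][d]]
        selected_2 = [d for d in shared if decisions[c2][d]]
        results[(c1, c2)] = {
            "jaccard": jaccard(selected_1, selected_2),
            "kappa": cohen_kappa(x, y),
        }
    return results
```

The Jaccard index only looks at the datasets that were selected, whereas kappa also credits agreement on exclusions and corrects for chance, so reporting both gives a fuller picture when most datasets are excluded.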
Description of the VM main achievements and planned follow-up activities
The Virtual Mobility Grant led to several important achievements:
1. Systematic search strategy development: a pilot version of the systematic search strategy was developed, focusing on the optimization of search queries in GEO for RNA-sequencing datasets. The custom script to convert txt files into Excel files significantly streamlined the data processing and annotation workflow.
2. Query results curation principles: the taskforce played a central role in manually curating the query results and deciding on dataset inclusion or exclusion based on the researchers' expertise. The manual curation process proved essential in highlighting challenges and inconsistencies that can occur in dataset selection, providing a foundation for improving future selection criteria and guidelines.
3. Documentation and future refinements: comprehensive documentation of the query structure, dataset curation and analysis results was produced. This documentation will be useful for defining guidelines for the future expansion of the project to include other data types.
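The conversion script mentioned in the first achievement is not reproduced in this report. The following is only a minimal sketch of the idea, assuming the txt export consists of blank-line-separated records with labelled fields such as "Organism:", "Platform:", "... Samples" and "Accession:". That layout, the column names, and the placeholder records in SAMPLE are all assumptions; the real export may differ. The sketch writes CSV text, which Excel can open; a real .xlsx file could be written instead with, for example, pandas `DataFrame.to_excel`.

```python
import csv
import io
import re

# Empty columns to be filled in by hand during manual curation (names are hypothetical).
CURATION_COLUMNS = ["assay_type", "study_type", "selected"]
FIELDS = ["accession", "title", "organism", "platform", "n_samples"] + CURATION_COLUMNS

# Placeholder records in the assumed layout; not real GEO entries.
SAMPLE = (
    "1. Placeholder study one\n"
    "(Submitter supplied) Placeholder summary text.\n"
    "Organism:\tHomo sapiens\n"
    "Type:\t\tExpression profiling by high throughput sequencing\n"
    "Platform: GPL00000 12 Samples\n"
    "Series\t\tAccession: GSE000000\tID: 0\n"
    "\n"
    "2. Placeholder study two\n"
    "Organism:\tMus musculus\n"
    "Platform: GPL00001 6 Samples\n"
    "Series\t\tAccession: GSE000001\tID: 1\n"
)


def parse_records(text):
    """Turn the txt export into one dict per dataset."""
    rows = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        accession = re.search(r"Accession:\s*(\S+)", block)
        if not lines or not accession:
            continue  # skip anything that does not look like a dataset record
        organism = re.search(r"Organism:\s*(.+)", block)
        platform = re.search(r"Platforms?:\s*(\S+)", block)
        samples = re.search(r"(\d+)\s+Samples?", block)
        row = {
            "accession": accession.group(1),
            "title": re.sub(r"^\d+\.\s*", "", lines[0]),  # drop the "1. " prefix
            "organism": organism.group(1).strip() if organism else "",
            "platform": platform.group(1) if platform else "",
            "n_samples": int(samples.group(1)) if samples else "",
        }
        row.update({column: "" for column in CURATION_COLUMNS})
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialize the rows as CSV: one row per dataset, one column per feature."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv_text(parse_records(SAMPLE))` yields a header line followed by one line per record, with the curation columns left blank for the reviewers.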
Contribution to Action objective and deliverables:
1. Knowledge sharing channels: the establishment of the Data Gathering Taskforce is fostering collaboration among researchers and facilitating the transfer of knowledge between basic, translational and clinical investigators. In particular, it will generate a data resource available to everyone in the Action and, upon publication, to the broader cancer community.
2. Dataset collection, processing and biobanking protocols: the documentation produced will be fundamental for the establishment of dataset collection and curation guidelines. It will lay the basis for robust protocols for collecting molecular data from clinical and preclinical samples. Such data are an important resource to accelerate the study of immunotherapy efficacy and toxicity and to validate biomarkers for monitoring these effects.
Several activities are planned to further develop and expand the project:
1. Expansion to other tumour types and treatments: the next phase will expand the search strategy to additional tumour types and other treatments by generating new queries.
2. Refinement of selection criteria: the insights gained from the concordance analysis of the manual curation process will inform the development of refined selection criteria for datasets and guidelines for the curation process. These criteria will help to reduce inconsistencies.
3. Ongoing taskforce collaboration: the Data Gathering Taskforce will continue to provide feedback on the project's progress and collaborate on refining the search strategy and curation methods. Regular online meetings will be held to ensure continuous knowledge sharing and collaboration.