Proteogenomic peptide discovery in large search spaces
Doctoral thesis
Date of Examination: 2024-02-23
Date of issue: 2025-01-30
Advisor: Dr. Juliane Liepe
Referee: Prof. Dr. Henning Urlaub
Referee: Dr. Johannes Soeding
Referee: Prof. Dr. Anne-Christin Hauschild
Referee: Prof. Dr. Jochen Rink
Referee: Dr. Alex Faesen
Files in this item
Name: Dissertation_YH_2025.pdf
Size: 5.73Mb
Format: PDF
Description: Dissertation
Abstract
The discovery of novel peptides and proteins is of therapeutic relevance and has driven the development of methods for their identification, most commonly via mass spectrometry. Mass spectrometry-based identification relies on a well-characterised proteogenomic search space, whose size is not known, especially when considering noncanonical peptides derived from alternative transcription and translation events. The sequence content of a proteogenomic search space acts as an informed prior about sample composition; if the prior assumptions are incorrect, peptide and protein identification is compromised. We developed an automated workflow consisting of two tools: Sequoia, which creates exhaustive, RNA sequencing-informed sequence search spaces for various noncanonical peptide strata, and SPIsnake, which pre-filters and explores the sequence search space prior to mass spectrometry searches. We applied this workflow to characterise the exact sizes of tryptic and nonspecific peptide sequence search spaces under a variety of definitions, their reduction when RNA expression is taken into account, their inflation by chemical post-translational modifications, and the frequency with which peptide sequences map to multiple noncanonical origins. Furthermore, we explored the application of Sequoia and SPIsnake to HLA-I immunopeptidome sequence identification, which allowed us to rescue sensitivity in peptide identification when confronted with inflated search spaces. Taken together, Sequoia and SPIsnake pave the way for the informed development of methods for large-scale exhaustive proteogenomic discovery by exposing the consequences of database size inflation and of ambiguity in peptide and protein sequence identification.
Keywords: Mass spectrometry, Proteomics, Proteogenomics, MHC-I, Immunopeptidome