Methods
The is-pero web server is an advanced, automated platform for the prediction and classification of peroxisomal proteins, designed to streamline large-scale proteomics analyses. By integrating cutting-edge machine learning models with sequence analysis techniques, is-pero delivers high-accuracy predictions for protein localization.
Developed through a collaboration between the University of Bologna, Wageningen University, and Ruhr University Bochum, is-pero simplifies peroxisomal protein identification using a structured, multi-step workflow:
Key Workflow Steps
Step 1 (is-pero): A binary classifier powered by XGBoost determines whether a protein is peroxisomal or non-peroxisomal. This model uses ESM-2 embeddings to convert protein sequences from FASTA files into numerical representations, which serve as the input features for the classifier.
Step 2 (in-pero): For proteins identified as peroxisomal, a multi-layer perceptron (MLP) further classifies them as either matrix or membrane-associated. The output includes a probability score (In-Pero Proba) to indicate the likelihood of sub-localization.
Step 3 (PTS Signal Analysis): This step detects PTS1, PTS2, and MPTS signals through regular expression matching, hydrophobicity scoring, and sequence motif analysis, enabling precise identification of targeting signals.
Step 4 (is-MPTS): A logistic regression model analyzes MPTS signals in membrane-associated proteins to predict PEX19-binding sites.
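The regular-expression and hydrophobicity components of Step 3 can be illustrated with a minimal sketch. The patterns below are the PTS1 and PTS2 consensus motifs commonly cited in the literature, not necessarily the expressions is-pero uses; the MPTS pattern is not stated in this text and is therefore left out. The function names, the 50-residue N-terminal search window, the 1-based start positions, and the choice of the Kyte-Doolittle scale are all assumptions made for illustration.

```python
import re

# Assumed consensus motifs from the general literature (not taken from is-pero):
# PTS1 is a C-terminal tripeptide, PTS2 an N-terminal nonapeptide.
PTS1_RE = re.compile(r"[SAC][KRH][LM]$")
PTS2_RE = re.compile(r"R[LVIQ].{2}[LVIHQ][LSGA].[HQ][LA]")

# Kyte-Doolittle hydropathy scale (assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def find_pts1(seq):
    """Return (motif, 1-based start) if the C-terminus matches PTS1, else None."""
    match = PTS1_RE.search(seq.upper())
    return (match.group(), match.start() + 1) if match else None


def find_pts2(seq, window=50):
    """Search the first `window` residues for a PTS2-like nonapeptide."""
    match = PTS2_RE.search(seq.upper()[:window])
    return (match.group(), match.start() + 1) if match else None


def mean_hydropathy(peptide):
    """Average Kyte-Doolittle value over the standard residues of a peptide."""
    values = [KD[aa] for aa in peptide.upper() if aa in KD]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    print(find_pts1("MAAAAGGSKL"))
    print(find_pts2("MQRLQVVLGHLAAA"))
    print(mean_hydropathy("LLKRL"))
```

Anchoring the PTS1 pattern with `$` restricts matches to the C-terminus, while the PTS2 search is limited to an N-terminal window so that internal look-alike motifs are ignored.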
Performance and Accuracy
The is-pero pipeline achieves good performance in all three classification tasks (Steps 1, 2, and 4). In the first step, proteins are classified as peroxisomal or non-peroxisomal. This protein large language model-based classifier was trained on a set of 1,339 peroxisomal and 12,051 non-peroxisomal proteins and tested on 335 peroxisomal and 3,320 non-peroxisomal proteins. On the test set, it achieves an overall accuracy of 98% and an AUC of 0.99. The second classification task discriminates between proteins located in the membrane and proteins located in the matrix of the peroxisome. Here the method was trained on a set of 132 membrane and 28 matrix proteins and tested on a set of 85 membrane and 59 matrix proteins; on the test set, it achieves an overall accuracy of 89% and an AUC of 0.98. In the fourth step, the method validates the presence of a peroxisomal membrane targeting signal (MPTS) detected by regular expressions in the third step. The classifier, evaluated in cross-validation on a set of 56 MPTS and 109 non-MPTS examples, achieved an overall accuracy of 81% and an AUC of 0.78.
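To make the two reported metrics concrete, the sketch below computes accuracy and AUC from labels and predicted probabilities using only the standard library. It is not the evaluation code behind the numbers above: the function names, the 0.5 decision threshold, and the toy data are assumptions. The last lines use the Step 1 test-set sizes quoted above to show why AUC is reported alongside accuracy: on a set this imbalanced, labeling every protein as non-peroxisomal would already be right about 91% of the time.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score equals the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the number of examples, which is
    fine for a sketch but not for large test sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data, made up for this example (1 = peroxisomal, 0 = non-peroxisomal).
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2, 0.1]

# Majority-class baseline for the Step 1 test set described above.
n_peroxisomal, n_non_peroxisomal = 335, 3320
baseline_accuracy = n_non_peroxisomal / (n_peroxisomal + n_non_peroxisomal)

if __name__ == "__main__":
    print(accuracy(toy_labels, toy_scores))
    print(roc_auc(toy_labels, toy_scores))
    print(round(baseline_accuracy, 3))
```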
Outputs and Applications
Outputs are provided in user-friendly TSV files, detailing sequence IDs, localization probabilities, and signal detection metrics. In detail, the TSV output includes the following fields:
Sequence_ID: sequence identifier provided in the FASTA file.
Peroxisomal: whether the protein is predicted to be peroxisomal (yes or no).
Probability_Peroxisomal: probability that the sequence is peroxisomal.
Matrix: predicted sub-peroxisomal localization of the peroxisomal protein (matrix or membrane).
Probability_Matrix: probability that the sequence is located in the matrix of the peroxisome.
Peptide: peroxisomal membrane targeting signal (MPTS) peptide detected by regular expression.
Start: start position of the detected MPTS peptide.
Score: MPTS pattern matching score.
Signal: type of signal detected (membrane or other PTSs).
Probability_MPTS: probability that the peptide is an MPTS.
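A minimal sketch of reading such a file with Python's standard csv module is shown below. The column names are the ones listed above; the sample rows, the sequence identifiers, the use of empty cells for fields that do not apply, and the helper names are assumptions made for illustration, and a real output file may differ in these details.

```python
import csv
import io

# Hypothetical sample mimicking the columns listed above; values are made up.
SAMPLE_TSV = (
    "Sequence_ID\tPeroxisomal\tProbability_Peroxisomal\tMatrix\t"
    "Probability_Matrix\tPeptide\tStart\tScore\tSignal\tProbability_MPTS\n"
    "protA\tyes\t0.97\tmatrix\t0.91\t\t\t\t\t\n"
    "protB\tyes\t0.88\tmembrane\t0.12\tLLKRLAAKL\t42\t3.5\tmembrane\t0.74\n"
    "protC\tno\t0.03\t\t\t\t\t\t\t\n"
)


def load_rows(handle):
    """Read a tab-separated result file into a list of dicts keyed by column."""
    return list(csv.DictReader(handle, delimiter="\t"))


def peroxisomal_ids(rows, min_probability=0.5):
    """IDs of sequences called peroxisomal with at least `min_probability`."""
    return [
        row["Sequence_ID"]
        for row in rows
        if row["Peroxisomal"].strip().lower() == "yes"
        and float(row["Probability_Peroxisomal"]) >= min_probability
    ]


def rows_with_mpts(rows):
    """Rows for which an MPTS probability was reported (non-empty cell)."""
    return [row for row in rows if row["Probability_MPTS"].strip()]


rows = load_rows(io.StringIO(SAMPLE_TSV))

if __name__ == "__main__":
    print(peroxisomal_ids(rows))
    print(peroxisomal_ids(rows, min_probability=0.9))
    print([row["Sequence_ID"] for row in rows_with_mpts(rows)])
```

For a file on disk, `load_rows(open(path, newline=""))` would replace the in-memory sample, where `path` stands for the location of the downloaded TSV.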
Tailored for high-throughput proteomics, is-pero offers a scalable and efficient solution for researchers focused on peroxisomal protein analysis and broader organelle-specific studies.
References
Please cite the publications below:
Anteghini, M., Martins dos Santos, V.A.P., Saccenti, E. (2021).
In-Pero: Exploiting deep learning embeddings of protein sequences to predict the localisation of peroxisomal proteins.
International Journal of Molecular Sciences. 22(12), 6409.
Anteghini, M., Haja, A., Dos Santos, V. A. P. M., Schomaker, L., Saccenti, E. (2023).
OrganelX web server for sub-peroxisomal and sub-mitochondrial protein localization and peroxisomal target signal detection.
Computational and Structural Biotechnology Journal. 21, 128–133.