Technological Infrastructures
All news from Area Science Park
DPCfam-UHGP50: a dataset for research on the gastrointestinal proteome
The Data Engineering Laboratory (LADE) at Area Science Park has recently published an article in Nature – Scientific Data on protein sequence annotation.
Thanks to technological advances in genomic sequencing, the number of known protein sequences has grown exponentially. Many of these sequences come from metagenomic projects that analyze environmental and clinical samples. One of the most relevant datasets in this field is the Unified Human Gastrointestinal Proteome (UHGP) catalog, which has a variety of applications in medicine and biology. However, the limited annotation of these sequences reduces their usefulness.
To address this issue, the DPCfam-UHGP dataset was developed, classifying UHGP sequences into protein families that typically group proteins sharing the same biological function. The dataset contains 10,778 families, generated through DPCfam clustering, an unsupervised method that organizes sequences into single- or multi-domain architectures.
This project, part of Federico Barone's doctoral research supervised by Alessio Ansuini and Alberto Cazzaniga, exemplifies the fruitful interaction between data management and data science. In this context, the construction of a curated database of gastrointestinal proteins enabled more refined cataloging through advanced machine learning algorithms, allowing continuous database updates in a feedback loop aimed at promoting new discoveries.
The DPCfam-UHGP50 dataset, accessible through a web server, was developed following the best FAIR (Findable, Accessible, Interoperable, Reusable) practices, with the aim of fostering new discoveries in the field of human gastrointestinal tract metagenomics.
Previously, LADE had already produced the DPCfam-UR50 database, accompanied by a publication in PLOS – Computational Biology.
New Frontiers of Artificial Intelligence in Protein Research
The Data Engineering Laboratory (LADE) at Area Science Park has recently published a study in Bioinformatics that opens up new perspectives in the study of proteins, the fundamental building blocks of life. The authors, Francesca Cuturello, Marco Celoria, Alessio Ansuini and Alberto Cazzaniga, have demonstrated how artificial intelligence can predict the impact of genetic mutations on protein stability, which helps clarify the mechanisms underlying many diseases and could support the development of new treatments. The genome of living beings is constantly mutating due to external agents or random events, and these mutations change the sequences of the proteins they synthesise.
Conducted as part of the Pathogen Readiness Platform for CERIC-ERIC (PRP@CERIC) project, the study uses AI models similar to GPT, applied to proteomics. These models are based on the analogy between a protein sequence and a sentence, with amino acids acting as “words”, allowing algorithms trained on hundreds of millions of protein sequences to be applied. Using this technique, the LADE researchers were able to predict how small variations in the amino acid sequence, such as those induced by mutations, can affect protein stability.
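The sentence analogy can be illustrated with a minimal Python sketch. It is not the LADE code: the function names, the token ordering and the toy sequence are all assumptions made for illustration. It shows only how a protein sequence becomes a list of integer "words" and how a point mutation changes one of them.

```python
# The 20 standard amino acids, one letter each. The index order is an
# arbitrary choice for this sketch, not the vocabulary of any real model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def tokenize(sequence):
    """Map each residue (a "word") to an integer id."""
    return [TOKEN_ID[aa] for aa in sequence]


def apply_mutation(sequence, mutation):
    """Apply a point mutation written as e.g. 'T3S':
    wild-type residue, 1-based position, mutant residue."""
    wild, pos, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[pos - 1] != wild:
        raise ValueError(
            f"expected {wild} at position {pos}, found {sequence[pos - 1]}"
        )
    return sequence[:pos - 1] + mutant + sequence[pos:]


wild_type = "MKTAYIAK"  # made-up toy sequence, not a real protein
mutant = apply_mutation(wild_type, "T3S")

wild_tokens = tokenize(wild_type)
mutant_tokens = tokenize(mutant)

# Positions (0-based) where the two token lists differ.
changed = [i for i, (a, b) in enumerate(zip(wild_tokens, mutant_tokens)) if a != b]
```

A model trained on many such token sequences can then be asked how plausible the mutant "sentence" is compared with the wild-type one, which is the kind of signal the study relates to protein stability.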
A particularly innovative aspect is the use of the MSA Transformer model, which utilises information on the ancestral relationships between protein sequences to enhance the accuracy of predictions. The algorithm developed by LADE offers cutting-edge performance and will be made available to the scientific community to encourage further advancements in this field.
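To give a sense of what information from related sequences means in practice, the sketch below computes a simple score from a multiple sequence alignment (MSA). This is a classic frequency-profile baseline, named plainly as a stand-in: it is neither the MSA Transformer nor the algorithm developed by LADE, and the function name, the pseudocount and the toy alignment are assumptions.

```python
import math
from collections import Counter


def column_log_odds(msa, position, wild, mutant, pseudocount=1.0):
    """Return log P(mutant) - log P(wild) for one alignment column.

    `msa` is a list of equal-length aligned sequences, `position` is 0-based,
    and gaps ('-') are ignored. Additive smoothing over the 20 standard
    amino acids avoids taking the log of zero.
    """
    alphabet_size = 20
    column = Counter(row[position] for row in msa if row[position] != "-")
    total = sum(column.values()) + pseudocount * alphabet_size
    p_wild = (column[wild] + pseudocount) / total
    p_mut = (column[mutant] + pseudocount) / total
    return math.log(p_mut / p_wild)


# Made-up toy alignment of four related sequences.
toy_msa = ["MKTA", "MKSA", "MRTA", "MKTA"]

# Score for replacing T with S at the third column (0-based index 2).
score = column_log_odds(toy_msa, 2, "T", "S")
```

A negative score means the mutant residue is rarer than the wild-type one among the aligned relatives. Models such as the MSA Transformer learn far richer patterns from alignments than this single-column count.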
“Predicting the effect of protein mutations through artificial intelligence allows us to explore, with great precision, complex biological phenomena that, until recently, were difficult to observe directly”, explains Francesca Cuturello, the study’s lead author. “This technology is a step forward towards innovative therapeutic solutions for a wide range of diseases.”
The team’s work has already received widespread recognition, including Francesca Cuturello’s invitation to the prestigious Research Retreat “Physics of Biological Data Analysis” at the Aspen Center for Physics. It will also be presented at other international research centres, such as the ICTP and the Leibniz Center for Informatics.
Francesco Ortu receives the Artificial Intelligence Prize from the University of Trieste
Francesco Ortu was awarded the Artificial Intelligence Prize by the University of Trieste for his thesis “Interpreting How Large Language Models Handle Facts and Counterfactuals through Mechanistic Interpretability”, written for the Master’s program in “Data Science and Scientific Computing”. The work was developed at the Institute for Research and Technological Innovation (RIT) of Area Science Park. The study focuses on how generative language models, like those behind ChatGPT, react when presented with text containing false information.
The work was published in the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics and presented last August in Bangkok at one of the most important conferences on Computational Linguistics and Artificial Intelligence for Natural Language.
“Research on interpretability,” explains Francesco Ortu, “aims to bridge the gap between empirical approaches and our scientific understanding of the inner workings of generative language models (LLMs). So far, most existing research in this area has focused on how models copy or recall factual knowledge. In our study, we analyzed how information propagates within the neural network, identifying the ‘neurons’ that choose whether to promote or suppress false information proposed by the user.”
Congratulations to Francesco, with best wishes for exciting discoveries during his PhD, which will soon begin at the Data Engineering Laboratory (LADE) at Area Science Park.
Data Science in Fundamental Physics and Its Bridge to Industry & Society
Thanks to Data Science, today we can analyse and manipulate an enormous amount of data derived from the most disparate platforms, from social networks to medical records and from geolocators to streaming services, all with the aim of extracting new knowledge and value.
The data scientist uses advanced mathematical and physical techniques to find correlations, causal relationships, and interactions among data, developing hypotheses to test, and gradually improving analysis algorithms. Many of the techniques used are inspired by the results of fundamental physics, ranging from the physics of complex systems to high-energy physics.
These fruitful connections and their repercussions on society, the economy, the world of work, and industry will be explored during the international conference “Data Science in Fundamental Physics and its bridge to industry & society”, which will be held in Santiago de Compostela (Spain) from 3 to 7 June. It is organised by the Galician Institute of High Energy Physics (Instituto Galego de Física de Altas Enerxías, IGFAE).
Matteo Biagetti, a research physicist at LADE, the Data Engineering Laboratory active in Artificial Intelligence and Data Management, will represent Area Science Park and present his research activities in the field.
The conference will highlight career opportunities within the field of fundamental physics and its synergies with the job market. It will also include a session where companies can present their needs related to Data Science. By bringing together both aspects, the event aims to create a framework for the mutual exchange of knowledge and to foster practical synergies between Data Science and fundamental physics and between Data Science and industry.
More information on the event: Data Science in Fundamental Physics and the Bridge to Industry (usc.es)