Data engineering
All scientific publications at Area Science Park
A comprehensive framework for solution space exploration in community detection
Abstract: Community detection algorithms are essential tools for understanding complex networks, yet their results often vary between runs and are affected by node input order and the presence of outliers, undermining reproducibility and interpretation. This paper addresses these issues by introducing a framework for systematic exploration of the solution space, obtained through repeated runs of a given algorithm with permuted node orders. A Bayesian model assesses convergence, estimates solution probabilities, and provides a defensible stopping rule that balances accuracy and computational cost. Building on this process, we propose a taxonomy of solution spaces that offers clear diagnostics of partition reliability across algorithms and a shared vocabulary for interpretation. Applied to a real-world network, the approach shows that different algorithms produce various types of solution space, highlighting the importance of systematic exploration of the solutions before drawing scientific conclusions. Authors: Fabio Morea, Domenico de Stefano Journal: Scientific Reports Publication date: 31/10/2025 Consult the publication
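The exploration loop described in the abstract can be illustrated with a toy experiment (this is a hedged sketch, not the paper's Bayesian model): run a simple asynchronous label-propagation pass many times with permuted node orders and count the distinct partitions that come out, in canonical form. The graph, sweep cap, and seed are illustrative choices.

```python
import random
from collections import Counter

# Toy undirected graph: two 4-cliques joined by a single bridge edge (3, 4).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def label_propagation(nodes, rng):
    """Asynchronous label propagation; the outcome depends on node order."""
    labels = {n: n for n in nodes}
    for _ in range(20):  # fixed sweep cap keeps the sketch simple
        order = list(nodes)
        rng.shuffle(order)
        changed = False
        for n in order:
            counts = Counter(labels[m] for m in adj[n])
            best = max(counts.values())
            new = min(l for l, c in counts.items() if c == best)
            if new != labels[n]:
                labels[n] = new
                changed = True
        if not changed:
            break
    # Canonical form: frozenset of frozensets, so equal partitions compare equal.
    groups = {}
    for n, l in labels.items():
        groups.setdefault(l, set()).add(n)
    return frozenset(frozenset(g) for g in groups.values())

rng = random.Random(42)
solutions = Counter(label_propagation(list(adj), rng) for _ in range(200))
for part, freq in solutions.most_common():
    print(freq, sorted(sorted(g) for g in part))
```

Even on this tiny graph the permuted runs can land on more than one partition (two communities, or everything merged across the bridge), which is exactly the kind of solution-space variability the paper's framework is designed to characterise.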
A Comprehensive Analysis of Process Energy Consumption on Multi-Socket Systems with GPUs
Abstract: Robustly estimating energy consumption in High-Performance Computing (HPC) is essential for assessing the energy footprint of modern workloads, particularly in fields such as Artificial Intelligence (AI) research, development, and deployment. The extensive use of supercomputers for AI training has heightened concerns about energy consumption and carbon emissions. Existing energy estimation tools often assume exclusive use of computing nodes, a premise that becomes problematic with the advent of supercomputers integrating microservices, as seen in initiatives like Acceleration as a Service (XaaS) and cloud computing. This work investigates the impact of executed instructions on overall power consumption, providing insights into the comprehensive behaviour of HPC systems. We introduce two novel mathematical models to estimate a process’s energy consumption based on the total node energy, process usage, and a normalised vector of the probability distribution of instruction types for CPU and GPU processes. Our approach enables energy accounting for specific processes without the need for isolation. Our models demonstrate high accuracy, predicting CPU power consumption with a mere 1.9% error. For GPU predictions, the models achieve a central relative error of 9.7%, showing a clear tendency to fit the test data accurately. These results pave the way for new tools to measure and account for energy consumption in shared supercomputing environments. Authors: Luis G. Leon-Vega, Niccolò Tosato, Stefano Cozzini Journal: 11th Latin American High Performance Computing Conference, CARLA 2024 Publication date: 30/09/2024 Consult the publication
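A minimal sketch of the general idea (not the paper's actual models): treat node power as idle power plus a usage-weighted dot product between a normalised instruction-mix vector and per-class power weights, and recover the weights by least squares. All numbers here (class weights, idle power, noise level) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: 4 instruction classes, 500 samples of node telemetry.
true_w = np.array([15.0, 40.0, 90.0, 25.0])   # hypothetical watts per class
idle = 50.0                                   # hypothetical baseline node power
mix = rng.dirichlet(np.ones(4), size=500)     # normalised instruction-mix vectors
usage = rng.uniform(0.1, 1.0, size=500)       # process usage fraction

# Measured node power = idle + usage * (mix . weights) + measurement noise.
power = idle + usage * (mix @ true_w) + rng.normal(0.0, 0.5, size=500)

# Recover idle power and per-class weights jointly by ordinary least squares.
X = np.column_stack([np.ones(len(power)), usage[:, None] * mix])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)
est_idle, est_w = coef[0], coef[1:]

# Once fitted, power can be attributed to a process from its usage and
# instruction mix alone, without running it in isolation.
print(np.round(est_idle, 1), np.round(est_w, 1))
```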
Molecular simulations to investigate the impact of N6-methylation in RNA recognition: Improving accuracy and precision of binding free energy prediction
Abstract: N6-Methyladenosine (m6A) is a prevalent RNA post-transcriptional modification that plays crucial roles in RNA stability, structural dynamics, and interactions with proteins. The YT521-B (YTH) family of proteins, which are notable m6A readers, functions through its highly conserved YTH domain. Recent structural investigations and molecular dynamics (MD) simulations have shed light on the mechanism of recognition of m6A by the YTHDC1 protein. Despite advancements, using MD to predict the stabilization induced by m6A on the free energy of binding between RNA and YTH proteins remains challenging due to inaccuracies in the employed force fields and limited sampling. For instance, simulations often fail to sufficiently capture the hydration dynamics of the binding pocket. This study addresses these challenges through an innovative methodology that integrates metadynamics, alchemical simulations, and force-field refinement. Importantly, our research identifies hydration of the binding pocket as giving only a minor contribution to the binding free energy and emphasizes the critical importance of precisely tuning force-field parameters to experimental data. By employing a fitting strategy built on alchemical calculations, we refine the m6A partial charge parameters, thereby enabling the simultaneous reproduction of the effects of N6 methylation on both the protein binding free energy and the thermodynamic stability of nine RNA duplexes. Our findings underscore the sensitivity of binding free energies to partial charges, highlighting the necessity for thorough parametrization and validation against experimental observations across a range of structural contexts. Authors: Valerio Piomponi, Miroslav Krepl, Jiri Sponer, Giovanni Bussi Journal: The Journal of Physical Chemistry B, Vol. 128, Issue 37 Publication date: 06/09/2024 Consult the paper
Detach-ROCKET: Sequential feature selection for time series classification with random convolutional kernels
Abstract: Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6% while reducing features by 98.9%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at https://github.com/gon-uri/detach_rocket. 
Authors: Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Eric Fransén Journal: Data Mining and Knowledge Discovery Publication date: 20/08/2024 Consult the publication
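The core loop can be sketched as follows (a hedged toy version, not the authors' implementation or settings; see the linked repository for the real one): generate ROCKET-style PPV features from random kernels, fit a ridge classifier, then repeatedly detach the lowest-|coefficient| fraction of surviving features and refit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary TSC problem: class 1 adds a high-frequency ripple to a sine wave.
n, length = 120, 100
t = np.linspace(0, 4 * np.pi, length)
y = rng.integers(0, 2, n)
X = np.sin(t) + 0.3 * y[:, None] * np.sin(8 * t) + 0.2 * rng.normal(size=(n, length))

# ROCKET-style features: random kernels with PPV (proportion of positive values).
n_kernels = 200
feats = np.empty((n, n_kernels))
for k in range(n_kernels):
    w = rng.normal(size=9)
    b = rng.normal()
    conv = np.array([np.convolve(x, w, mode="valid") + b for x in X])
    feats[:, k] = (conv > 0).mean(axis=1)

# Ridge classifier in closed form on standardised features.
Z = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
lam = 1.0
target = 2.0 * y - 1.0
coef = np.linalg.solve(Z.T @ Z + lam * np.eye(n_kernels), Z.T @ target)

# Sequential Feature Detachment (sketch): drop the 10% of surviving features
# with the smallest |coefficient|, refit, repeat until ~10% remain.
keep = np.arange(n_kernels)
while keep.size > n_kernels // 10:
    drop = max(1, int(0.1 * keep.size))
    order = np.argsort(np.abs(coef))
    keep = keep[order[drop:]]
    Zk = Z[:, keep]
    coef = np.linalg.solve(Zk.T @ Zk + lam * np.eye(keep.size), Zk.T @ target)

pred = np.sign(Z[:, keep] @ coef)
acc = (pred == target).mean()
print(f"{keep.size} features kept, train accuracy {acc:.2f}")
```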
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Abstract: Interpretability research aims to bridge the gap between empirical success and our scientific understanding of the inner workings of large language models (LLMs). However, most existing research focuses on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms and traces how one of them becomes dominant in the final prediction. We uncover how and where mechanisms compete within LLMs using two interpretability methods: logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms. Authors: Francesco Ortu, Zhijing Jin, Diego Doimo, Mrinmaya Sachan, Alberto Cazzaniga, Bernhard Schölkopf Journal: Accepted at the Annual Meeting of the Association for Computational Linguistics (ACL), arXiv preprint: 2402.11655 Date: 06/06/2024 Consult the paper
Emergent representations in networks trained with the Forward-Forward algorithm
Abstract: The Backpropagation algorithm has often been criticised for its lack of biological realism. In an attempt to find a more biologically plausible alternative, the recently introduced Forward-Forward algorithm replaces the forward and backward passes of Backpropagation with two forward passes. In this work, we show that the internal representations obtained by the Forward-Forward algorithm can organise into category-specific ensembles exhibiting high sparsity – composed of a low number of active units. This situation is reminiscent of what has been observed in cortical sensory areas, where neuronal ensembles are suggested to serve as the functional building blocks for perception and action. Interestingly, while this sparse pattern does not typically arise in models trained with standard Backpropagation, it can emerge in networks trained with Backpropagation on the same objective proposed for the Forward-Forward algorithm. These results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex, even when a backward pass is used. Authors: Niccolò Tosato, Lorenzo Basile, Emanuele Ballarin, Giuseppe de Alteriis, Alberto Cazzaniga, Alessio Ansuini Journal: Submitted to Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track. arXiv preprint: 2305.18353 Date: 19/06/2024 Consult the paper
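The Forward-Forward objective can be sketched with a single numpy layer (an illustrative toy, not the paper's networks): push a "goodness" score – the sum of squared activations – above a threshold for positive data and below it for negative data, with gradients confined to the layer. The toy data, threshold, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: positive samples lie along the (1, 1) diagonal, negative samples
# along the (1, -1) diagonal, so norms match and only orientation differs.
n = 400
pos = np.vstack([rng.normal([2, 2], 0.5, (n // 2, 2)),
                 rng.normal([-2, -2], 0.5, (n // 2, 2))])
neg = np.vstack([rng.normal([2, -2], 0.5, (n // 2, 2)),
                 rng.normal([-2, 2], 0.5, (n // 2, 2))])

W = rng.normal(scale=0.1, size=(2, 16))
b = np.zeros(16)
theta, lr = 4.0, 0.03

def goodness(X):
    h = np.maximum(X @ W + b, 0.0)      # ReLU activations
    return h, (h ** 2).sum(axis=1)      # goodness = sum of squared activities

for _ in range(300):
    for X, sign in ((pos, 1.0), (neg, -1.0)):
        h, g = goodness(X)
        # Logistic loss on (goodness - theta): positives pushed above the
        # threshold, negatives below. p = sigmoid(sign * (theta - g)).
        p = 1.0 / (1.0 + np.exp(np.clip(sign * (g - theta), -30, 30)))
        grad_pre = 2.0 * h * (-sign * p)[:, None]
        W -= lr * X.T @ grad_pre / len(X)
        b -= lr * grad_pre.mean(axis=0)

g_pos = goodness(pos)[1].mean()
g_neg = goodness(neg)[1].mean()
print(round(float(g_pos), 2), round(float(g_neg), 2))
```

After training, mean goodness on positive data ends up well above that on negative data, showing the layer has learned purely from the two local forward passes.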
Enhancing Multi-Tip Artifact Detection in STM Images Using Fourier Transform and Vision Transformers
Abstract: We address the issue of multi-tip artifacts in Scanning Tunneling Microscopy (STM) images by applying the fast Fourier transform (FFT) as a feature engineering method. We fine-tune various neural network architectures, including Vision Transformers (ViT), using a synthetic dataset. The FFT-based preprocessing significantly improves the performance of ViT models compared to using only the grayscale channel. Ablation experiments highlight the optimal conditions for synthetic dataset generation. Unlike traditional methods that are challenging to implement for large datasets and used offline, our method enables on-the-fly classification at scale. Our findings demonstrate the efficacy of combining the Fourier transform with deep learning for enhanced artifact detection in STM images, contributing to more accurate analysis in material science research. Authors: Tommaso Rodani, Alessio Ansuini, Alberto Cazzaniga Journal: ICML ’24 Workshop ML for Life and Material Science: From Theory to Industry Applications Publication date: 17/07/2024 Consult the paper
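The feature-engineering step can be sketched as follows (illustrative only, not the paper's pipeline): stack the grayscale image with the normalised log-magnitude of its 2D FFT, so the periodic structure produced by a duplicated tip becomes an explicit input channel. The synthetic "double tip" below is simply a shifted copy added to a random surface.

```python
import numpy as np

def add_fft_channel(img):
    """Stack a grayscale image with the log-magnitude of its 2D FFT.

    Multi-tip artifacts duplicate surface features at a fixed offset, which
    shows up as periodic structure in the Fourier spectrum.
    """
    spec = np.fft.fftshift(np.fft.fft2(img))
    mag = np.log1p(np.abs(spec))
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)  # scale to [0, 1]
    return np.stack([img, mag], axis=0)  # (2, H, W): two-channel model input

# Synthetic example: a "double tip" image is the surface plus a shifted copy.
rng = np.random.default_rng(0)
surface = rng.normal(size=(64, 64))
double_tip = surface + np.roll(surface, 5, axis=1)
x = add_fft_channel(double_tip / np.abs(double_tip).max())
print(x.shape)
```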
Enhancing predictions of protein stability changes induced by single mutations using MSA-based language models
Abstract Protein language models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a protein language model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data preprocessing to mitigate overfitting. Authors Francesca Cuturello, Marco Celoria, Alessio Ansuini, Alberto Cazzaniga Journal Bioinformatics, 2024, 40 (7) Consult the paper
Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering
Abstract Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository. Authors Federico Barone, Elena Tea Russo, Edith Natalia Villegas Garcia, Marco Punta, Stefano Cozzini, Alessio Ansuini, Alberto Cazzaniga Journal Scientific Data 11, 568 (2024) Date 01/06/2024 Consult the paper
The geometry of hidden representations of large transformer models
Abstract: Large transformers are powerful architectures used for self-supervised data analysis across various data types, including protein sequences, images, and text. In these models, the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next. We characterize the geometric and statistical properties of these representations and how they change as we move through the layers. By analyzing the intrinsic dimension (ID) and neighbor composition, we find that the representations evolve similarly in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets. Based on our findings, we point out an explicit strategy to identify, without supervision, the layers that maximize semantic content: representations at intermediate layers corresponding to a relative minimum of the ID profile are more suitable for downstream learning tasks. Authors: Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, Alberto Cazzaniga Journal: Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track Date: 21/09/2023 Consult the paper
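A common intrinsic dimension estimator in this line of work is TwoNN (Facco et al., 2017), which uses only the ratio of each point's two nearest-neighbour distances. The sketch below applies it to a synthetic 3-dimensional manifold embedded in 50 dimensions; whether and how the paper applies this exact estimator per layer is not stated here, so treat this as an illustration of the quantity being profiled.

```python
import numpy as np

def two_nn_id(X):
    """TwoNN intrinsic dimension estimate.

    With mu_i = r2_i / r1_i the ratio of second- to first-nearest-neighbour
    distances, the maximum-likelihood estimate is N / sum(log mu_i).
    """
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])  # r1, r2 for each point
    mu = r[:, 1] / r[:, 0]
    return len(mu) / np.log(mu).sum()

rng = np.random.default_rng(0)
# A 3D Gaussian cloud linearly embedded in 50 ambient dimensions.
latent = rng.normal(size=(1000, 3))
embed = latent @ rng.normal(size=(3, 50))
id_est = two_nn_id(embed)
print(round(id_est, 1))
```

The estimate comes out close to the latent dimension 3 rather than the ambient dimension 50, which is the property that makes ID profiles informative about hidden representations.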
Speeding‐up pruning for Artificial Neural Networks: Introducing Accelerated Iterative Magnitude Pruning
Abstract: In recent years, pruning of Artificial Neural Networks (ANNs) has become the focus of much research, owing to the extreme overparametrization of such models. This has urged the scientific community to investigate methods for simplifying the structure of weights in ANNs, mainly in an effort to reduce time for both training and inference. Frankle and Carbin [1], and later Renda, Frankle, and Carbin [2], introduced and refined an iterative pruning method which is able to effectively prune the network of a great portion of its parameters with little to no loss in performance. On the downside, this method requires a large amount of time for its application, since, for each iteration, the network has to be trained for (almost) the same number of epochs as the unpruned network. In this work, we show that, for a limited setting, if targeting high overall sparsity rates, this time can be effectively reduced for each iteration, save for the last one, by more than 50%, while yielding a final product (i.e., the final pruned network) whose performance is comparable to the ANN obtained using the existing method. Authors: Marco Zullich, Eric Medvet, Felice Andrea Pellegrino, Alessio Ansuini Journal: 2020 25th International Conference on Pattern Recognition (ICPR) Publication date: 05/05/2021 Consult the publication
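The pruning step inside each iteration can be sketched as follows (a toy layer; the retraining between rounds, weight rewinding, and the paper's shortening of intermediate trainings are all omitted): each round zeroes the 20% smallest-magnitude surviving weights. The matrix size and per-round rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one layer of a trained network.
W = rng.normal(size=(64, 64))
mask = np.ones_like(W, dtype=bool)

# Iterative magnitude pruning: each round removes the 20% smallest-magnitude
# surviving weights. In real IMP the masked network is retrained between
# rounds; the accelerated variant shortens every retraining phase except
# the final one.
for round_ in range(10):
    survivors = np.abs(W[mask])
    k = int(0.2 * survivors.size)
    threshold = np.partition(survivors, k - 1)[k - 1]
    mask &= np.abs(W) > threshold
    W *= mask
    # ... retrain the masked network here before the next round ...

sparsity = 1 - mask.mean()
print(f"final sparsity: {sparsity:.3f}")
```

Ten rounds at 20% per round leave roughly 0.8^10 ≈ 11% of the weights, i.e. about 89% sparsity.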
Investigating Similarity Metrics for Convolutional Neural Networks in the Case of Unstructured Pruning
Abstract: Deep Neural Networks (DNNs) are essential tools of modern science and technology. The current lack of explainability of their inner workings and of principled ways to tame their architectural complexity triggered a lot of research in recent years. There is hope that, by making sense of representations in their hidden layers, we could collect insights on how to reduce model complexity—without performance degradation—by pruning useless connections. It is natural then to ask the following question: how similar are representations in pruned and unpruned models? Even small insights could help in finding principled ways to design good lightweight models, enabling significant savings of computation, memory, time and energy. In this work, we investigate empirically this problem on a wide spectrum of similarity measures, network architectures and datasets. We find that the results depend critically on the similarity measure used and we discuss briefly the origin of these differences, concluding that further investigations are required in order to make substantial advances. Authors: Alessio Ansuini, Eric Medvet, Felice Andrea Pellegrino, Marco Zullich Journal: International Conference on Pattern Recognition Applications and Methods (ICPRAM) Publication date: 23/12/2020 Consult the publication
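One representative similarity measure in this setting is linear Centered Kernel Alignment (CKA); picking it here is an illustrative choice, since the paper compares a spectrum of measures. The sketch compares a representation with a transformed copy of itself and with an unrelated one.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation matrices.

    X, Y: (n_samples, n_features_*) activations of the same inputs in two
    networks (e.g. pruned vs unpruned). Returns a similarity in [0, 1].
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 32))
B = A @ rng.normal(size=(32, 32))   # linear transform of A's representation
C = rng.normal(size=(100, 32))      # unrelated representation

print(round(linear_cka(A, A), 2))   # 1.0
print(round(linear_cka(A, B), 2), round(linear_cka(A, C), 2))
```

CKA is invariant to orthogonal transformations and isotropic scaling, so a representation always scores 1 against itself, while unrelated representations score much lower.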