Abstract:
Real-world data in biology, materials science, medicine, and beyond typically contain a large number of features that are heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, for instance between two cells or two patients, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for successful data analysis.
We introduce a statistical test that can assess the relative information retained when using different distance measures, and determine whether they are equivalent, independent, or whether one is more informative than the other. This test can be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. The approach can be used to perform feature selection in molecular modeling and clinical analysis, and to infer causality in high-dimensional time series.