Benchmarking distance-based partitioning methods for mixed-type data
Authors: Efthymios Costa, Ioanna Papatsouma, and Angelos Markos
Advances in Data Analysis and Classification, 2023
Clustering mixed-type data, that is, observation-by-variable data that consist of both continuous and categorical variables, poses novel challenges. Foremost among these challenges is the choice of the most appropriate clustering method for the data. This paper presents a benchmarking study comparing eight distance-based partitioning methods for mixed-type data in terms of cluster recovery performance. A series of simulations carried out under a full factorial design is presented, examining the effect of a variety of factors on cluster recovery. The amount of cluster overlap, the percentage of categorical variables in the data set, the number of clusters and the number of observations had the largest effects on cluster recovery. In most of the tested scenarios, KAMILA, K-Prototypes and sequential Factor Analysis and K-Means clustering typically performed better than other methods. The study can be a useful reference for practitioners in the choice of the most appropriate method.
@article{costa2023benchmarking, title={Benchmarking distance-based partitioning methods for mixed-type data}, author={Costa, Efthymios and Papatsouma, Ioanna and Markos, Angelos}, journal={Advances in Data Analysis and Classification}, volume={17}, number={3}, pages={701--724}, year={2023}, publisher={Springer} }
Preprints
Cluster Analysis
A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data
Authors: Efthymios Costa, Ioanna Papatsouma, and Angelos Markos
Under review, 2024
In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach is built on the deterministic variant of the Information Bottleneck algorithm, designed to optimally compress data while preserving its relevant structural information. We evaluate the performance of our method against four well-established clustering techniques for mixed-type data -- KAMILA, K-Prototypes, Factor Analysis for Mixed Data with K-Means, and Partitioning Around Medoids using Gower's dissimilarity -- using both simulated and real-world datasets. The results highlight that the proposed approach offers a competitive alternative to traditional clustering techniques, particularly under specific conditions where heterogeneity in data poses significant challenges.
@misc{costa2024dibmix, title={A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data}, author={Costa, Efthymios and Papatsouma, Ioanna and Markos, Angelos}, year={2024}, eprint={2407.03389}, archivePrefix={arXiv}, primaryClass={stat.ME}, howpublished={arXiv preprint}, url={https://arxiv.org/abs/2407.03389} }
Outlier Detection
A novel framework for quantifying nominal outlyingness
Authors: Efthymios Costa and Ioanna Papatsouma
Under review, 2024
Outlier detection is an important data mining tool that becomes particularly challenging when dealing with nominal data. First and foremost, flagging observations as outlying requires a well-defined notion of nominal outlyingness. This paper presents a definition of nominal outlyingness and introduces a general framework for quantifying outlyingness of nominal data. The proposed framework makes use of ideas from the association rule mining literature and can be used for calculating scores that indicate how outlying a nominal observation is. Methods for determining the involved hyperparameter values are presented and the concepts of variable contributions and outlyingness depth are introduced, in an attempt to enhance interpretability of the results. The proposed framework is evaluated on both synthetic and real-world data sets, demonstrating comparable performance to state-of-the-art frequent pattern mining algorithms and even outperforming them in certain cases. The ideas presented can serve as a tool for assessing the degree to which an observation differs from the rest of the data, under the assumption of sequences of nominal levels having been generated from a Multinomial distribution with varying event probabilities.
@misc{costa2024nominalouts, title={A novel framework for quantifying nominal outlyingness}, author={Efthymios Costa and Ioanna Papatsouma}, year={2024}, eprint={2408.07463}, archivePrefix={arXiv}, primaryClass={stat.ME}, howpublished={arXiv preprint}, url={https://arxiv.org/abs/2408.07463} }
Conference Papers
Cluster Analysis
A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data
Authors: Efthymios Costa, Ioanna Papatsouma, and Angelos Markos
Data Science, Classification, and Artificial Intelligence for Modeling Decision Making (IFCS 2024), 2025
In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The method is a variant of the Deterministic Information Bottleneck algorithm which optimally compresses the data while retaining relevant information about the underlying structure. We compare the performance of the proposed method to that of three well-established clustering methods (KAMILA, K-Prototypes, and Partitioning Around Medoids with Gower’s dissimilarity) on simulated and real-world datasets. The results demonstrate that the proposed approach represents a competitive alternative to conventional clustering techniques under specific conditions.
@inproceedings{costa2024deterministic, title={A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data}, author={Costa, Efthymios and Papatsouma, Ioanna and Markos, Angelos}, booktitle={Conference of the International Federation of Classification Societies}, pages={81--88}, year={2024}, organization={Springer} }
Discussion Contributions
Statistical Inference
Contribution to the discussion of Dümbgen and Davies (2025) "Connecting Model-Based and Model-Free Approaches to Linear Least Squares Regression"
Authors: Efthymios Costa and Ioanna Papatsouma
Statistica (to appear), 2025
Editorial & Peer-Review Service
Below is a list of journals in which I have served as a peer reviewer.