PhD student: Ottavio Khalifa
Title: Clustering longitudinal categorical data
Supervisor: François Petit
Doctoral school: ED 393 Epidemiology and Biomedical Information Sciences, Université Paris Cité
Thesis topic:
Longitudinal data are frequently used in clinical epidemiology to characterize the evolution of a pathology, or the response to a treatment, over time. Numerous approaches exist for analyzing longitudinal data. One of them is to identify families of trajectories, which can either inform patients and physicians of the likely course of the disease or define target groups for specific interventions.
These trajectory families are generally identified using clustering methods, so the question considered here can be seen as a time series clustering problem. One of the difficulties is that clustering methods for categorical data, and in particular for categorical time series, are scarce in the literature. Progress has recently been made on clustering in the static context, but much remains to be done in the dynamic context (longitudinal data).
This thesis project focuses on the evaluation and development of clustering methods for time series with categorical (or mixed) values.
1/ Which clustering methods are suitable for identifying patient trajectories where the essential data are categorical in nature?
The data encountered in stratified medicine present certain particularities: numerous categorical variables, sparse data, etc. Here, one of the major challenges will be to identify symptom evolution trajectories. Although a number of clustering methods for categorical-valued time series exist, these techniques are poorly documented and have not been exhaustively compared. We will evaluate them extensively in order to identify the methods best suited to clustering symptom trajectories. We will then apply the selected method to the clustering of trajectories of patients suffering from long Covid.
2/ Can time series clustering methods based on topological data analysis techniques be adapted to the case of time series with categorical values?
We will use synthetic data generated from models to be developed as part of the thesis. We will also use data from the ComPaRe long Covid cohort. These data describe the evolution of long Covid symptoms in around 1,200 patients over time, with questionnaires every 60 days over more than 2 years.
We will begin by carrying out a systematic review of the literature to identify potentially exploitable clustering methods. We will then study the possibility of adapting topological or geometric clustering methods to the case of time series with categorical values. Articles are a possible starting point. The various methods will be evaluated on synthetic data generated from models developed as part of this work, which will allow a large number of scenarios to be explored. This will require careful thought about which metrics are relevant for comparing the different algorithms. Finally, the selected clustering method will be applied to the ComPaRe long Covid cohort.
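To make the evaluation loop described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method the thesis will adopt: the synthetic generator is assumed to be a first-order Markov chain per trajectory family, the dissimilarity is assumed to be the Hamming distance, the clustering algorithm is a basic k-medoids, and agreement with the true families is measured with the adjusted Rand index. All names (STATES, simulate, k_medoids, demo) and all transition weights are hypothetical.

```python
import random
from math import comb

STATES = ["none", "mild", "severe"]  # hypothetical symptom levels


def simulate(transition, length, rng):
    """Draw one categorical trajectory from a first-order Markov chain."""
    state = rng.choice(STATES)
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(STATES, weights=transition[state])[0]
        seq.append(state)
    return seq


def hamming(a, b):
    """Number of time points at which two equal-length trajectories differ."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(seqs, k, rng, n_iter=20):
    """Basic alternating k-medoids on the Hamming dissimilarity."""
    n = len(seqs)
    dist = [[hamming(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)
    labels = [0] * n
    for _ in range(n_iter):
        # Assignment step: attach each trajectory to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimizes within-cluster distance.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def adjusted_rand_index(truth, pred):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(truth)
    cells, rows, cols = {}, {}, {}
    for t, p in zip(truth, pred):
        cells[(t, p)] = cells.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def demo(seed=0, per_family=30, length=12):
    """Simulate two hypothetical families, cluster them, and score the result."""
    rng = random.Random(seed)
    # Assumed transition weights: one family tends to recover, one to worsen.
    recovering = {"none": [8, 1, 1], "mild": [6, 3, 1], "severe": [2, 6, 2]}
    worsening = {"none": [2, 6, 2], "mild": [1, 3, 6], "severe": [1, 1, 8]}
    seqs, truth = [], []
    for label, transition in enumerate([recovering, worsening]):
        for _ in range(per_family):
            seqs.append(simulate(transition, length, rng))
            truth.append(label)
    pred = k_medoids(seqs, k=2, rng=rng)
    return adjusted_rand_index(truth, pred)


if __name__ == "__main__":
    print("Adjusted Rand index:", demo())
```

In this sketch, varying the transition weights, the trajectory length, or the number of patients per family is one simple way to explore different scenarios. The Hamming distance ignores temporal shifts and any ordering of the categories, which is precisely the kind of limitation a comparison of methods would need to probe.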