

Project Description
Pick your favourite high-dimensional dataset and analyse it in order to extract insights into its properties.
The emphasis here is not on implementing machine learning models per se, but on analysing the data itself and understanding it in its own right.
Various examples are offered on this page. You are not obligated to follow any of them and are free to design your own data analysis.
Many of the options here involve considerable self-study and research, through which you can gain entry into the world of data analysis.
Whatever you choose must be well thought out and reasoned, and this reasoning must appear in your final report.

Submission
Guidelines
You are to adhere to the following.
- Submit a public GitHub link containing your project.
- Your project will contain a PDF file providing a thorough account of the merits of the dataset chosen for analysis, as well as your findings and the methods used to obtain them.
- Your report will be written in English (at the Cambridge level).
- Your report will contain various visualisations of the dataset corresponding to your analysis.
- Python code will be present in your GitHub repository alongside the report.


Project Grade
Here are some highlights regarding how the grade will be determined.
- You are to impress the lecturer with the dataset chosen and the techniques used to analyse it.
- The higher the level of technicality and the deeper your insights about the data, the higher your grade will be.
- You are expected to defend everything you chose to present in your report.
- Make sure your report is exhaustive and complete.
- Make sure the report looks professional and avoid embarrassing mistakes (e.g., your English should be flawless).
- You MUST use LLMs or Cursor AI to generate most of your code, and you must properly document your interactions with AI tools.
- Avoiding heavy use of AI tools will result in a very low grade.

Intrinsic Dimension Estimation
Implement methods for estimating the intrinsic dimension of high-dimensional data, such as:
- Correlation dimension
- PCA-based dimensionality estimation
- k-nearest-neighbour (k-NN) based estimators (a minimal sketch follows this list)
- Fisher separability analysis
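As a taste of the k-NN family, here is a minimal sketch of the Levina-Bickel maximum-likelihood estimator, one standard choice (the function name and parameter values are our own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_mle_dimension(X, k=10):
    """Levina-Bickel maximum-likelihood estimate of intrinsic dimension."""
    D = cdist(X, X)                   # pairwise Euclidean distances
    D.sort(axis=1)                    # row-wise; D[i, 0] == 0 is the self-distance
    T = D[:, 1:k + 1]                 # distances to the k nearest neighbours
    # Per-point estimate: inverse of the mean log-ratio T_k / T_j, j = 1..k-1
    inv_m = np.log(T[:, -1][:, None] / T[:, :-1]).mean(axis=1)
    return 1.0 / inv_m.mean()         # average the inverses, then invert

# Sanity check: a 2D plane embedded in 10D should give an estimate near 2.
rng = np.random.default_rng(0)
plane = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
print(knn_mle_dimension(plane, k=10))
```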
Intrinsic Dimension Estimation: Implementation Idea
Goal
Write a Python program to estimate the correlation dimension of a high-dimensional dataset. Compare different datasets to see how dimensionality behaves in real-world cases.
Steps to Implement (a minimal end-to-end sketch follows this list)
- Generate Synthetic Datasets:
  - High-dimensional Gaussian noise (expected D_2 ≈ d).
  - Points sampled from a low-dimensional submanifold (e.g., a 2D plane in 10D space).
- Compute the Correlation Integral:
  - Calculate pairwise distances.
  - Count the number of pairs within radius r.
  - Compute C(r) for different values of r.
- Estimate D_2 from the Log-Log Plot:
  - Plot log C(r) vs. log r.
  - Use linear regression to estimate the slope.
- Compare Different Datasets:
  - High-dimensional Gaussian noise.
  - Low-dimensional manifold data.
  - Real-world datasets (e.g., PCA-reduced image data).
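To make the steps concrete, here is a minimal end-to-end sketch; the radius range, sample sizes, and dataset choices below are arbitrary and should be tuned:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_radii=20):
    """Estimate D_2 via the Grassberger-Procaccia correlation integral."""
    d = pdist(X)                                    # all pairwise distances
    # Probe radii spanning the lower-middle of the distance distribution
    radii = np.logspace(np.log10(np.percentile(d, 1)),
                        np.log10(np.percentile(d, 50)), n_radii)
    C = np.array([(d < r).mean() for r in radii])   # C(r): fraction of close pairs
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope                                    # slope of log C(r) vs. log r

rng = np.random.default_rng(0)
noise = rng.normal(size=(1500, 10))                            # fills 10D space
plane = rng.normal(size=(1500, 2)) @ rng.normal(size=(2, 10))  # 2D manifold in 10D
print(correlation_dimension(noise))  # near 10 (finite samples bias it downwards)
print(correlation_dimension(plane))  # near 2
```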
Detailed Example: Correlation Dimension
The correlation dimension is a method for estimating the intrinsic dimensionality of a dataset. Unlike the ambient dimension (i.e., the number of features), the correlation dimension quantifies how the data points are distributed in space and how they scale as you zoom in. It is particularly useful for analysing high-dimensional datasets where the true degrees of freedom may be much lower than the number of observed features.
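For reference, the standard Grassberger-Procaccia definitions, consistent with the steps above:

```latex
C(r) = \frac{2}{N(N-1)} \,\#\bigl\{(i,j) : i < j,\ \lVert x_i - x_j \rVert < r\bigr\},
\qquad
D_2 = \lim_{r \to 0} \frac{\log C(r)}{\log r}.
```

In practice the limit is replaced by the slope of log C(r) against log r over a suitable range of radii.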

Why is Correlation Dimension Useful?
- Identifies Intrinsic Dimensionality: If the data lies on a lower-dimensional manifold within a high-dimensional space, the correlation dimension will be much smaller than the ambient dimension.
- Useful for High-Dimensional Geometry: Many real-world datasets (e.g., natural images, biological data) lie on low-dimensional structures, despite having many observed variables.
- Differentiates Random vs. Structured Data: Random high-dimensional points tend to fill the space, yielding a correlation dimension close to the ambient dimension, while structured datasets (e.g., points on a low-dimensional manifold) have a correlation dimension significantly lower than the ambient dimension.
Intrinsic Dimension Estimation
We have outlined but a single option in this category. Naturally, this is not impressive enough for an entire project. You should strive to implement more techniques.
If you are unsure about any of the other options, simply ask ChatGPT for suggestions; for instance, you could ask it to elaborate on the PCA option of this category.
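For illustration, here is our own minimal sketch of the PCA option (the 95% explained-variance threshold is an arbitrary, commonly used choice):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_dimension(X, var_threshold=0.95):
    """Number of principal components needed to reach var_threshold of the variance."""
    cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    return int(np.searchsorted(cumvar, var_threshold) + 1)

rng = np.random.default_rng(0)
# A noisy 3D subspace embedded in 50 dimensions
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 50))
X += 0.01 * rng.normal(size=X.shape)
print(pca_dimension(X))  # expect 3
```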

Additional Project Examples
Concentration of Measure and Random Projections
- Empirically verify concentration inequalities in high dimensions (e.g., Hoeffding's inequality, Chernoff bounds).
- Implement the Johnson-Lindenstrauss lemma for random projections and visualise how distances are preserved in lower dimensions (a minimal sketch follows).
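A minimal sketch of the Johnson-Lindenstrauss experiment, using a plain Gaussian projection matrix (the dimensions below are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 500, 1000, 50             # points, ambient dimension, target dimension
X = rng.normal(size=(n, d))

# Gaussian projection scaled so squared distances are preserved in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

ratios = pdist(Y) / pdist(X)        # per-pair distortion
print(ratios.mean(), ratios.std())  # mean near 1, spread shrinking as k grows
```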
Random Matrix Theory in High-Dimensional Data
- Compute and study empirical spectral distributions of random matrices.
- Verify results such as the Marčenko-Pastur law and the semicircle law (a minimal sketch follows this list).
- Investigate the condition number of large Gaussian random matrices.
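A minimal sketch for the Marčenko-Pastur bullet (sample size and aspect ratio are arbitrary), comparing the empirical eigenvalue histogram of a white-noise sample covariance to the theoretical density:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, p = 2000, 500                         # samples, features
q = p / n                                # aspect ratio (q <= 1 here)
X = rng.normal(size=(n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)   # spectrum of the sample covariance

# Marchenko-Pastur density (unit variance), support [(1-sqrt(q))^2, (1+sqrt(q))^2]
lo, hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2
x = np.linspace(lo, hi, 400)
density = np.sqrt((hi - x) * (x - lo)) / (2 * np.pi * q * x)

plt.hist(eigs, bins=60, density=True, alpha=0.5, label="empirical spectrum")
plt.plot(x, density, label="Marchenko-Pastur")
plt.legend()
plt.show()
```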
High-Dimensional Anomaly Detection via Statistical Methods
- Develop methods for detecting outliers based on the Mahalanobis distance and leverage statistical thresholds (a minimal sketch follows this list).
- Explore how classical statistical tools (e.g., z-scores, quantile-based methods) behave in high dimensions.
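A minimal sketch of the Mahalanobis detector; the 99% chi-squared threshold is one standard, assumed choice:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
X[:10] += 4.0                       # plant a few obvious outliers

mu = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of every point from the sample mean
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Under Gaussianity, d2 is approximately chi-squared with d degrees of freedom
threshold = chi2.ppf(0.99, df=d)
print("flagged outliers:", np.where(d2 > threshold)[0])
```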
Empirical Verification of the Curse of Dimensionality
Explore how distance metrics behave in high dimensions (a minimal sketch follows this list) by analysing:
- Pairwise distances between points
- Nearest-neighbour distributions
- The ratio of the volume of a high-dimensional sphere to that of its bounding cube
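A minimal sketch covering these points (sample sizes arbitrary); the contrast between nearest and farthest neighbours collapses, and the sphere-to-cube volume ratio vanishes, as d grows:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.special import gammaln

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.nan)          # ignore self-distances
    nearest, farthest = np.nanmin(D, axis=1), np.nanmax(D, axis=1)
    contrast = ((farthest - nearest) / nearest).mean()
    # Volume of the unit ball over the volume of its bounding cube [-1, 1]^d
    vol_ratio = np.exp(d / 2 * np.log(np.pi) - gammaln(d / 2 + 1) - d * np.log(2))
    print(d, contrast, vol_ratio)
```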
High-Dimensional Data Visualisation via Projections
Implement different projection techniques to study structure in high-dimensional data (a minimal sketch follows this list):
- Random 2D projections
- PCA and its explained variance
- t-SNE/UMAP (for exploratory analysis, not classification)
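A minimal sketch comparing the three projections, using the scikit-learn digits data purely as a stand-in for your own dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)      # 64-dimensional image data
rng = np.random.default_rng(0)

embeddings = {
    "random projection": X @ rng.normal(size=(X.shape[1], 2)),
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X),
}
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, Z) in zip(axes, embeddings.items()):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(name)
plt.show()
```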
Covariance and Correlation in High Dimensions
- Study the properties of empirical covariance matrices in high-dimensional regimes.
- Explore shrinkage estimators (e.g., Ledoit-Wolf shrinkage) and their effect on estimated eigenvalues (a minimal sketch follows this list).
- Compare the sample correlation matrix with the population correlation matrix.
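A minimal sketch of the shrinkage bullet, using scikit-learn's LedoitWolf estimator (sample sizes arbitrary):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
n, p = 100, 200                     # fewer samples than features
X = rng.normal(size=(n, p))         # population covariance is the identity

sample_cov = np.cov(X, rowvar=False)
lw_cov = LedoitWolf().fit(X).covariance_

# All population eigenvalues equal 1; the sample spectrum is badly spread out,
# while shrinkage pulls the eigenvalues back toward their common mean.
print("sample extremes:", np.linalg.eigvalsh(sample_cov)[[0, -1]])
print("shrunk extremes:", np.linalg.eigvalsh(lw_cov)[[0, -1]])
```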
Empirical Behaviour of High-Dimensional Convex Hulls
- Generate random points in high dimensions and study properties of their convex hulls.
- Analyse the fraction of points that lie on the boundary in high dimensions (a minimal sketch follows this list).
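A minimal sketch of the boundary-fraction experiment; note that Qhull becomes expensive beyond roughly 8 dimensions, so this sketch stays low-dimensional:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
n = 500
for d in (2, 3, 5, 8):
    X = rng.normal(size=(n, d))
    hull = ConvexHull(X)
    # Fraction of points that are hull vertices tends to 1 as d grows
    print(d, len(hull.vertices) / n)
```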
Gaussian Annulus Theorem and High-Dimensional Geometry
- Empirically test and illustrate the theorem stating that high-dimensional Gaussian distributions concentrate in an annulus.
- Compute distances of high-dimensional Gaussian points from the origin and analyse their distribution (a minimal sketch follows this list).
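A minimal sketch of the annulus experiment (sample sizes arbitrary); the norms concentrate around sqrt(d) with a spread that stays O(1):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    X = rng.normal(size=(5000, d))
    norms = np.linalg.norm(X, axis=1)
    # Mean norm approaches sqrt(d); the standard deviation stays roughly constant
    print(d, np.sqrt(d), norms.mean(), norms.std())
```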
Sparse Recovery and Compressed Sensing Experiments
- Implement experiments on sparse signal recovery using random projections.
- Investigate phase transition phenomena in the success of sparse recovery based on the number of measurements (a minimal sketch follows this list).
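A minimal sketch of such an experiment, using scikit-learn's orthogonal matching pursuit as the recovery solver (one possible choice among many; signal length and sparsity are arbitrary):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
d, k = 200, 5                               # signal length, sparsity
x = np.zeros(d)
x[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)

for m in (10, 20, 40, 80):                  # number of random measurements
    A = rng.normal(size=(m, d)) / np.sqrt(m)
    y = A @ x
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    x_hat = omp.fit(A, y).coef_
    # Recovery flips from failure to success as m crosses a threshold
    print(m, np.allclose(x_hat, x, atol=1e-6))
```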