High Dimensional PCA: A New Model Selection Criterion

Published as an arXiv preprint, 2020

Given a random sample from a multivariate population, estimating the number of large eigenvalues of the population covariance matrix is an important problem in statistics, with wide-ranging applications. In the context of Principal Component Analysis (PCA), this number determines how many linear combinations of the original variables, namely the leading principal components, one should retain to capture the most variation in the data.

In this paper, we study the high-dimensional asymptotic regime, where the number of variables grows at the same rate as the number of observations. We work within the spiked covariance model proposed by Johnstone (2001), which reduces the problem to one of model selection. Our focus is on the Akaike Information Criterion (AIC), shown to be strongly consistent by Bai et al. (2018). Their result, however, requires a “gap condition” ensuring that the dominant population eigenvalues lie above a threshold strictly larger than the BBP threshold (Baik et al., 2005), both thresholds depending on the limiting ratio of the number of variables to the number of observations.
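For intuition, the BBP phase transition is easy to observe numerically. The sketch below is a minimal illustration in NumPy, not code from the paper; the sample sizes and spike values are our own choices. It draws data from a spiked covariance model and compares the largest sample eigenvalue with the right edge (1 + √γ)² of the Marchenko–Pastur bulk, where γ = p/n: a population spike separates from the bulk only once it exceeds the BBP threshold 1 + √γ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1000                     # observations, variables
gamma = p / n                         # limiting ratio, here 0.5
bbp = 1 + np.sqrt(gamma)              # BBP threshold for a population spike
edge = (1 + np.sqrt(gamma)) ** 2      # right edge of the Marchenko-Pastur bulk

for spike in [1.2, bbp, 2.5]:         # below, at, and above the threshold
    # Spiked covariance: one population eigenvalue equal to `spike`, the rest 1.
    scale = np.ones(p)
    scale[0] = np.sqrt(spike)
    X = rng.standard_normal((n, p)) * scale      # rows ~ N(0, diag(spike, 1, ..., 1))
    top = np.linalg.eigvalsh(X.T @ X / n)[-1]    # largest sample eigenvalue
    # Above the threshold, top -> spike * (1 + gamma / (spike - 1)) > edge;
    # below it, top sticks to the bulk edge and the spike is undetectable.
    print(f"spike={spike:.3f}  top={top:.3f}  bulk edge={edge:.3f}")
```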

We explore whether strong consistency of AIC can still hold when the “gap” is made smaller. We show that strong consistency under an arbitrarily small gap is achievable if the penalty term of AIC is suitably modified according to the target gap. Furthermore, we show that an intuitive modification of the penalty term can shrink the gap to exactly zero, although in that case we establish only weak consistency.
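To see how rescaling the penalty plays out, here is a simplified stand-in for such a criterion: the classical Wax–Kailath form of AIC for the number of spikes, with its penalty multiplied by a tuning factor c. This factor and the function name are our own illustrative assumptions, not the criterion derived in the paper.

```python
import numpy as np

def aic_num_spikes(eigvals, n, c=1.0):
    """Pick the number of spikes k minimizing an AIC-type criterion.

    eigvals : sample covariance eigenvalues, sorted in decreasing order
    n       : sample size
    c       : penalty scale; c = 1 gives the classical Wax-Kailath AIC,
              and larger c corresponds to a heavier penalty.
    """
    p = len(eigvals)
    scores = []
    for k in range(p):                        # candidate number of spikes
        tail = eigvals[k:]                    # the p - k smallest eigenvalues
        log_gm = np.mean(np.log(tail))        # log of their geometric mean
        log_am = np.log(np.mean(tail))        # log of their arithmetic mean
        fit = -2.0 * n * (p - k) * (log_gm - log_am)   # -2 * max log-likelihood ratio
        penalty = c * 2.0 * k * (2 * p - k)            # scaled AIC penalty
        scores.append(fit + penalty)
    return int(np.argmin(scores))
```

The only moving part relative to ordinary AIC is c, which trades goodness of fit against the penalty, echoing in crude form the idea of tying the penalty to the target gap.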

PDF


---
title: "Sampling random graphs with specified degree sequences"
collection: publications
permalink: /publication/2020-11-09-MCMC
date: 2020-08-17
venue: "ICPP '20: 49th International Conference on Parallel Processing"
---

[Paper] [Video]

Abstract: Federated Learning (FL) is a fast-developing distributed machine learning technique involving the participation of a massive number of user devices. While FL offers the benefits of data privacy and access to abundant user-generated data, heterogeneity across users’ data and devices complicates algorithm design and convergence analysis. To tackle these challenges, we propose an algorithm that exploits proximal stochastic variance reduced gradient methods for non-convex FL. The proposed algorithm consists of two nested loops, which allow user devices to update their local models approximately, up to an accuracy threshold (inner loop), before sending these local models to the server for the global model update (outer loop). We characterize the convergence conditions for both local and global model updates and extract various insights from these conditions via the algorithm’s parameter control. We also show how to optimize these parameters so that the training time of FL is minimized. Experimental results not only validate the theoretical convergence but also show that the proposed algorithm outperforms existing Stochastic Gradient Descent-based methods in terms of convergence speed in the FL setting.
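To make the two-loop structure concrete, here is a toy, single-machine simulation of the scheme sketched in the abstract: an outer loop in which a server averages client models, and an inner loop in which each client takes SVRG-style variance-reduced steps with a proximal term pulling it toward the current global model. All names (`make_client`, `local_update`, `mu`, the learning rate) are our own illustrative choices, not the paper’s API, and a fixed number of inner steps stands in for the paper’s accuracy threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_client(d=10, m=50):
    """A toy heterogeneous client: its own least-squares problem."""
    A = rng.standard_normal((m, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(m)
    return A, b

def grad(A, b, x):
    """Gradient of the local loss 0.5 * ||Ax - b||^2 / m."""
    return A.T @ (A @ x - b) / len(b)

def local_update(A, b, w_global, mu=0.1, lr=0.05, inner_steps=20):
    """Inner loop: prox-SVRG-style steps anchored at the global model."""
    full_g = grad(A, b, w_global)            # full local gradient at the anchor
    x = w_global.copy()
    for _ in range(inner_steps):
        i = rng.integers(len(b), size=8)     # minibatch indices
        # Variance-reduced gradient plus a proximal pull toward w_global.
        g = grad(A[i], b[i], x) - grad(A[i], b[i], w_global) + full_g
        x -= lr * (g + mu * (x - w_global))
    return x

clients = [make_client() for _ in range(5)]
w = np.zeros(10)
for _ in range(30):                          # outer loop: server rounds
    w = np.mean([local_update(A, b, w) for A, b in clients], axis=0)

print("final average loss:",
      np.mean([0.5 * np.mean((A @ w - b) ** 2) for A, b in clients]))
```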

Recommended citation: Abhinav Chakraborty, Soumendu Sundar Mukherjee, Arijit Chakrabarti. (2020). "High Dimensional PCA: A New Model Selection Criterion." arXiv preprint arXiv:2011.04470.
Download Paper