Article
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
https://scholar.colorado.edu/concern/articles/8w32r634h
- Abstract
- We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.
- Date Issued
- 2017-05-01
- Journal Issue/Number
- 5
- Journal Volume
- 63
- Last Modified
- 2019-12-05
- ISSN
- 0018-9448
- Language
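The compression scheme summarized in the abstract (precondition each sample, then keep a small random subset of its components) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the preconditioner is assumed to be random sign flips followed by an orthonormal DCT, the sampler is assumed to keep exactly `m` coordinates per sample uniformly at random, and the function names `dct_matrix`, `precondition`, `sparsify`, and `mean_estimate` are hypothetical.

```python
import numpy as np


def dct_matrix(p):
    """Orthonormal DCT-II matrix of size p x p (assumed preconditioning transform)."""
    idx = np.arange(p)
    C = np.sqrt(2.0 / p) * np.cos(np.pi * (idx[None, :] + 0.5) * idx[:, None] / p)
    C[0, :] /= np.sqrt(2.0)  # rescale the first row so that C @ C.T is the identity
    return C


def precondition(X, rng):
    """Flip the sign of each coordinate at random, then rotate every row by the DCT.

    Both steps are orthogonal, so row norms are preserved while the energy of
    each sample is spread more evenly across its coordinates.
    """
    p = X.shape[1]
    signs = rng.choice([-1.0, 1.0], size=p)
    return (X * signs) @ dct_matrix(p).T


def sparsify(Y, m, rng):
    """Keep m randomly chosen entries of each row of Y and zero out the rest.

    Each row is handled independently, so the data can be processed in a
    single pass, one sample at a time.
    """
    n, p = Y.shape
    S = np.zeros_like(Y)
    for i in range(n):
        keep = rng.choice(p, size=m, replace=False)
        S[i, keep] = Y[i, keep]
    return S


def mean_estimate(S, m):
    """Rescaled column mean of the sparsified data.

    Every entry survives with probability m / p, so multiplying by p / m
    makes this an unbiased estimate of the column mean of the preconditioned data.
    """
    p = S.shape[1]
    return (p / m) * S.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 64))      # synthetic data: 1000 samples, 64 features
    Y = precondition(X, rng)
    S = sparsify(Y, m=8, rng=rng)        # keep 8 of 64 components per sample
    print("fraction of entries kept:", np.count_nonzero(S) / S.size)
    print("mean estimate error:", np.linalg.norm(mean_estimate(S, 8) - Y.mean(axis=0)))
```

In practice the sparsified rows would be stored in a sparse matrix format so that downstream PCA or K-means steps only touch the retained entries; the dense array here just keeps the sketch short.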
Relationships
Items
Title | Date Uploaded | Visibility |
---|---|---|
preconditionedDataSparsificationForBigDataWithApplications.pdf | 2019-12-05 | Public |