Optimizing Data Pre-Processing Transformations with Reinforcement Learning

Graduate Thesis Or Dissertation

Citations

Citeable URL: https://scholar.colorado.edu/concern/graduate_thesis_or_dissertations/cf95jd01d

Abstract

In this work, we use Reinforcement Learning (RL) to optimize the data pre-processing transformations in a machine learning pipeline, given a dataset X, a child algorithm f(·), and an action space A. Inspired by "Effective data pre-processing for AutoML" [4] and Learn2Clean [1], we construct a model that 1) does not specify a data pre-processing pipeline structure in advance and 2) does not depend on transformation-specific rules or empirical calculations across multiple datasets. Using a simple policy optimization scheme with Naive Bayes (NB) as the child algorithm, we produce results comparable to [4] on multiple datasets from the OpenML-CC18 benchmark suite. We accomplish this solely by finding an optimal order of actions sampled from the action space A, with the parameters of each action kept at their default values. We hope that this model can serve as a basis for future projects, including studying how such data transformations affect the manifold structure and adding a conditional aspect to the model to make it more efficient.
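
As a rough illustration of what policy optimization over an ordering of pre-processing actions could look like, the sketch below uses a REINFORCE-style update with a moving-average baseline. This is an assumption made for illustration, not the scheme confirmed by the thesis. The names `optimize_order`, `sample_sequence`, and `reward_fn` are hypothetical, the fixed number of pipeline steps is a simplification (a "noop" action can stand in for shorter pipelines), and the reward is a pluggable callable that, in the setting described in the abstract, would be the child algorithm's validation score after applying the sampled actions with default parameters.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(logits, rng):
    """Sample one action index per pipeline step from per-step softmax policies."""
    seq = []
    for step_logits in logits:
        probs = softmax(step_logits)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq


def optimize_order(actions, reward_fn, steps=3, episodes=2000, lr=0.3, seed=0):
    """REINFORCE over action orderings (illustrative assumption, not the thesis code).

    actions   -- list of action names (hypothetical labels for transformations)
    reward_fn -- callable mapping a list of action names to a float reward,
                 e.g. validation accuracy of the child algorithm
    Returns the greedy (highest-logit) action name for each pipeline step.
    """
    rng = random.Random(seed)
    n = len(actions)
    logits = [[0.0] * n for _ in range(steps)]  # one independent policy per step
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(episodes):
        seq = sample_sequence(logits, rng)
        reward = reward_fn([actions[i] for i in seq])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for t, chosen in enumerate(seq):
            probs = softmax(logits[t])
            for a in range(n):
                # Gradient of log softmax: indicator(a == chosen) - probs[a]
                grad = (1.0 if a == chosen else 0.0) - probs[a]
                logits[t][a] += lr * advantage * grad
    return [actions[max(range(n), key=lambda a: row[a])] for row in logits]


if __name__ == "__main__":
    # Toy stand-in reward: fraction of positions matching a made-up target order.
    target = ["impute", "scale", "select"]

    def toy_reward(seq):
        return sum(a == b for a, b in zip(seq, target)) / len(target)

    print(optimize_order(["impute", "scale", "select", "noop"], toy_reward))
```

In practice, `reward_fn` would wrap the expensive step: apply the sampled transformations to X in order, fit the child algorithm f(·), and return its held-out score. The toy reward above exists only so the sketch runs without external libraries or data.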

Creator