# Accounting for Factor Variables in Big Data Regression

Tonglin Zhang, Baijian Yang

December 2020

### Abstract

Continuous and factor explanatory variables are both important in linear regression. To fit a linear model with factor variables, the traditional implementation of the least squares approach defines a set of dummy variables. However, this approach is difficult to apply to big data because a single factor variable can inflate the size of the design matrix significantly, even if the number of factor levels is only moderately large. By treating the factor variable as an index, this study proposes a new approach, called the index least squares approach, to overcome this difficulty. Combined with the technique of scanning data by rows, the index least squares approach can provide exact solutions simultaneously to a group of linear models with factor variables. It therefore avoids the memory barrier caused by the size of the design matrix. Because the memory needed is unrelated to the number of observations, the index least squares approach can be used even when a massive data set is hundreds of times larger than the memory available to the computing system.
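The idea of using a factor as an index while scanning rows can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simplified model with one factor and one continuous covariate, `y = alpha[level] + beta * x`, and relies on the standard within-group identity for least squares with group intercepts. The function name `index_least_squares` and the `(level, x, y)` row layout are hypothetical.

```python
from collections import defaultdict


def index_least_squares(rows):
    """Fit y = alpha[level] + beta * x in one pass over (level, x, y) rows.

    Illustrative helper (an assumption, not taken from the paper): only
    running sums are stored, so memory grows with the number of factor
    levels and not with the number of rows.
    """
    n = defaultdict(int)     # number of observations per level
    sx = defaultdict(float)  # sum of x per level
    sy = defaultdict(float)  # sum of y per level
    sxx = 0.0                # global sum of x * x
    sxy = 0.0                # global sum of x * y

    # The factor level is used directly as a dictionary key (an index);
    # no dummy-variable columns are ever materialized.
    for level, x, y in rows:
        n[level] += 1
        sx[level] += x
        sy[level] += y
        sxx += x * x
        sxy += x * y

    # Within-level sums of squares and cross-products.
    wxx = sxx - sum(sx[g] ** 2 / n[g] for g in n)
    wxy = sxy - sum(sx[g] * sy[g] / n[g] for g in n)

    # Undefined (division by zero) if x is constant within every level.
    beta = wxy / wxx
    alpha = {g: (sy[g] - beta * sx[g]) / n[g] for g in n}
    return alpha, beta


# Toy data generated exactly from alpha = {"a": 1, "b": 5} and beta = 2.
rows = [("a", 0.0, 1.0), ("a", 1.0, 3.0), ("b", 0.0, 5.0), ("b", 2.0, 9.0)]
alpha, beta = index_least_squares(iter(rows))
```

Because `rows` can be any iterator, such as a reader over a file on disk, the full data set never needs to be held in memory. With several continuous covariates, the scalar sums would become small matrices, and the uncentered running sums used here may lose precision on poorly scaled data.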

### Publication

Statistica Sinica
