An exhaustive search regression built on base R

Usage

regsearch(
  data,
  dependent,
  independent,
  minvar = 1,
  maxvar,
  family,
  topN = 0,
  interactions = FALSE,
  multi = FALSE,
  ...
)

Arguments

data: A `data.frame` that contains a dependent variable and the independent variables.
dependent: The name of the dependent variable column in `data`.
independent: A character vector of independent variables. Each entry must be a column name from `data` or an interaction term built from those column names. Listing interaction terms here lets you test specific interactions instead of every possible one, which is what `interactions = TRUE` does.
minvar: (Optional) The minimum number of independent variables to be used in the regression. Defaults to 1.
maxvar: The maximum number of independent variables to be used in the regression. Must be equal to or less than the number of independent variables. If interaction terms are used, they count as one independent variable.
family: The type of regression. Passed to `glm`. See `?glm` for more information.
topN: (Optional) The number of top results to be printed upon run completion. Defaults to 0.
interactions: (Optional) A boolean indicating whether or not interaction terms should be used. Defaults to `FALSE`.
multi: (Optional) A boolean indicating whether or not multithreading should be used. Defaults to `FALSE`. Multithreading is strongly recommended, especially when `interactions = TRUE`.
...: (Optional) Additional arguments to be passed to `glm`.
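
The effect of `minvar` and `maxvar` on the size of the search can be sketched in base R. This is a minimal illustration, not the package's implementation: `count_models` and `build_formulas` are hypothetical helpers introduced here, and the assumption that each pairwise interaction counts as one extra term follows the description of `maxvar` above.

```r
# Hypothetical helper: number of candidate models when choosing between
# minvar and maxvar terms out of n_terms available terms.
count_models <- function(n_terms, minvar = 1, maxvar = n_terms) {
  sum(choose(n_terms, minvar:maxvar))
}

# Hypothetical helper: enumerate every formula as a character string.
build_formulas <- function(dependent, independent,
                           minvar = 1, maxvar = length(independent)) {
  unlist(lapply(minvar:maxvar, function(k) {
    # combn() applies FUN to each combination of k terms
    combn(independent, k, FUN = function(terms) {
      paste(dependent, "~", paste(terms, collapse = " + "))
    })
  }))
}

vars <- c("ind_1", "ind_2", "ind_3", "ind_4")
formulas <- build_formulas("dependent", vars, minvar = 1, maxvar = 4)

length(formulas)        # 15 formulas for 4 variables
count_models(4, 1, 4)   # 15
# 4 variables plus 6 pairwise interactions, treated as 10 terms:
count_models(10, 1, 4)  # 385
```

The counts 15 and 385 match the "Creating ... formulas" messages in the examples further down, which suggests the package enumerates candidates in a similar way, although that is inferred from the output rather than confirmed. The rapid growth of the count is the reason multithreading is recommended for larger searches.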

Value

Returns a `data.table` with one row per regression run. Rows are sorted in descending order by `rSquare` divided by the mean p-value of the regression's terms. In practice this tends to push regressions with a better fit and more significant terms toward the top of the list.

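`formula`: The regression formula used.
`aic`: The AIC for the regression.
`bic`: The BIC for the regression.
`rSquare`: The calculated r-square for the regression.
`warn`: Currently unused.
`xIntercept`: The p-value for the intercept in a given regression.
One column per independent variable or interaction term: The p-value for that term in a given regression, or `NA` if the term is not part of the formula.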
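
The ranking described above can be approximated in base R. This is a sketch under stated assumptions, not the package's code: `score_model` is a hypothetical helper, `rSquare` is taken to be `1 - residual deviance / null deviance` (the values in the example output below are consistent with this, e.g. `1 - 136.51 / 138.27` is about `0.0127`), and the intercept is left out of the mean p-value because the row order in the example output appears to match that choice.

```r
# Dummy data for illustration only; the seed is arbitrary.
set.seed(1)
dt <- data.frame(
  dependent = sample(c(0, 1), 100, replace = TRUE),
  ind_1 = runif(100),
  ind_2 = runif(100)
)

# Hypothetical helper: fit one model and compute its ranking score.
score_model <- function(formula_text, data, family = "binomial") {
  fit <- glm(as.formula(formula_text), family = family, data = data)
  # Assumed definition: deviance-based pseudo r-square
  r_square <- 1 - fit$deviance / fit$null.deviance
  # p-values are the fourth column of the coefficient table
  p_values <- coef(summary(fit))[, 4]
  # Assumption: the intercept is excluded from the mean p-value
  p_values <- p_values[names(p_values) != "(Intercept)"]
  data.frame(
    formula = formula_text,
    aic = AIC(fit),
    bic = BIC(fit),
    rSquare = r_square,
    score = r_square / mean(p_values)
  )
}

formulas <- c(
  "dependent ~ ind_1",
  "dependent ~ ind_2",
  "dependent ~ ind_1 + ind_2"
)
results <- do.call(rbind, lapply(formulas, score_model, data = dt))

# Highest score first, mirroring the sort order described above
results <- results[order(results$score, decreasing = TRUE), ]
```

Dividing by the mean p-value rewards models whose terms are individually significant, so a model with a slightly lower `rSquare` can still rank above a larger model padded with weak terms.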

Examples

# Creating dummy data
dt <- data.frame("dependent" = sample(c(0, 1), 100, replace = TRUE),
                 "ind_1" = runif(100, 0, 1),
                 "ind_2" = runif(100, 0, 1),
                 "ind_3" = runif(100, 0, 1),
                 "ind_4" = runif(100, 0, 1))

# Without interaction terms and multithreading
## Two top results
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
1, 4, "binomial", 2)
#> Warning: Missing 'interactions' argument. Defaulting to FALSE.
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Assembling regresions..."
#> [1] "Creating 15 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 15 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#> 
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.4949  -1.1981   0.9401   1.1158   1.3594  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -0.5011     0.5154  -0.972    0.331
#> ind_1         0.6554     0.7359   0.891    0.373
#> ind_4         0.7032     0.7656   0.918    0.358
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 138.27  on 99  degrees of freedom
#> Residual deviance: 136.51  on 97  degrees of freedom
#> AIC: 142.51
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#> 
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.462  -1.221   0.939   1.128   1.372  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -0.5605     0.5882  -0.953    0.341
#> ind_1         0.6461     0.7374   0.876    0.381
#> ind_2         0.1348     0.6406   0.210    0.833
#> ind_4         0.6959     0.7662   0.908    0.364
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 138.27  on 99  degrees of freedom
#> Residual deviance: 136.46  on 96  degrees of freedom
#> AIC: 144.46
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#>                                         formula      aic      bic rSquare warn
#>  1:                 dependent ~ + ind_1 + ind_4 142.5053 150.3208 0.01276   NA
#>  2:         dependent ~ + ind_1 + ind_2 + ind_4 144.4609 154.8816 0.01308   NA
#>  3:         dependent ~ + ind_1 + ind_3 + ind_4 144.5023 154.9230 0.01278   NA
#>  4:                         dependent ~ + ind_4 141.3038 146.5141 0.00698   NA
#>  5: dependent ~ + ind_1 + ind_2 + ind_3 + ind_4 146.4584 159.4842 0.01310   NA
#>  6:                         dependent ~ + ind_1 141.3572 146.5675 0.00660   NA
#>  7:                 dependent ~ + ind_2 + ind_4 143.2334 151.0489 0.00749   NA
#>  8:                 dependent ~ + ind_3 + ind_4 143.2697 151.0852 0.00723   NA
#>  9:                 dependent ~ + ind_1 + ind_2 143.2936 151.1092 0.00706   NA
#> 10:         dependent ~ + ind_2 + ind_3 + ind_4 145.2018 155.6225 0.00772   NA
#> 11:                 dependent ~ + ind_1 + ind_3 143.3571 151.1726 0.00660   NA
#> 12:         dependent ~ + ind_1 + ind_2 + ind_3 145.2936 155.7143 0.00706   NA
#> 13:                 dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085   NA
#> 14:                         dependent ~ + ind_2 142.1719 147.3823 0.00070   NA
#> 15:                         dependent ~ + ind_3 142.2475 147.4578 0.00016   NA
#>     xIntercept     ind_1     ind_2     ind_3     ind_4
#>  1:  0.3308671 0.3731085        NA        NA 0.3583815
#>  2:  0.3406195 0.3809417 0.8332921        NA 0.3637671
#>  3:  0.4019967 0.3824943        NA 0.9569377 0.3575983
#>  4:  0.6202945        NA        NA        NA 0.3285351
#>  5:  0.3955666 0.3899512 0.8339157 0.9594924 0.3630483
#>  6:  0.5776676 0.3414683        NA        NA        NA
#>  7:  0.5817247        NA 0.7908741        NA 0.3352401
#>  8:  0.6359655        NA        NA 0.8536337 0.3255300
#>  9:  0.5479413 0.3504516 0.8009956        NA        NA
#> 10:  0.5894053        NA 0.7944590 0.8589336 0.3322281
#> 11:  0.6558815 0.3471964        NA 0.9928694        NA
#> 12:  0.6091045 0.3558050 0.8010777 0.9959040        NA
#> 13:  0.9547657        NA 0.7578417 0.8886398        NA
#> 14:  0.9527065        NA 0.7552192        NA        NA
#> 15:  0.8802303        NA        NA 0.8827580        NA

## No top results (`multi` is not supplied, so it defaults to FALSE)
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
          minvar = 1, maxvar = 4, family = "binomial",
          topN = 0, interactions = FALSE)
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Assembling regresions..."
#> [1] "Creating 15 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 15 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#>                                         formula      aic      bic rSquare warn
#>  1:                 dependent ~ + ind_1 + ind_4 142.5053 150.3208 0.01276   NA
#>  2:         dependent ~ + ind_1 + ind_2 + ind_4 144.4609 154.8816 0.01308   NA
#>  3:         dependent ~ + ind_1 + ind_3 + ind_4 144.5023 154.9230 0.01278   NA
#>  4:                         dependent ~ + ind_4 141.3038 146.5141 0.00698   NA
#>  5: dependent ~ + ind_1 + ind_2 + ind_3 + ind_4 146.4584 159.4842 0.01310   NA
#>  6:                         dependent ~ + ind_1 141.3572 146.5675 0.00660   NA
#>  7:                 dependent ~ + ind_2 + ind_4 143.2334 151.0489 0.00749   NA
#>  8:                 dependent ~ + ind_3 + ind_4 143.2697 151.0852 0.00723   NA
#>  9:                 dependent ~ + ind_1 + ind_2 143.2936 151.1092 0.00706   NA
#> 10:         dependent ~ + ind_2 + ind_3 + ind_4 145.2018 155.6225 0.00772   NA
#> 11:                 dependent ~ + ind_1 + ind_3 143.3571 151.1726 0.00660   NA
#> 12:         dependent ~ + ind_1 + ind_2 + ind_3 145.2936 155.7143 0.00706   NA
#> 13:                 dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085   NA
#> 14:                         dependent ~ + ind_2 142.1719 147.3823 0.00070   NA
#> 15:                         dependent ~ + ind_3 142.2475 147.4578 0.00016   NA
#>     xIntercept     ind_1     ind_2     ind_3     ind_4
#>  1:  0.3308671 0.3731085        NA        NA 0.3583815
#>  2:  0.3406195 0.3809417 0.8332921        NA 0.3637671
#>  3:  0.4019967 0.3824943        NA 0.9569377 0.3575983
#>  4:  0.6202945        NA        NA        NA 0.3285351
#>  5:  0.3955666 0.3899512 0.8339157 0.9594924 0.3630483
#>  6:  0.5776676 0.3414683        NA        NA        NA
#>  7:  0.5817247        NA 0.7908741        NA 0.3352401
#>  8:  0.6359655        NA        NA 0.8536337 0.3255300
#>  9:  0.5479413 0.3504516 0.8009956        NA        NA
#> 10:  0.5894053        NA 0.7944590 0.8589336 0.3322281
#> 11:  0.6558815 0.3471964        NA 0.9928694        NA
#> 12:  0.6091045 0.3558050 0.8010777 0.9959040        NA
#> 13:  0.9547657        NA 0.7578417 0.8886398        NA
#> 14:  0.9527065        NA 0.7552192        NA        NA
#> 15:  0.8802303        NA        NA 0.8827580        NA

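# With interaction terms and one top result
## `multi` is not supplied here, so multithreading stays off (see the warnings below)
regsearch(dt, "dependent", c("ind_1", "ind_2", "ind_3", "ind_4"),
          minvar = 1, maxvar = 4, family = "binomial",
          topN = 1, interactions = TRUE)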
#> Warning: Missing 'multi' argument. Defaulting to FALSE.
#> [1] "Gathering variables..."
#> [1] "WARNING: Using interaction terms without multithreading may take a very long time"
#> [1] "Assembling regresions..."
#> [1] "Creating 385 formulas. Please be patient, this may take a while."
#> [1] "Creating regressions..."
#> [1] "Running 105 regressions. Please be patient, this may take a while."
#> [1] "Running regressions..."
#> 
#> Call:
#> glm(formula = as.formula(x), family = family, data = data)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.6655  -1.1921   0.8124   1.1342   1.4298  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   0.6204     0.7692   0.807    0.420
#> ind_3        -1.5993     1.3131  -1.218    0.223
#> ind_4        -1.5214     1.6048  -0.948    0.343
#> ind_3:ind_4   4.5781     2.8658   1.598    0.110
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 138.27  on 99  degrees of freedom
#> Residual deviance: 134.62  on 96  degrees of freedom
#> AIC: 142.62
#> 
#> Number of Fisher Scoring iterations: 4
#> 
#>                                        formula      aic      bic rSquare warn
#>   1:                 dependent ~ + ind_3*ind_4 142.6173 153.0380 0.02641   NA
#>   2:         dependent ~ + ind_3*ind_4 + ind_1 144.0950 157.1209 0.03019   NA
#>   3:         dependent ~ + ind_3*ind_4 + ind_2 144.4206 157.4464 0.02783   NA
#>   4: dependent ~ + ind_3*ind_4 + ind_1 + ind_2 145.9382 161.5692 0.03132   NA
#>   5:   dependent ~ + ind_1*ind_4 + ind_3*ind_4 145.6978 161.3288 0.03306   NA
#>  ---                                                                         
#> 101:         dependent ~ + ind_2*ind_3 + ind_1 147.2936 160.3194 0.00706   NA
#> 102:               dependent ~ + ind_2 + ind_3 144.1523 151.9678 0.00085   NA
#> 103:                       dependent ~ + ind_2 142.1719 147.3823 0.00070   NA
#> 104:                 dependent ~ + ind_2*ind_3 146.1511 156.5718 0.00085   NA
#> 105:                       dependent ~ + ind_3 142.2475 147.4578 0.00016   NA
#>      xIntercept     ind_1     ind_2     ind_3     ind_4 ind_1.ind_2 ind_1.ind_3
#>   1:  0.4199248        NA        NA 0.2232378 0.3431217          NA          NA
#>   2:  0.6678149 0.4705920        NA 0.2220420 0.3624368          NA          NA
#>   3:  0.5224838        NA 0.6576698 0.2091700 0.3202841          NA          NA
#>   4:  0.7486319 0.4880024 0.6923043 0.2097495 0.3405966          NA          NA
#>   5:  0.9716514 0.3596372        NA 0.2027834 0.7627338          NA          NA
#>  ---                                                                           
#> 101:  0.7407833 0.3561515 0.9149675 0.9962740        NA          NA          NA
#> 102:  0.9547657        NA 0.7578417 0.8886398        NA          NA          NA
#> 103:  0.9527065        NA 0.7552192        NA        NA          NA          NA
#> 104:  0.9503169        NA 0.8640432 0.9223537        NA          NA          NA
#> 105:  0.8802303        NA        NA 0.8827580        NA          NA          NA
#>      ind_1.ind_4 ind_2.ind_3 ind_2.ind_4 ind_3.ind_4 ind_3.ind_2 ind_4.ind_2
#>   1:          NA          NA          NA   0.1101492          NA          NA
#>   2:          NA          NA          NA   0.1273546          NA          NA
#>   3:          NA          NA          NA   0.1024724          NA          NA
#>   4:          NA          NA          NA   0.1194127          NA          NA
#>   5:   0.5303457          NA          NA          NA          NA          NA
#>  ---                                                                        
#> 101:          NA   0.9935548          NA          NA          NA          NA
#> 102:          NA          NA          NA          NA          NA          NA
#> 103:          NA          NA          NA          NA          NA          NA
#> 104:          NA   0.9717019          NA          NA          NA          NA
#> 105:          NA          NA          NA          NA          NA          NA
#>      ind_4.ind_3
#>   1:          NA
#>   2:          NA
#>   3:          NA
#>   4:          NA
#>   5:    0.107964
#>  ---            
#> 101:          NA
#> 102:          NA
#> 103:          NA
#> 104:          NA
#> 105:          NA