If you have a dataset you'd like to break into chunks and fit a model for each chunk, and you don't want to copy-paste the same code over and over, or use a clunky for loop, read on. You can do it in less than 10 lines of code using tidyr::nest and purrr::map.
Getting Some Data
Let's begin by generating some data using the ravelRy package (shameless self-plug), which makes knitting and crochet patterns available via Ravelry.com's API. You can follow the steps here for installing the package, creating a Ravelry developer account, and authenticating via the package.
To create a dataset, let's pass four different search terms to the ravelRy::search_patterns function using purrr::map: scarf, hat, sweater, and blanket. For each, let's restrict to crochet patterns, sort by popularity, and retrieve the top 200. The returned ids can then be used to get pattern details using the function get_patterns.
```r
library(ravelRy)
library(tidyverse)

queries <- data.frame(query = c('scarf', 'hat', 'sweater', 'blanket')) %>%
  mutate(results = map(query, ~ search_patterns(query = ., craft = 'crochet',
                                                page_size = 200, sort = 'popularity')))

ids <- queries %>% unnest(results)

pattern_details <- get_patterns(ids = ids$id)

patterns <- ids %>%
  select(query, id) %>%
  left_join(pattern_details, by = 'id') %>%
  mutate(yardage = as.numeric(yardage)) %>%
  filter(!is.na(yardage), !is.na(difficulty_average))
```
Looking at the Data
The relationship of interest here is the impact of pattern yardage on the average pattern difficulty score. Are bigger projects also more challenging?
```r
yardage_boxplot <- patterns %>%
  ggplot(aes(x = yardage, y = query)) +
  geom_boxplot() +
  labs(title = 'Yardage by search term')

difficulty_boxplot <- patterns %>%
  ggplot(aes(x = difficulty_average, y = query)) +
  geom_boxplot() +
  labs(title = 'Average difficulty by search term')

yardage_difficulty_scatter <- patterns %>%
  ggplot(aes(x = yardage, y = difficulty_average, color = query)) +
  geom_point() +
  facet_wrap(~query) +
  labs(title = 'Difficulty and yardage by search term') +
  theme(legend.position = 'none')

gridExtra::grid.arrange(
  yardage_boxplot, difficulty_boxplot, yardage_difficulty_scatter,
  padding = unit(0.15, "line"),
  heights = c(2, 3),
  layout_matrix = rbind(c(1, 2), c(3))
)
```
There are various other data cleaning steps (such as outlier removal, judging by those scatter plots) and transformations that should be considered here, but for the sake of demonstrating how to fit the models, we'll breeze past those.
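For example, one such cleaning step might be trimming extreme yardage values within each search term. This is just a sketch (not applied in this post), and the 99th-percentile cutoff is an arbitrary choice for illustration:

```r
library(dplyr)

# Hypothetical cleaning step: drop yardage outliers within each query group.
# The 0.99 quantile cutoff is arbitrary and shown only as an example.
patterns_trimmed <- patterns %>%
  group_by(query) %>%
  filter(yardage <= quantile(yardage, 0.99)) %>%
  ungroup()
```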
Nesting the Data and Fitting the Models
Next, we can nest the data under each query, split it into training and testing subsets, fit a linear regression model, and calculate predictions using tidyr, rsample, and purrr.
```r
library(rsample)

set.seed(1234)

nested_fit <- patterns %>%
  select(query, id, yardage, difficulty_average) %>%
  nest(-query) %>%
  mutate(split = map(data, ~ initial_split(., prop = 8/10)),
         train = map(split, ~ training(.)),
         test = map(split, ~ testing(.)),
         fit = map(train, ~ lm(difficulty_average ~ yardage, data = .)),
         pred = map2(.x = fit, .y = test, ~ predict(object = .x, newdata = .y)))
```
There is a lot to unpack here! First, I reduced the data down to just the columns of interest, and then nested them within the query value. You could also think about this as creating a single row for each of the four query values and sticking all of the corresponding data into a list-column called data. From there, you can manipulate that list-column using purrr.
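To see what nest does in isolation, here is a minimal toy example (the tibble here is made up, not the pattern data; nest(data = -query) is the newer spelling of the nest(-query) call used in this post):

```r
library(dplyr)
library(tidyr)

# a tiny stand-in for the pattern dataset
toy <- tibble(query = c('a', 'a', 'b', 'b'), x = 1:4)

# collapses to one row per query; the x values for each group
# are tucked into a list-column named data
toy %>% nest(data = -query)
```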
Next, I split each query's data into 80% training and 20% testing using rsample::initial_split; created the training and testing data; and fit the linear model using lm. Each step uses map to pass a list-column into a function. Finally, I used map2 to pass two list-columns into the predict function. You end up with a tibble that looks like this:
| query | data | split | train | test | fit | pred |
|---|---|---|---|---|---|---|
| scarf | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| hat | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| sweater | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| blanket | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
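If map2 is new to you, here is a toy illustration of how it walks two lists in parallel (made-up vectors, not the pattern data):

```r
library(purrr)

# map2 applies a function elementwise across two lists:
# .x comes from the first list, .y from the second
map2(list(1, 2, 3), list(10, 20, 30), ~ .x + .y)
```

In the nested-fit pipeline above, the same mechanism pairs each fitted model with its corresponding test set.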
To view the fit for each model, the function broom::glance can be mapped to each fit and unnested.
```r
library(broom)

nested_fit %>%
  mutate(glanced = map(fit, glance)) %>%
  select(query, glanced) %>%
  unnest(glanced)
## # A tibble: 4 x 13
##   query r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
##   <chr>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
## 1 scarf    0.0496        0.0421 0.777      6.68 1.09e- 2     1  -151.  307.
## 2 hat      0.0770        0.0680 0.881      8.51 4.35e- 3     1  -133.  273.
## 3 swea~    0.172         0.165  0.821     28.2  4.43e- 7     1  -168.  341.
## 4 blan~    0.184         0.176  1.09      21.9  9.45e- 6     1  -148.  303.
## # ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
```
The adj.r.squared values are quite low, but we do have a small dataset. Next, let's inspect the coefficients.
```r
nested_fit %>%
  mutate(tidied = map(fit, tidy)) %>%
  select(query, tidied) %>%
  unnest(tidied)
## # A tibble: 8 x 6
##   query   term        estimate std.error statistic  p.value
##   <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 scarf   (Intercept) 2.23     0.117         19.0  6.78e-39
## 2 scarf   yardage     0.000510 0.000134       3.81 2.14e- 4
## 3 hat     (Intercept) 2.01     0.128         15.7  8.27e-29
## 4 hat     yardage     0.00299  0.000514       5.81 7.30e- 8
## 5 sweater (Intercept) 2.66     0.128         20.7  6.24e-44
## 6 sweater yardage     0.000578 0.000109       5.31 4.43e- 7
## 7 blanket (Intercept) 2.56     0.176         14.6  3.71e-26
## 8 blanket yardage     0.000295 0.0000630      4.68 9.45e- 6
```
Test Set Prediction
Despite the models' low R-squared values, the coefficients are all significant. How good are they at predicting the test set values?
```r
nested_fit_evaluate <- nested_fit %>%
  select(query, test, pred) %>%
  unnest(test, pred)

nested_fit_evaluate %>%
  group_by(query) %>%
  summarise(total_ss = sum((difficulty_average - mean(difficulty_average))^2),
            residual_ss = sum((difficulty_average - pred)^2),
            n = n(),
            .groups = 'drop') %>%
  mutate(r.squared = 1 - (residual_ss/total_ss),
         adj.r.squared = 1 - (((1 - r.squared)*(n - 1))/(n - 1 - 1)))
## # A tibble: 4 x 6
##   query   total_ss residual_ss     n r.squared adj.r.squared
##   <chr>      <dbl>       <dbl> <int>     <dbl>         <dbl>
## 1 blanket     29.6        18.8    24    0.365        0.336
## 2 hat         17.3        12.3    25    0.289        0.259
## 3 scarf       16.1        13.9    32    0.132        0.103
## 4 sweater     30.0        29.1    34    0.0290      -0.00135
```
The blanket, hat, and scarf models do a pretty good job of predicting the test set values, but sweater is poor.
Why not just fit one big model?
How do linear models fit on nested data compare to one large model where the nesting value is a categorical independent variable? Guess what: it's the same!
```r
set.seed(2345)

split <- initial_split(patterns, prop = 8/10, strata = query)
pattern_train <- training(split)
pattern_test <- testing(split)

fit <- lm(difficulty_average ~ yardage*query, data = pattern_train)

fit %>% tidy()
## # A tibble: 8 x 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           2.42      0.135        17.9   5.87e-55
## 2 yardage               0.000340  0.0000499     6.80  3.21e-11
## 3 queryhat             -0.0481    0.177        -0.272 7.86e- 1
## 4 queryscarf           -0.0484    0.180        -0.269 7.88e- 1
## 5 querysweater          0.163     0.192         0.849 3.97e- 1
## 6 yardage:queryhat      0.000750  0.000340      2.21  2.79e- 2
## 7 yardage:queryscarf   -0.0000650 0.000132     -0.493 6.22e- 1
## 8 yardage:querysweater  0.000261  0.000127      2.05  4.06e- 2
```
In the single model, the formula is Y = B0 + B1X1 + B2X2 + B3X1X2 + error, where:

- Y is difficulty_average
- X1 is yardage
- X2 is query
- B0 is the intercept, B1 is the coefficient for yardage, B2 is the coefficient for query, and B3 is the coefficient for the interaction term
The big ugly formula, with query expanded into dummy variables and the fitted coefficients plugged in, would be:

Y = 2.42 + 0.000340\*yardage - 0.0481\*query(hat) - 0.0484\*query(scarf) + 0.163\*query(sweater) + 0.000750\*yardage\*query(hat) - 0.0000650\*yardage\*query(scarf) + 0.000261\*yardage\*query(sweater)

query(hat) is 1 when query = "hat" and 0 otherwise. So for hat, we can replace query(hat) with 1, set the other query(x) terms to zero, and simplify to:
Y = 2.42 + 0.000340\*yardage + -0.0481 + 0.000750\*yardage

And then move terms: Y = (2.42 - 0.0481) + (0.000340 + 0.000750)\*yardage

Finally (using the unrounded coefficients): Y = 2.376 + 0.00109006\*yardage
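Equivalently, you can recover hat's intercept and slope programmatically from the single model's coefficients (assuming fit is the interaction model fit above):

```r
cf <- coef(fit)

# hat's intercept: baseline intercept plus the hat dummy's offset
hat_intercept <- cf[['(Intercept)']] + cf[['queryhat']]

# hat's slope: baseline yardage slope plus the hat interaction term
hat_slope <- cf[['yardage']] + cf[['yardage:queryhat']]
```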
Basically, the coefficients for each query are the sum of the intercept and that query's term, plus the sum of the yardage and interaction terms. We get the same results if we run four models on the nested data:
```r
nested_fit_test <- pattern_train %>%
  select(query, id, yardage, difficulty_average) %>%
  nest(-query) %>%
  mutate(fit = map(data, ~ lm(difficulty_average ~ yardage, data = .)))

nested_fit_test %>%
  mutate(tidied = map(fit, tidy)) %>%
  select(query, tidied) %>%
  unnest(tidied)
## # A tibble: 8 x 6
##   query   term        estimate std.error statistic  p.value
##   <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 scarf   (Intercept) 2.38     0.105         22.7  1.07e-46
## 2 scarf   yardage     0.000275 0.000107       2.56 1.17e- 2
## 3 hat     (Intercept) 2.38     0.116         20.6  4.45e-38
## 4 hat     yardage     0.00109  0.000341       3.20 1.85e- 3
## 5 sweater (Intercept) 2.59     0.124         20.8  4.71e-44
## 6 sweater yardage     0.000600 0.000107       5.61 1.07e- 7
## 7 blanket (Intercept) 2.42     0.165         14.7  2.39e-26
## 8 blanket yardage     0.000340 0.0000610      5.56 2.34e- 7
```
So, it's really up to you whether you choose to split your data by some factor and fit multiple models, or simply introduce that factor as an interaction term. If you're looking to compare a particular coefficient across factor levels, splitting and fitting multiple models may be faster than extracting the terms from one large model.