If you have a dataset you'd like to break into chunks and fit a model for each chunk, and you don't want to copy-paste the same code over and over, or use a clunky for loop, read on. You can do it in less than 10 lines of code using tidyr::nest and purrr::map.
Getting Some Data
Let's begin by generating some data using the ravelRy package (shameless self-plug), which makes knitting and crochet patterns available via Ravelry.com's API. You can follow the steps here for installing the package, creating a Ravelry developer account, and authenticating via the package.
To create a dataset, let's pass four different search terms to the ravelRy::search_patterns function using purrr::map: scarf, hat, sweater, and blanket. For each, let's restrict to crochet patterns, sort by popularity, and retrieve the top 200. The returned ids can then be used to get pattern details using the function get_patterns.
```r
library(ravelRy)
library(tidyverse)

queries <- data.frame(query = c('scarf', 'hat', 'sweater', 'blanket')) %>%
  mutate(results = map(query, ~ search_patterns(query = ., craft = 'crochet',
                                                page_size = 200, sort = 'popularity')))

ids <- queries %>% unnest(results)

pattern_details <- get_patterns(ids = ids$id)

patterns <- ids %>%
  select(query, id) %>%
  left_join(pattern_details, by = 'id') %>%
  mutate(yardage = as.numeric(yardage)) %>%
  filter(!is.na(yardage), !is.na(difficulty_average))
```
Looking at the Data
The relationship of interest here is the impact of pattern yardage on the average pattern difficulty score. Are bigger projects also more challenging?
```r
yardage_boxplot <- patterns %>%
  ggplot(aes(x = yardage, y = query)) +
  geom_boxplot() +
  labs(title = 'Yardage by search term')

difficulty_boxplot <- patterns %>%
  ggplot(aes(x = difficulty_average, y = query)) +
  geom_boxplot() +
  labs(title = 'Average difficulty by search term')

yardage_difficulty_scatter <- patterns %>%
  ggplot(aes(x = yardage, y = difficulty_average, color = query)) +
  geom_point() +
  facet_wrap(~query) +
  labs(title = 'Difficulty and yardage by search term') +
  theme(legend.position = 'none')

gridExtra::grid.arrange(
  yardage_boxplot, difficulty_boxplot, yardage_difficulty_scatter,
  padding = unit(0.15, "line"),
  heights = c(2, 3),
  layout_matrix = rbind(c(1, 2), c(3))
)
```
There are various other data cleaning steps (such as outlier removal, judging by those scatter plots) and transformations that should be considered here, but for the sake of demonstrating how to fit the models, we'll breeze past those.
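For example, one such cleaning step might be trimming extreme yardage values within each search term. This is just a sketch (not applied in this post), and the 99th-percentile cutoff is an arbitrary choice for illustration:

```r
library(dplyr)

# Hypothetical cleaning step: drop yardage outliers within each query group.
# The 0.99 quantile cutoff is arbitrary and shown only as an example.
patterns_trimmed <- patterns %>%
  group_by(query) %>%
  filter(yardage <= quantile(yardage, 0.99)) %>%
  ungroup()
```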
Nesting the Data and Fitting the Models
Next, we can nest the data under each query, split it into training and testing subsets, fit a linear regression model, and calculate predictions using tidyr, rsample, and purrr.
```r
library(rsample)

set.seed(1234)

nested_fit <- patterns %>%
  select(query, id, yardage, difficulty_average) %>%
  nest(-query) %>%
  mutate(split = map(data, ~ initial_split(., prop = 8/10)),
         train = map(split, ~ training(.)),
         test = map(split, ~ testing(.)),
         fit = map(train, ~ lm(difficulty_average ~ yardage, data = .)),
         pred = map2(.x = fit, .y = test, ~ predict(object = .x, newdata = .y)))
```
There is a lot to unpack here! First, I reduced the data down to just the columns of interest, and then nested them within the query value. You could also think about this as creating a single row for each of the four query values and sticking all of the corresponding data into a list-column called data. From there, you can manipulate that list-column using purrr.
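To see what nest does in isolation, here is a minimal toy example (the tibble here is made up, not the pattern data; nest(data = -query) is the newer spelling of the nest(-query) call used in this post):

```r
library(dplyr)
library(tidyr)

# a tiny stand-in for the pattern dataset
toy <- tibble(query = c('a', 'a', 'b', 'b'), x = 1:4)

# collapses to one row per query; the x values for each group
# are tucked into a list-column named data
toy %>% nest(data = -query)
```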
Next, I split each query's data into 80% training and 20% testing using rsample::initial_split; created the training and testing data; and fit the linear model using lm. Each step uses map to pass a list-column into a function. Finally, I used map2 to pass two list-columns into the predict function. You end up with a tibble that looks like this:
| query | data | split | train | test | fit | pred |
|---|---|---|---|---|---|---|
| scarf | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| hat | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| sweater | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
| blanket | &lt;tibble&gt; | &lt;S3: rsplit&gt; | &lt;tibble&gt; | &lt;tibble&gt; | &lt;S3: lm&gt; | &lt;dbl&gt; |
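If map2 is new to you, here is a toy illustration of how it walks two lists in parallel (made-up vectors, not the pattern data):

```r
library(purrr)

# map2 applies a function elementwise across two lists:
# .x comes from the first list, .y from the second
map2(list(1, 2, 3), list(10, 20, 30), ~ .x + .y)
```

In the nested-fit pipeline above, the same mechanism pairs each fitted model with its corresponding test set.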
To view the fit for each model, the function broom::glance can be mapped to each fit and unnested.
```r
library(broom)

nested_fit %>%
  mutate(glanced = map(fit, glance)) %>%
  select(query, glanced) %>%
  unnest(glanced)
## # A tibble: 4 x 13
##   query r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC
##   <chr>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl>
## 1 scarf    0.0496        0.0421 0.777      6.68 1.09e- 2     1  -151.  307.
## 2 hat      0.0770        0.0680 0.881      8.51 4.35e- 3     1  -133.  273.
## 3 swea~    0.172         0.165  0.821     28.2  4.43e- 7     1  -168.  341.
## 4 blan~    0.184         0.176  1.09      21.9  9.45e- 6     1  -148.  303.
## # ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>
```
The adj.r.squared values are quite low, but we do have a small dataset. Next, let's inspect the coefficients.
```r
nested_fit %>%
  mutate(tidied = map(fit, tidy)) %>%
  select(query, tidied) %>%
  unnest(tidied)
## # A tibble: 8 x 6
##   query   term        estimate std.error statistic  p.value
##   <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 scarf   (Intercept) 2.23     0.117         19.0  6.78e-39
## 2 scarf   yardage     0.000510 0.000134       3.81 2.14e- 4
## 3 hat     (Intercept) 2.01     0.128         15.7  8.27e-29
## 4 hat     yardage     0.00299  0.000514       5.81 7.30e- 8
## 5 sweater (Intercept) 2.66     0.128         20.7  6.24e-44
## 6 sweater yardage     0.000578 0.000109       5.31 4.43e- 7
## 7 blanket (Intercept) 2.56     0.176         14.6  3.71e-26
## 8 blanket yardage     0.000295 0.0000630      4.68 9.45e- 6
```
Test Set Prediction
Despite the models' low R-squared values, the coefficients are all significant. How good are they at predicting the test set values?
```r
nested_fit_evaluate <- nested_fit %>%
  select(query, test, pred) %>%
  unnest(test, pred)

nested_fit_evaluate %>%
  group_by(query) %>%
  summarise(total_ss = sum((difficulty_average - mean(difficulty_average))^2),
            residual_ss = sum((difficulty_average - pred)^2),
            n = n(),
            .groups = 'drop') %>%
  mutate(r.squared = 1 - (residual_ss/total_ss),
         adj.r.squared = 1 - (((1 - r.squared)*(n - 1))/(n - 1 - 1)))
## # A tibble: 4 x 6
##   query   total_ss residual_ss     n r.squared adj.r.squared
##   <chr>      <dbl>       <dbl> <int>     <dbl>         <dbl>
## 1 blanket     29.6        18.8    24    0.365        0.336
## 2 hat         17.3        12.3    25    0.289        0.259
## 3 scarf       16.1        13.9    32    0.132        0.103
## 4 sweater     30.0        29.1    34    0.0290      -0.00135
```
The blanket, hat, and scarf models do a pretty good job of predicting the test set values, but sweater is poor.
Why not just fit one big model?
How do linear models fit on nested data compare to one large model where the nesting value is a categorical independent variable? Guess what: it's the same!
```r
set.seed(2345)

split <- initial_split(patterns, prop = 8/10, strata = query)
pattern_train <- training(split)
pattern_test <- testing(split)

fit <- lm(difficulty_average ~ yardage*query, data = pattern_train)

fit %>% tidy()
## # A tibble: 8 x 5
##   term                   estimate std.error statistic  p.value
##   <chr>                     <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           2.42      0.135        17.9   5.87e-55
## 2 yardage               0.000340  0.0000499     6.80  3.21e-11
## 3 queryhat             -0.0481    0.177        -0.272 7.86e- 1
## 4 queryscarf           -0.0484    0.180        -0.269 7.88e- 1
## 5 querysweater          0.163     0.192         0.849 3.97e- 1
## 6 yardage:queryhat      0.000750  0.000340      2.21  2.79e- 2
## 7 yardage:queryscarf   -0.0000650 0.000132     -0.493 6.22e- 1
## 8 yardage:querysweater  0.000261  0.000127      2.05  4.06e- 2
```
In the single model, the formula is Y = B0 + B1X1 + B2X2 + B3X1X2 + error, where:

- Y is difficulty_average
- X1 is yardage
- X2 is query
- B0 is the intercept, B1 is the coefficient for yardage, B2 is the coefficient for query, and B3 is the coefficient for the interaction term
The big ugly formula, with query expanded into dummy variables and the fitted coefficients plugged in, would be:

Y = 2.42 + 0.000340\*yardage - 0.0481\*query(hat) - 0.0484\*query(scarf) + 0.163\*query(sweater) + 0.000750\*yardage\*query(hat) - 0.0000650\*yardage\*query(scarf) + 0.000261\*yardage\*query(sweater)

query(hat) is 1 when query = "hat" and 0 otherwise. So for hat, we can replace query(hat) with 1, set the other query(x) terms to zero, and simplify to:
Y = 2.42 + 0.000340\*yardage + -0.0481 + 0.000750\*yardage

And then move terms: Y = (2.42 - 0.0481) + (0.000340 + 0.000750)\*yardage

Finally (using the unrounded coefficients): Y = 2.376 + 0.00109006\*yardage
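Equivalently, you can recover hat's intercept and slope programmatically from the single model's coefficients (assuming fit is the interaction model fit above):

```r
cf <- coef(fit)

# hat's intercept: baseline intercept plus the hat dummy's offset
hat_intercept <- cf[['(Intercept)']] + cf[['queryhat']]

# hat's slope: baseline yardage slope plus the hat interaction term
hat_slope <- cf[['yardage']] + cf[['yardage:queryhat']]
```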
Basically, the coefficients for each query are the sum of the intercept and that query's term, plus the sum of the yardage and interaction terms. We get the same results if we run four models on the nested data:
```r
nested_fit_test <- pattern_train %>%
  select(query, id, yardage, difficulty_average) %>%
  nest(-query) %>%
  mutate(fit = map(data, ~ lm(difficulty_average ~ yardage, data = .)))

nested_fit_test %>%
  mutate(tidied = map(fit, tidy)) %>%
  select(query, tidied) %>%
  unnest(tidied)
## # A tibble: 8 x 6
##   query   term        estimate std.error statistic  p.value
##   <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 scarf   (Intercept) 2.38     0.105         22.7  1.07e-46
## 2 scarf   yardage     0.000275 0.000107       2.56 1.17e- 2
## 3 hat     (Intercept) 2.38     0.116         20.6  4.45e-38
## 4 hat     yardage     0.00109  0.000341       3.20 1.85e- 3
## 5 sweater (Intercept) 2.59     0.124         20.8  4.71e-44
## 6 sweater yardage     0.000600 0.000107       5.61 1.07e- 7
## 7 blanket (Intercept) 2.42     0.165         14.7  2.39e-26
## 8 blanket yardage     0.000340 0.0000610      5.56 2.34e- 7
```
So, it's really up to you whether you choose to split your data by some factor and fit multiple models, or simply introduce that factor as an interaction term. If you're looking to compare a particular coefficient across factor levels, splitting and fitting multiple models may be faster than extracting the terms from one large model.