# Chapter 9 The signal: model functions

suppressPackageStartupMessages(library(tidyverse))
Wage <- ISLR::Wage
NCI60 <- ISLR::NCI60
baseball <- Lahman::Teams %>% tbl_df %>%
select(runs=R, hits=H)
esoph <- as_tibble(esoph) %>%
mutate(agegp = as.character(agegp))
titanic <- na.omit(titanic::titanic_train)

## 9.1 Linear Quantile Regression

The idea here is to model $Q(\tau)=\beta_0(\tau) + \beta_1(\tau) X_1 + \cdots + \beta_p(\tau) X_p,$ where $$Q(\tau)$$ is the $$\tau$$-quantile. In other words, each quantile level gets its own line, and are each fit independently of each other.

Here are the 0.25-, 0.5-, and 0.75-quantile regression lines for the baseball data:

ggplot(baseball, aes(hits, runs)) +
geom_point(alpha=0.1, colour="orange") +
geom_quantile(colour="black") +
theme_bw() +
labs(x="Number of Hits (X)",
y="Number of Runs (Y)")
## Smoothing formula not specified. Using: y ~ x I did this easily with ggplot2, just by adding a layer geom_quantile to my scatterplot, specifying the quantile levels with the quantiles= argument. We could also use the function rq in the quantreg package in R:

(fit_rq <- quantreg::rq(runs ~ hits, data=baseball, tau=c(0.25, 0.5, 0.75)))
## Call:
## quantreg::rq(formula = runs ~ hits, tau = c(0.25, 0.5, 0.75),
##     data = baseball)
##
## Coefficients:
##                tau= 0.25 tau= 0.50  tau= 0.75
## (Intercept) -117.9373297 8.3603239 62.8059701
## hits           0.5531335 0.4931174  0.4925373
##
## Degrees of freedom: 2895 total; 2893 residual

If we were to again focus on the two teams (one with 1000 hits, and one with 1500 hits), we have (by evaluating the above three lines):

predict(fit_rq, newdata=data.frame(hits=c(1000, 1500)))
##   tau= 0.25 tau= 0.50 tau= 0.75
## 1  435.1962  501.4777  555.3433
## 2  711.7629  748.0364  801.6119

So, we could say that the team with 1000 hits:

• is estimated to have a 50% chance to have between 434 and 555 runs;
• has a 25% chance of achieving over 555 runs;
• has a 25% chance of getting less than 434 runs;
• would typically get 501 runs (median);

amongst other things.

### 9.1.1 Exercise

• Get a 95% prediction interval using linear quantile regression, with Y=R (number of runs), X=H (number of hits), when X=1500.
• What about a 95% PI using kNN, going back to the earlier example we did?

### 9.1.2 Problem: Crossing quantiles

Because each quantile is allowed to have its own line, some of these lines might cross, giving an invalid result. Here is an example with the iris data set, fitting the 0.2- and 0.3-quantiles:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
geom_point(alpha=0.25, colour="orange") +
geom_quantile(aes(colour="0.2"), quantiles=0.2) +
geom_quantile(aes(colour="0.3"), quantiles=0.3) +
scale_colour_discrete("Quantile\nLevel") +
theme_bw() +
labs(x="Sepal Length",
y="Sepal Width")
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x fit_iris <- quantreg::rq(Sepal.Width ~ Sepal.Length, data=iris, tau=2:3/10)
b <- coef(fit_iris)
at8 <- round(predict(fit_iris, newdata=data.frame(Sepal.Length=8)), 2)

Quantile estimates of Sepal Width for plants with Sepal Length less than 7.3 are valid, but otherwise, are not. For example, for plants with a Sepal Length of 8, this model predicts 30% of such plants to have a Sepal Width of less than 2.75, but only 20% of such plants should have Sepal Width less than 2.82. This is an illogical statement.

There have been several “adjustments” proposed to ensure that this doesn’t happen (see below), but ultimately, this suggests an inadequacy in the model assumptions. Luckily, this usually only happens at extreme values of the predictor space, and/or for large quantile levels, so is usually not a problem.

• Bondell HD, Reich BJ, Wang H. Noncrossing quantile regression curve estimation. Biometrika. 2010;97(4):825-838.
• Dette H, Volgushev S. Non-crossing non-parametric estimates of quantile curves. J R Stat Soc Ser B Stat Methodol. 2008;70(3):609-627.
• Tokdar ST, Kadane JB. Simultaneous linear quantile regression: a semiparametric Bayesian approach. Bayesian Anal. 2011;6(4):1-22.

### 9.1.3 Problem: Upper quantiles

Estimates of higher quantiles usually become worse for large/small values of $$\tau$$. This is especially true when data are heavy-tailed.

## 9.2 Concepts

• The conditional (population) variance can be estimated by regressing squared residuals as the response. This means that we have to (1) estimate a mean model function using any regression technique, (2) obtain the residuals, then (3) regress the squared residuals against the same predictors as in step (1).
• The tau-quantile (for tau between 0 and 1) is a number that will be exceeded (1-tau)*100% of the time. That is, a proportion of tau outcomes will lie below the tau-quantile (on average).
• The median is a special case of this when tau=0.5.
• The mean is (1) a measure of central tendency of a distribution, and (2) a number useful for calculating totals (for example, multiply by n to get an estimate of the sum of n outcomes).
• Quantile regression is useful for:
• for high-quantiles: giving us a conservative estimate of the outcome – a number that is deliberately larger than expected. For example: a “worst-case scenario” for next month’s expense / a value we can think of as having to pay “at most”.
• for low-quantiles: giving us a liberal estimate of the outcome – a number that is deliberately smaller than expected. For example: a “best-case scenario” for next month’s expense / a value we can think of as having to pay “at least”.
• for medians: giving us an outcome that you can put to a coin flip as to whether or not you’ll “make the cut”. For example: a coin flip would tell you whether or not you’ll exceed the median salary for your new job; a coin flip would tell you whether or not you’ll have to wait longer than the median wait time when sitting in the hospital.
• You can estimate conditional quantiles:
• from a GLM by plugging in the estimated distribution parameter(s) (just the mean in the case of Bernoulli or Poisson) to get the specific distribution, then extracting the desired quantile from the distribution.
• using a local regression method by obtaining an estimate of the univariate quantile that results from the relevant subsample of y occuring near a particular x value.
• A 90% prediction band is the region between the 0.95- and 0.05-quantile model functions (similar calculation can be done for any level of confidence). The corresponding prediction interval for an outcome at a particular x can be obtained by evaluating the two quantile model functions.