# From Linear Models to Machine Learning

## Preview text

From Linear Models to Machine Learning
A Modern View of Statistical Regression and Classification
Norman Matloff, University of California, Davis

Contents

Preface

1 Setting the Stage

1.1 Example: Predicting Bike-Sharing Activity

1.2 Example of the Prediction Goal: Bodyfat

1.3 Example of the Description Goal: Who Clicks Web Ads?

1.4 Optimal Prediction

1.5 A Note About E(), Samples and Populations

1.6 Example: Do Baseball Players Gain Weight As They Age?

1.6.1 Prediction vs. Description

1.6.2 A First Estimator, Using a Nonparametric Approach

1.6.3 A Possibly Better Estimator, Using a Linear Model

1.7 Parametric vs. Nonparametric Models

1.8 Example: Click-Through Rate

1.9 Several Predictor Variables

1.9.1 Multipredictor Linear Models

1.9.1.1 Estimation of Coefficients

1.9.1.2 The Description Goal

1.9.2 Nonparametric Regression Estimation: k-NN

1.9.2.1 Looking at Nearby Points

1.9.3 Measures of Nearness

1.9.3.1 The k-NN Method

1.9.4 The Code

1.10 After Fitting a Model, How Do We Use It for Prediction?

1.10.1 Parametric Settings

1.10.2 Nonparametric Settings

1.11 Underfitting, Overfitting, Bias and Variance

1.11.1 Intuition

1.11.2 Cross-Validation

1.11.3 Linear Model Case

1.11.3.1 The Code

1.11.3.2 Matrix Partitioning

1.11.3.3 Applying the Code

1.11.4 k-NN Case

1.11.5 Choosing the Partition Sizes

1.12 Rough Rule of Thumb

1.13 Example: Bike-Sharing Data

1.13.1 Linear Modeling of µ(t)

1.13.2 Nonparametric Analysis

1.14 Interaction Terms

1.14.1 Example: Salaries of Female Programmers and Engineers

1.15 Classification Techniques

1.15.1 It’s a Regression Problem!

1.15.2 Example: Bike-Sharing Data

1.16 Crucial Advice: Don’t Automate, Participate!

1.17 Informal Use of Prediction

1.17.1 Example: Nonnegative Matrix Factorization

1.18 Some Properties of Conditional Expectation

1.18.1 Conditional Expectation As a Random Variable

1.18.2 The Law of Total Expectation

1.18.3 Law of Total Variance

1.18.4 Tower Property

1.18.5 Geometric View

1.19 Mathematical Complements

1.19.1 Indicator Random Variables

1.19.2 Mean Squared Error of an Estimator

1.19.3 µ(t) Minimizes Mean Squared Prediction Error

1.19.4 µ(t) Minimizes the Misclassification Rate

1.20 Computational Complements

1.20.1 CRAN Packages

1.20.2 The Functions tapply() and Its Cousins

1.20.3 Function Dispatch

1.21 Further Exploration: Data, Code and Math Problems

2 Linear Regression Models

2.1 Notation

2.2 The “Error Term”

2.3 Random- vs. Fixed-X Cases

2.4 Least-Squares Estimation

2.4.1 Motivation

2.4.2 Matrix Formulations

2.4.3 (2.17) in Matrix Terms

2.4.4 Using Matrix Operations to Minimize (2.17)

2.4.5 Models Without an Intercept Term

2.5 A Closer Look at lm() Output

2.5.1 Statistical Inference

2.6 Assumptions

2.6.1 Classical

2.6.2 Motivation: the Multivariate Normal Distribution Family

2.7 Unbiasedness and Consistency

2.7.1 β̂ Is Unbiased

2.7.2 Bias As an Issue/Nonissue

2.7.3 β̂ Is Consistent

2.8 Inference under Homoscedasticity

2.8.1 Review: Classical Inference on a Single Mean

2.8.2 Extension to the Regression Case

2.8.3 Example: Bike-Sharing Data

2.9 Collective Predictive Strength of the X(j)

2.9.1 Basic Properties

2.9.2 Definition of R²

2.9.3 Bias Issues

2.9.4 Adjusted-R²

2.9.5 The “Leaving-One-Out Method”

2.9.5.1 The Code

2.9.5.2 Example: Bike-Sharing Data

2.9.5.3 Another Use of loom(): the Jackknife

2.9.6 Other Measures

2.9.7 The Verdict

2.10 Significance Testing vs. Confidence Intervals

2.10.1 Example: Forest Cover Data

2.10.2 Example: Click Through Data

2.10.3 The Verdict

2.11 Bibliographic Notes

2.12 Mathematical Complements

2.12.1 Covariance Matrices

2.12.2 The Multivariate Normal Distribution Family

2.12.3 The Central Limit Theorem

2.12.4 Details on Models Without a Constant Term

2.12.5 Unbiasedness of the Least-Squares Estimator

2.12.6 Consistency of the Least-Squares Estimator

2.12.7 Biased Nature of S

2.12.8 µ(X) and ε Are Uncorrelated

2.12.9 Asymptotic (p + 1)-Variate Normality of β̂

2.12.10 The Geometry of Conditional Expectation

2.12.10.1 Random Variables As Inner Product Spaces

2.12.10.2 Projections

2.12.10.3 Conditional Expectations As Projections

2.13 Computational Complements

2.13.1 R Functions Relating to the Multivariate Normal Distribution Family

2.13.1.1 Example: Simulation Computation of a Bivariate Normal Quantity

2.13.2 Computation of R-Squared and Adjusted R-Squared

2.14 Further Exploration: Data, Code and Math Problems

3 The Assumptions in Practice

3.1 Normality Assumption

3.2 Independence Assumption: Don’t Overlook It

3.2.1 Estimation of a Single Mean

3.2.2 Inference on Linear Regression Coefficients

3.2.3 What Can Be Done?

3.2.4 Example: MovieLens Data

3.3 Dropping the Homoscedasticity Assumption

3.3.1 Robustness of the Homoscedasticity Assumption

3.3.2 Weighted Least Squares

3.3.3 A Procedure for Valid Inference

3.3.4 The Methodology

3.3.5 Simulation Test

3.3.6 Example: Bike-Sharing Data

3.3.7 Variance-Stabilizing Transformations

3.3.8 The Verdict

3.4 Computational Complements

3.4.1 The R merge() Function

3.5 Mathematical Complements

3.5.1 The Delta Method

3.5.2 Derivation of (3.16)

3.5.3 Distortion Due to Transformation

3.6 Bibliographic Notes

4 Nonlinear Models

4.1 Example: Enzyme Kinetics Model

4.2 Least-Squares Computation

4.2.1 The Gauss-Newton Method

4.2.2 Eicker-White Asymptotic Standard Errors

4.2.3 Example: Bike Sharing Data

4.2.4 The “Elephant in the Room”: Convergence Issues

4.2.5 Example: Eckerle4 NIST Data

4.2.6 The Verdict

4.3 The Generalized Linear Model (GLM)

4.3.1 Definition

4.3.2 GLM Computation

4.3.3 R’s glm() Function

4.4 GLM: the Logistic Model

4.4.1 Motivation

4.4.2 Example: Pima Diabetes Data

4.4.3 Interpretation of Coefficients

4.4.4 The predict() Function

4.4.5 Overall Prediction Accuracy

4.4.6 Linear Boundary

4.5 GLM: the Poisson Regression Model

4.6 Mathematical Complements

4.6.1 Maximum Likelihood Estimation

5 Multiclass Classification Problems

5.1 The Key Equations

5.2 Estimating the Functions µ_i(t)

5.3 How Do We Use Models for Prediction?

5.4 Misclassification Costs

5.5 One vs. All or All vs. All?

5.5.1 Which Is Better?

5.5.2 Example: Vertebrae Data

5.5.3 Intuition

5.5.4 Example: Letter Recognition Data

5.5.5 The Verdict

5.6 The Classical Approach: Fisher Linear Discriminant Analysis

5.6.1 Background

5.6.2 Derivation

5.6.3 Example: Vertebrae Data

5.6.3.1 LDA Code and Results

5.6.3.2 Comparison to kNN

5.7 Multinomial Logistic Model

5.8 The Issue of “Unbalanced (and Balanced) Data”

5.8.1 Why the Concern Regarding Balance?

5.8.2 A Crucial Sampling Issue

5.8.2.1 It All Depends on How We Sample

5.8.2.2 Remedies

5.9 Example: Letter Recognition

5.10 Mathematical Complements

5.10.1 Nonparametric Density Estimation

5.11 Computational Complements

5.11.1 R Code for OVA and AVA

5.11.2 regtools Code

5.12 Bibliographic Notes

5.13 Further Exploration: Data, Code and Math Problems

6 Model Fit: Assessment and Improvement

6.1 Aims of This Chapter