During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics.
【About the Authors】
Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.
【Table of Contents】
1 Introduction

2 Overview of Supervised Learning
2.1 Introduction; 2.2 Variable Types and Terminology; 2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors; 2.3.1 Linear Models and Least Squares; 2.3.2 Nearest-Neighbor Methods; 2.3.3 From Least Squares to Nearest Neighbors; 2.4 Statistical Decision Theory; 2.5 Local Methods in High Dimensions; 2.6 Statistical Models, Supervised Learning and Function Approximation; 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y); 2.6.2 Supervised Learning; 2.6.3 Function Approximation; 2.7 Structured Regression Models; 2.7.1 Difficulty of the Problem; 2.8 Classes of Restricted Estimators; 2.8.1 Roughness Penalty and Bayesian Methods; 2.8.2 Kernel Methods and Local Regression; 2.8.3 Basis Functions and Dictionary Methods; 2.9 Model Selection and the Bias–Variance Tradeoff; Bibliographic Notes; Exercises

3 Linear Methods for Regression
3.1 Introduction; 3.2 Linear Regression Models and Least Squares; 3.2.1 Example: Prostate Cancer; 3.2.2 The Gauss–Markov Theorem; 3.2.3 Multiple Regression from Simple Univariate Regression; 3.2.4 Multiple Outputs; 3.3 Subset Selection; 3.3.1 Best-Subset Selection; 3.3.2 Forward- and Backward-Stepwise Selection; 3.3.3 Forward-Stagewise Regression; 3.3.4 Prostate Cancer Data Example (Continued); 3.4 Shrinkage Methods; 3.4.1 Ridge Regression; 3.4.2 The Lasso; 3.4.3 Discussion: Subset Selection, Ridge Regression and the Lasso; 3.4.4 Least Angle Regression; 3.5 Methods Using Derived Input Directions; 3.5.1 Principal Components Regression; 3.5.2 Partial Least Squares; 3.6 Discussion: A Comparison of the Selection and Shrinkage Methods; 3.7 Multiple Outcome Shrinkage and Selection; 3.8 More on the Lasso and Related Path Algorithms; 3.8.1 Incremental Forward Stagewise Regression; 3.8.2 Piecewise-Linear Path Algorithms; 3.8.3 The Dantzig Selector; 3.8.4 The Grouped Lasso; 3.8.5 Further Properties of the Lasso; 3.8.6 Pathwise Coordinate Optimization; 3.9 Computational Considerations; Bibliographic Notes; Exercises

4 Linear Methods for Classification
4.1 Introduction; 4.2 Linear Regression of an Indicator Matrix; 4.3 Linear Discriminant Analysis; 4.3.1 Regularized Discriminant Analysis; 4.3.2 Computations for LDA; 4.3.3 Reduced-Rank Linear Discriminant Analysis; 4.4 Logistic Regression; 4.4.1 Fitting Logistic Regression Models; 4.4.2 Example: South African Heart Disease; 4.4.3 Quadratic Approximations and Inference; 4.4.4 L1 Regularized Logistic Regression; 4.4.5 Logistic Regression or LDA?; 4.5 Separating Hyperplanes; 4.5.1 Rosenblatt’s Perceptron Learning Algorithm; 4.5.2 Optimal Separating Hyperplanes; Bibliographic Notes; Exercises

5 Basis Expansions and Regularization
5.1 Introduction; 5.2 Piecewise Polynomials and Splines; 5.2.1 Natural Cubic Splines; 5.2.2 Example: South African Heart Disease (Continued); 5.2.3 Example: Phoneme Recognition; 5.3 Filtering and Feature Extraction; 5.4 Smoothing Splines; 5.4.1 Degrees of Freedom and Smoother Matrices; 5.5 Automatic Selection of the Smoothing Parameters; 5.5.1 Fixing the Degrees of Freedom; 5.5.2 The Bias–Variance Tradeoff; 5.6 Nonparametric Logistic Regression; 5.7 Multidimensional Splines; 5.8 Regularization and Reproducing Kernel Hilbert Spaces; 5.8.1 Spaces of Functions Generated by Kernels; 5.8.2 Examples of RKHS; 5.9 Wavelet Smoothing; 5.9.1 Wavelet Bases and the Wavelet Transform; 5.9.2 Adaptive Wavelet Filtering; Bibliographic Notes; Exercises; Appendix: Computational Considerations for Splines; Appendix: B-splines; Appendix: Computations for Smoothing Splines

6 Kernel Smoothing Methods
6.1 One-Dimensional Kernel Smoothers; 6.1.1 Local Linear Regression; 6.1.2 Local Polynomial Regression; 6.2 Selecting the Width of the Kernel; 6.3 Local Regression in IR^p; 6.4 Structured Local Regression Models in IR^p; 6.4.1 Structured Kernels; 6.4.2 Structured Regression Functions; 6.5 Local Likelihood and Other Models; 6.6 Kernel Density Estimation and Classification; 6.6.1 Kernel Density Estimation; 6.6.2 Kernel Density Classification; 6.6.3 The Naive Bayes Classifier; 6.7 Radial Basis Functions and Kernels; 6.8 Mixture Models for Density Estimation and Classification; 6.9 Computational Considerations; Bibliographic Notes; Exercises

7 Model Assessment and Selection
7.1 Introduction; 7.2 Bias, Variance and Model Complexity; 7.3 The Bias–Variance Decomposition; 7.3.1 Example: Bias–Variance Tradeoff; 7.4 Optimism of the Training Error Rate; 7.5 Estimates of In-Sample Prediction Error; 7.6 The Effective Number of Parameters; 7.7 The Bayesian Approach and BIC; 7.8 Minimum Description Length; 7.9 Vapnik–Chervonenkis Dimension; 7.9.1 Example (Continued); 7.10 Cross-Validation; 7.10.1 K-Fold Cross-Validation; 7.10.2 The Wrong and Right Way to Do Cross-validation; 7.10.3 Does Cross-Validation Really Work?; 7.11 Bootstrap Methods; 7.11.1 Example (Continued); 7.12 Conditional or Expected Test Error?; Bibliographic Notes; Exercises

8 Model Inference and Averaging
8.1 Introduction; 8.2 The Bootstrap and Maximum Likelihood Methods; 8.2.1 A Smoothing Example; 8.2.2 Maximum Likelihood Inference; 8.2.3 Bootstrap versus Maximum Likelihood; 8.3 Bayesian Methods; 8.4 Relationship Between the Bootstrap and Bayesian Inference; 8.5 The EM Algorithm; 8.5.1 Two-Component Mixture Model; 8.5.2 The EM Algorithm in General; 8.5.3 EM as a Maximization–Maximization Procedure; 8.6 MCMC for Sampling from the Posterior; 8.7 Bagging; 8.7.1 Example: Trees with Simulated Data; 8.8 Model Averaging and Stacking; 8.9 Stochastic Search: Bumping; Bibliographic Notes; Exercises

9 Additive Models, Trees, and Related Methods
9.1 Generalized Additive Models; 9.1.1 Fitting Additive Models; 9.1.2 Example: Additive Logistic Regression; 9.1.3 Summary; 9.2 Tree-Based Methods; 9.2.1 Background; 9.2.2 Regression Trees; 9.2.3 Classification Trees; 9.2.4 Other Issues; 9.2.5 Spam Example (Continued); 9.3 PRIM: Bump Hunting; 9.3.1 Spam Example (Continued); 9.4 MARS: Multivariate Adaptive Regression Splines; 9.4.1 Spam Example (Continued); 9.4.2 Example (Simulated Data); 9.4.3 Other Issues; 9.5 Hierarchical Mixtures of Experts; 9.6 Missing Data; 9.7 Computational Considerations; Bibliographic Notes; Exercises

10 Boosting and Additive Trees
10.1 Boosting Methods; 10.1.1 Outline of This Chapter; 10.2 Boosting Fits an Additive Model; 10.3 Forward Stagewise Additive Modeling; 10.4 Exponential Loss and AdaBoost; 10.5 Why Exponential Loss?; 10.6 Loss Functions and Robustness; 10.7 “Off-the-Shelf” Procedures for Data Mining; 10.8 Example: Spam Data; 10.9 Boosting Trees; 10.10 Numerical Optimization via Gradient Boosting; 10.10.1 Steepest Descent; 10.10.2 Gradient Boosting; 10.10.3 Implementations of Gradient Boosting; 10.11 Right-Sized Trees for Boosting; 10.12 Regularization; 10.12.1 Shrinkage; 10.12.2 Subsampling; 10.13 Interpretation; 10.13.1 Relative Importance of Predictor Variables; 10.13.2 Partial Dependence Plots; 10.14 Illustrations; 10.14.1 California Housing; 10.14.2 New Zealand Fish; 10.14.3 Demographics Data; Bibliographic Notes; Exercises

11 Neural Networks
11.1 Introduction; 11.2 Projection Pursuit Regression; 11.3 Neural Networks; 11.4 Fitting Neural Networks; 11.5 Some Issues in Training Neural Networks; 11.5.1 Starting Values; 11.5.2 Overfitting; 11.5.3 Scaling of the Inputs; 11.5.4 Number of Hidden Units and Layers; 11.5.5 Multiple Minima; 11.6 Example: Simulated Data; 11.7 Example: ZIP Code Data; 11.8 Discussion; 11.9 Bayesian Neural Nets and the NIPS 2003 Challenge; 11.9.1 Bayes, Boosting and Bagging; 11.9.2 Performance Comparisons; 11.10 Computational Considerations; Bibliographic Notes; Exercises

12 Support Vector Machines and Flexible Discriminants
12.1 Introduction; 12.2 The Support Vector Classifier; 12.2.1 Computing the Support Vector Classifier; 12.2.2 Mixture Example (Continued); 12.3 Support Vector Machines and Kernels; 12.3.1 Computing the SVM for Classification; 12.3.2 The SVM as a Penalization Method; 12.3.3 Function Estimation and Reproducing Kernels; 12.3.4 SVMs and the Curse of Dimensionality; 12.3.5 A Path Algorithm for the SVM Classifier; 12.3.6 Support Vector Machines for Regression; 12.3.7 Regression and Kernels; 12.3.8 Discussion; 12.4 Generalizing Linear Discriminant Analysis; 12.5 Flexible Discriminant Analysis; 12.5.1 Computing the FDA Estimates; 12.6 Penalized Discriminant Analysis; 12.7 Mixture Discriminant Analysis; 12.7.1 Example: Waveform Data; Bibliographic Notes; Exercises

13 Prototype Methods and Nearest-Neighbors
13.1 Introduction; 13.2 Prototype Methods; 13.2.1 K-means Clustering; 13.2.2 Learning Vector Quantization; 13.2.3 Gaussian Mixtures; 13.3 k-Nearest-Neighbor Classifiers; 13.3.1 Example: A Comparative Study; 13.3.2 Example: k-Nearest-Neighbors and Image Scene Classification; 13.3.3 Invariant Metrics and Tangent Distance; 13.4 Adaptive Nearest-Neighbor Methods; 13.4.1 Example; 13.4.2 Global Dimension Reduction for Nearest-Neighbors; 13.5 Computational Considerations; Bibliographic Notes; Exercises

14 Unsupervised Learning
14.1 Introduction; 14.2 Association Rules; 14.2.1 Market Basket Analysis; 14.2.2 The Apriori Algorithm; 14.2.3 Example: Market Basket Analysis; 14.2.4 Unsupervised as Supervised Learning; 14.2.5 Generalized Association Rules; 14.2.6 Choice of Supervised Learning Method; 14.2.7 Example: Market Basket Analysis (Continued); 14.3 Cluster Analysis; 14.3.1 Proximity Matrices; 14.3.2 Dissimilarities Based on Attributes; 14.3.3 Object Dissimilarity; 14.3.4 Clustering Algorithms; 14.3.5 Combinatorial Algorithms; 14.3.6 K-means; 14.3.7 Gaussian Mixtures as Soft K-means Clustering; 14.3.8 Example: Human Tumor Microarray Data; 14.3.9 Vector Quantization; 14.3.10 K-medoids; 14.3.11 Practical Issues; 14.3.12 Hierarchical Clustering; 14.4 Self-Organizing Maps; 14.5 Principal Components, Curves and Surfaces; 14.5.1 Principal Components; 14.5.2 Principal Curves and Surfaces; 14.5.3 Spectral Clustering; 14.5.4 Kernel Principal Components; 14.5.5 Sparse Principal Components; 14.6 Non-negative Matrix Factorization; 14.6.1 Archetypal Analysis; 14.7 Independent Component Analysis and Exploratory Projection Pursuit; 14.7.1 Latent Variables and Factor Analysis; 14.7.2 Independent Component Analysis; 14.7.3 Exploratory Projection Pursuit; 14.7.4 A Direct Approach to ICA; 14.8 Multidimensional Scaling; 14.9 Nonlinear Dimension Reduction and Local Multidimensional Scaling; 14.10 The Google PageRank Algorithm; Bibliographic Notes; Exercises

15 Random Forests
15.1 Introduction; 15.2 Definition of Random Forests; 15.3 Details of Random Forests; 15.3.1 Out of Bag Samples; 15.3.2 Variable Importance; 15.3.3 Proximity Plots; 15.3.4 Random Forests and Overfitting; 15.4 Analysis of Random Forests; 15.4.1 Variance and the De-Correlation Effect; 15.4.2 Bias; 15.4.3 Adaptive Nearest Neighbors; Bibliographic Notes; Exercises

16 Ensemble Learning
16.1 Introduction; 16.2 Boosting and Regularization Paths; 16.2.1 Penalized Regression; 16.2.2 The “Bet on Sparsity” Principle; 16.2.3 Regularization Paths, Over-fitting and Margins; 16.3 Learning Ensembles; 16.3.1 Learning a Good Ensemble; 16.3.2 Rule Ensembles; Bibliographic Notes; Exercises

17 Undirected Graphical Models
17.1 Introduction; 17.2 Markov Graphs and Their Properties; 17.3 Undirected Graphical Models for Continuous Variables; 17.3.1 Estimation of the Parameters when the Graph Structure is Known; 17.3.2 Estimation of the Graph Structure; 17.4 Undirected Graphical Models for Discrete Variables; 17.4.1 Estimation of the Parameters when the Graph Structure is Known; 17.4.2 Hidden Nodes; 17.4.3 Estimation of the Graph Structure; 17.4.4 Restricted Boltzmann Machines; Exercises

18 High-Dimensional Problems: p ≫ N
18.1 When p is Much Bigger than N; 18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids; 18.3 Linear Classifiers with Quadratic Regularization; 18.3.1 Regularized Discriminant Analysis; 18.3.2 Logistic Regression with Quadratic Regularization; 18.3.3 The Support Vector Classifier; 18.3.4 Feature Selection; 18.3.5 Computational Shortcuts When p ≫ N; 18.4 Linear Classifiers with L1 Regularization; 18.4.1 Application of Lasso to Protein Mass Spectroscopy; 18.4.2 The Fused Lasso for Functional Data; 18.5 Classification When Features are Unavailable; 18.5.1 Example: String Kernels and Protein Classification; 18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances; 18.5.3 Example: Abstracts Classification; 18.6 High-Dimensional Regression: Supervised Principal Components; 18.6.1 Connection to Latent-Variable Modeling; 18.6.2 Relationship with Partial Least Squares; 18.6.3 Pre-Conditioning for Feature Selection; 18.7 Feature Assessment and the Multiple-Testing Problem; 18.7.1 The False Discovery Rate; 18.7.2 Asymmetric Cutpoints and the SAM Procedure; 18.7.3 A Bayesian Interpretation of the FDR; 18.8 Bibliographic Notes; Exercises