Practical Data Science with R / Edition 1 available in Paperback
- ISBN-10: 1617291560
- ISBN-13: 9781617291562
- Pub. Date: 04/13/2014
- Publisher: Manning
Overview
Practical Data Science with R lives up to its name. It explains basic principles without the theoretical mumbo-jumbo and jumps right to the real use cases you'll face as you collect, curate, and analyze the data crucial to the success of your business. You'll apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the Book
Business analysts and developers are increasingly collecting, curating, analyzing, and reporting on crucial business data. The R language and its associated tools provide a straightforward way to tackle day-to-day data science tasks without a lot of academic theory or advanced mathematics.
Practical Data Science with R shows you how to apply the R programming language and useful statistical techniques to everyday business situations. Using examples from marketing, business intelligence, and decision support, it shows you how to design experiments (such as A/B tests), build predictive models, and present results to audiences of all levels.
This book is accessible to readers without a background in data science. Some familiarity with basic statistics, R, or another scripting language is assumed.
What's Inside
- Data science for the business professional
- Statistical analysis using the R language
- Project lifecycle, from planning to delivery
- Numerous instantly familiar use cases
- Keys to effective data presentations
About the Authors
Nina Zumel and John Mount are cofounders of a San Francisco-based data science consulting firm. Both hold PhDs from Carnegie Mellon and blog on statistics, probability, and computer science at win-vector.com.
Table of Contents
- The data science process
- Loading data into R
- Exploring data
- Managing data
- Choosing and evaluating models
- Memorization methods
- Linear and logistic regression
- Unsupervised methods
- Exploring advanced methods
- Documentation and deployment
- Producing effective presentations
Product Details
ISBN-13: | 9781617291562 |
---|---|
Publisher: | Manning |
Publication date: | 04/13/2014 |
Edition description: | 1st Edition |
Pages: | 416 |
Product dimensions: | 7.30(w) x 9.10(h) x 1.10(d) |
About the Author
John Mount co-founded Win-Vector, a data science consulting firm in San Francisco. He has a Ph.D. in computer science from Carnegie Mellon and over 15 years of applied experience in biotech research, online advertising, price optimization and finance. He contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.
Table of Contents
Foreword xv
Preface xvi
Acknowledgments xvii
About this book xviii
About the authors xxv
About the foreword authors xxvi
About the cover illustration xxvii
Part 1 Introduction to data science 1
1 The data science process 3
1.1 The roles in a data science project 4
Project roles 4
1.2 Stages of a data science project 6
Defining the goal 7
Data collection and management 8
Modeling 10
Model evaluation and critique 12
Presentation and documentation 14
Model deployment and maintenance 15
1.3 Setting expectations 16
Determining lower bounds on model performance 16
2 Starting with R and data 18
2.1 Starting with R 19
Installing R, tools, and examples 20
R programming 20
2.2 Working with data from files 29
Working with well-structured data from files or URLs 29
Using R with less-structured data 34
2.3 Working with relational databases 37
A production-size example 38
3 Exploring data 51
3.1 Using summary statistics to spot problems 53
Typical problems revealed by data summaries 54
3.2 Spotting problems using graphics and visualization 58
Visually checking distributions for a single variable 60
Visually checking relationships between two variables 70
4 Managing data 88
4.1 Cleaning data 88
Domain-specific data cleaning 89
Treating missing values 91
The vtreat package for automatically treating missing variables 95
4.2 Data transformations 98
Normalization 99
Centering and scaling 101
Log transformations for skewed and wide distributions 104
4.3 Sampling for modeling and validation 107
Test and training splits 108
Creating a sample group column 109
Record grouping 110
Data provenance 111
5 Data engineering and data shaping 113
5.1 Data selection 116
Subsetting rows and columns 116
Removing records with incomplete data 121
Ordering rows 124
5.2 Basic data transforms 128
Adding new columns 128
Other simple operations 133
5.3 Aggregating transforms 134
Combining many rows into summary rows 134
5.4 Multitable data transforms 137
Combining two or more ordered data frames quickly 137
Principal methods to combine data from multiple tables 143
5.5 Reshaping transforms 149
Moving data from wide to tall form 149
Moving data from tall to wide form 153
Data coordinates 158
Part 2 Modeling methods 161
6 Choosing and evaluating models 163
6.1 Mapping problems to machine learning tasks 164
Classification problems 165
Scoring problems 166
Grouping: working without known targets 167
Problem-to-method mapping 169
6.2 Evaluating models 170
Overfitting 170
Measures of model performance 174
Evaluating classification models 175
Evaluating scoring models 185
Evaluating probability models 187
6.3 Local interpretable model-agnostic explanations (LIME) for explaining model predictions 195
LIME: Automated sanity checking 197
Walking through LIME: A small example 197
LIME for text classification 204
Training the text classifier 208
Explaining the classifier's predictions 209
7 Linear and logistic regression 215
7.1 Using linear regression 216
Understanding linear regression 217
Building a linear regression model 221
Making predictions 222
Finding relations and extracting advice 228
Reading the model summary and characterizing coefficient quality 230
Linear regression takeaways 237
7.2 Using logistic regression 237
Understanding logistic regression 237
Building a logistic regression model 242
Making predictions 243
Finding relations and extracting advice from logistic models 248
Reading the model summary and characterizing coefficients 249
Logistic regression takeaways 256
7.3 Regularization 257
An example of quasi-separation 257
The types of regularized regression 262
Regularized regression with glmnet 263
8 Advanced data preparation 274
8.1 The purpose of the vtreat package 275
8.2 KDD and KDD Cup 2009 277
Getting started with KDD Cup 2009 data 278
The bull-in-the-china-shop approach 280
8.3 Basic data preparation for classification 282
The variable score frame 284
Properly using the treatment plan 288
8.4 Advanced data preparation for classification 290
Using mkCrossFrameCExperiment() 290
Building a model 292
8.5 Preparing data for regression modeling 297
8.6 Mastering the vtreat package 299
The vtreat phases 299
Missing values 301
Indicator variables 303
Impact coding 304
The treatment plan 305
The cross-frame 306
9 Unsupervised methods 311
9.1 Cluster analysis 312
Distances 313
Preparing the data 316
Hierarchical clustering with hclust 319
The k-means algorithm 332
Assigning new points to clusters 338
Clustering takeaways 340
9.2 Association rules 340
Overview of association rules 340
The example problem 342
Mining association rules with the arules package 343
Association rule takeaways 351
10 Exploring advanced methods 353
10.1 Tree-based methods 355
A basic decision tree 356
Using bagging to improve prediction 359
Using random forests to further improve prediction 361
Gradient-boosted trees 368
Tree-based model takeaways 376
10.2 Using generalized additive models (GAMs) to learn non-monotone relationships 376
Understanding GAMs 376
A one-dimensional regression example 378
Extracting the non-linear relationships 382
Using GAM on actual data 384
Using GAM for logistic regression 387
GAM takeaways 388
10.3 Solving "inseparable" problems using support vector machines 389
Using an SVM to solve a problem 390
Understanding support vector machines 395
Understanding kernel functions 397
Support vector machine and kernel methods takeaways 399
Part 3 Working in the real world 401
11 Documentation and deployment 403
11.1 Predicting buzz 405
11.2 Using R markdown to produce milestone documentation 406
What is R markdown? 407
Knitr technical details 409
Using knitr to document the Buzz data and produce the model 411
11.3 Using comments and version control for running documentation 414
Writing effective comments 414
Using version control to record history 416
Using version control to explore your project 422
Using version control to share work 424
11.4 Deploying models 428
Deploying demonstrations using Shiny 430
Deploying models as HTTP services 431
Deploying models by export 433
What to take away 435
12 Producing effective presentations 437
12.1 Presenting your results to the project sponsor 439
Summarizing the project's goals 440
Stating the project's results 442
Filling in the details 444
Making recommendations and discussing future work 446
Project sponsor presentation takeaways 446
12.2 Presenting your model to end users 447
Summarizing the project goals 447
Showing how the model fits user workflow 448
Showing how to use the model 450
End user presentation takeaways 452
12.3 Presenting your work to other data scientists 452
Introducing the problem 452
Discussing related work 453
Discussing your approach 454
Discussing results and future work 455
Peer presentation takeaways 457
Appendix A Starting with R and other tools 459
Appendix B Important statistical concepts 484
Appendix C Bibliography 519
Index 523