Practical Data Science with R / Edition 1

ISBN-10: 1617291560
ISBN-13: 9781617291562
Pub. Date: 04/13/2014
Publisher: Manning

Paperback

$49.99

Overview

Summary

Practical Data Science with R lives up to its name. It explains basic principles without the theoretical mumbo-jumbo and jumps right to the real use cases you'll face as you collect, curate, and analyze the data crucial to the success of your business. You'll apply the R programming language and statistical analysis techniques to carefully explained examples based in marketing, business intelligence, and decision support.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Book

Business analysts and developers are increasingly collecting, curating, analyzing, and reporting on crucial business data. The R language and its associated tools provide a straightforward way to tackle day-to-day data science tasks without a lot of academic theory or advanced mathematics.

Practical Data Science with R shows you how to apply the R programming language and useful statistical techniques to everyday business situations. Using examples from marketing, business intelligence, and decision support, it shows you how to design experiments (such as A/B tests), build predictive models, and present results to audiences of all levels.

This book is accessible to readers without a background in data science. Some familiarity with basic statistics, R, or another scripting language is assumed.

What's Inside
  • Data science for the business professional
  • Statistical analysis using the R language
  • Project lifecycle, from planning to delivery
  • Numerous instantly familiar use cases
  • Keys to effective data presentations

About the Authors

Nina Zumel and John Mount are cofounders of a San Francisco-based data science consulting firm. Both hold PhDs from Carnegie Mellon and blog on statistics, probability, and computer science at win-vector.com.

Table of Contents
  1. The data science process
  2. Loading data into R
  3. Exploring data
  4. Managing data
  5. Choosing and evaluating models
  6. Memorization methods
  7. Linear and logistic regression
  8. Unsupervised methods
  9. Exploring advanced methods
  10. Documentation and deployment
  11. Producing effective presentations

Product Details

ISBN-13: 9781617291562
Publisher: Manning
Publication date: 04/13/2014
Edition description: 1st Edition
Pages: 416
Product dimensions: 7.30(w) x 9.10(h) x 1.10(d)

About the Author

Nina Zumel co-founded Win-Vector, a data science consulting firm in San Francisco. She holds a Ph.D. in robotics from Carnegie Mellon and was a content developer for EMC's Data Science and Big Data Analytics Training Course. Nina also contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.

John Mount co-founded Win-Vector, a data science consulting firm in San Francisco. He has a Ph.D. in computer science from Carnegie Mellon and over 15 years of applied experience in biotech research, online advertising, price optimization and finance. He contributes to the Win-Vector Blog, which covers topics in statistics, probability, computer science, mathematics and optimization.

Table of Contents

Foreword xv

Preface xvi

Acknowledgments xvii

About this book xviii

About the authors xxv

About the foreword authors xxvi

About the cover illustration xxvii

Part 1 Introduction to data science 1

1 The data science process 3

1.1 The roles in a data science project 4

Project roles 4

1.2 Stages of a data science project 6

Defining the goal 7

Data collection and management 8

Modeling 10

Model evaluation and critique 12

Presentation and documentation 14

Model deployment and maintenance 15

1.3 Setting expectations 16

Determining lower bounds on model performance 16

2 Starting with R and data 18

2.1 Starting with R 19

Installing R, tools, and examples 20

R programming 20

2.2 Working with data from files 29

Working with well-structured data from files or URLs 29

Using R with less-structured data 34

2.3 Working with relational databases 37

A production-size example 38

3 Exploring data 51

3.1 Using summary statistics to spot problems 53

Typical problems revealed by data summaries 54

3.2 Spotting problems using graphics and visualization 58

Visually checking distributions for a single variable 60

Visually checking relationships between two variables 70

4 Managing data 88

4.1 Cleaning data 88

Domain-specific data cleaning 89

Treating missing values 91

The vtreat package for automatically treating missing variables 95

4.2 Data transformations 98

Normalization 99

Centering and scaling 101

Log transformations for skewed and wide distributions 104

4.3 Sampling for modeling and validation 107

Test and training splits 108

Creating a sample group column 109

Record grouping 110

Data provenance 111

5 Data engineering and data shaping 113

5.1 Data selection 116

Subsetting rows and columns 116

Removing records with incomplete data 121

Ordering rows 124

5.2 Basic data transforms 128

Adding new columns 128

Other simple operations 133

5.3 Aggregating transforms 134

Combining many rows into summary rows 134

5.4 Multitable data transforms 137

Combining two or more ordered data frames quickly 137

Principal methods to combine data from multiple tables 143

5.5 Reshaping transforms 149

Moving data from wide to tall form 149

Moving data from tall to wide form 153

Data coordinates 158

Part 2 Modeling methods 161

6 Choosing and evaluating models 163

6.1 Mapping problems to machine learning tasks 164

Classification problems 165

Scoring problems 166

Grouping: working without known targets 167

Problem-to-method mapping 169

6.2 Evaluating models 170

Overfitting 170

Measures of model performance 174

Evaluating classification models 175

Evaluating scoring models 185

Evaluating probability models 187

6.3 Local interpretable model-agnostic explanations (LIME) for explaining model predictions 195

LIME: Automated sanity checking 197

Walking through LIME: A small example 197

LIME for text classification 204

Training the text classifier 208

Explaining the classifier's predictions 209

7 Linear and logistic regression 215

7.1 Using linear regression 216

Understanding linear regression 217

Building a linear regression model 221

Making predictions 222

Finding relations and extracting advice 228

Reading the model summary and characterizing coefficient quality 230

Linear regression takeaways 237

7.2 Using logistic regression 237

Understanding logistic regression 237

Building a logistic regression model 242

Making predictions 243

Finding relations and extracting advice from logistic models 248

Reading the model summary and characterizing coefficients 249

Logistic regression takeaways 256

7.3 Regularization 257

An example of quasi-separation 257

The types of regularized regression 262

Regularized regression with glmnet 263

8 Advanced data preparation 274

8.1 The purpose of the vtreat package 275

8.2 KDD and KDD Cup 2009 277

Getting started with KDD Cup 2009 data 278

The bull-in-the-china-shop approach 280

8.3 Basic data preparation for classification 282

The variable score frame 284

Properly using the treatment plan 288

8.4 Advanced data preparation for classification 290

Using mkCrossFrameCExperiment() 290

Building a model 292

8.5 Preparing data for regression modeling 297

8.6 Mastering the vtreat package 299

The vtreat phases 299

Missing values 301

Indicator variables 303

Impact coding 304

The treatment plan 305

The cross-frame 306

9 Unsupervised methods 311

9.1 Cluster analysis 312

Distances 313

Preparing the data 316

Hierarchical clustering with hclust 319

The k-means algorithm 332

Assigning new points to clusters 338

Clustering takeaways 340

9.2 Association rules 340

Overview of association rules 340

The example problem 342

Mining association rules with the arules package 343

Association rule takeaways 351

10 Exploring advanced methods 353

10.1 Tree-based methods 355

A basic decision tree 356

Using bagging to improve prediction 359

Using random forests to further improve prediction 361

Gradient-boosted trees 368

Tree-based model takeaways 376

10.2 Using generalized additive models (GAMs) to learn non-monotone relationships 376

Understanding GAMs 376

A one-dimensional regression example 378

Extracting the non-linear relationships 382

Using GAM on actual data 384

Using GAM for logistic regression 387

GAM takeaways 388

10.3 Solving "inseparable" problems using support vector machines 389

Using an SVM to solve a problem 390

Understanding support vector machines 395

Understanding kernel functions 397

Support vector machine and kernel methods takeaways 399

Part 3 Working in the real world 401

11 Documentation and deployment 403

11.1 Predicting buzz 405

11.2 Using R markdown to produce milestone documentation 406

What is R markdown? 407

Knitr technical details 409

Using knitr to document the Buzz data and produce the model 411

11.3 Using comments and version control for running documentation 414

Writing effective comments 414

Using version control to record history 416

Using version control to explore your project 422

Using version control to share work 424

11.4 Deploying models 428

Deploying demonstrations using Shiny 430

Deploying models as HTTP services 431

Deploying models by export 433

What to take away 435

12 Producing effective presentations 437

12.1 Presenting your results to the project sponsor 439

Summarizing the project's goals 440

Stating the project's results 442

Filling in the details 444

Making recommendations and discussing future work 446

Project sponsor presentation takeaways 446

12.2 Presenting your model to end users 447

Summarizing the project goals 447

Showing how the model fits user workflow 448

Showing how to use the model 450

End user presentation takeaways 452

12.3 Presenting your work to other data scientists 452

Introducing the problem 452

Discussing related work 453

Discussing your approach 454

Discussing results and future work 455

Peer presentation takeaways 457

Appendix A Starting with R and other tools 459

Appendix B Important statistical concepts 484

Appendix C Bibliography 519

Index 523
