Advanced Analytics with Spark: Patterns for Learning from Data at Scale

eBook

$38.49 (original price $50.99; you save 25%)

Available on compatible NOOK devices, the free NOOK App, and in My Digital Library.

Overview

In the second edition of this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance.

If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.

With this book, you will:

  • Familiarize yourself with the Spark programming model
  • Become comfortable within the Spark ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public data sets
  • Discover which machine learning tools make sense for particular problems
  • Acquire code that can be adapted to many uses

Product Details

ISBN-13: 9781491972908
Publisher: O'Reilly Media, Incorporated
Publication date: 06/12/2017
Sold by: Barnes & Noble
Format: eBook
Pages: 280
File size: 5 MB

About the Author

Sandy Ryza develops algorithms for public transit at Remix. Previously, he was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project. He holds the Brown University computer science department's 2012 Twining award for "Most Chill".


Uri Laserson is an Assistant Professor of Genetics at the Icahn School of Medicine at Mount Sinai, where he develops scalable technology for genomics and immunology using the Hadoop ecosystem.


Sean Owen is Director of Data Science at Cloudera. He is an Apache Spark committer and PMC member, and was an Apache Mahout committer.


Josh Wills is the Head of Data Engineering at Slack and the founder of the Apache Crunch project, and he wrote a tweet about data scientists once.

Table of Contents

Foreword vii

Preface ix

1 Analyzing Big Data 1

The Challenges of Data Science 3

Introducing Apache Spark 4

About This Book 6

2 Introduction to Data Analysis with Scala and Spark 9

Scala for Data Scientists 10

The Spark Programming Model 11

Record Linkage 11

Getting Started: The Spark Shell and SparkContext 13

Bringing Data from the Cluster to the Client 18

Shipping Code from the Client to the Cluster 22

Structuring Data with Tuples and Case Classes 23

Aggregations 28

Creating Histograms 29

Summary Statistics for Continuous Variables 30

Creating Reusable Code for Computing Summary Statistics 31

Simple Variable Selection and Scoring 36

Where to Go from Here 37

3 Recommending Music and the Audioscrobbler Data Set 39

Data Set 40

The Alternating Least Squares Recommender Algorithm 41

Preparing the Data 44

Building a First Model 46

Spot Checking Recommendations 48

Evaluating Recommendation Quality 50

Computing AUC 51

Hyperparameter Selection 53

Making Recommendations 55

Where to Go from Here 56

4 Predicting Forest Cover with Decision Trees 59

Fast Forward to Regression 59

Vectors and Features 60

Training Examples 61

Decision Trees and Forests 62

Covtype Data Set 65

Preparing the Data 66

A First Decision Tree 67

Decision Tree Hyperparameters 71

Tuning Decision Trees 73

Categorical Features Revisited 75

Random Decision Forests 77

Making Predictions 79

Where to Go from Here 79

5 Anomaly Detection in Network Traffic with K-means Clustering 81

Anomaly Detection 82

K-means Clustering 82

Network Intrusion 83

KDD Cup 1999 Data Set 84

A First Take on Clustering 85

Choosing k 87

Visualization in R 90

Feature Normalization 91

Categorical Variables 94

Using Labels with Entropy 95

Clustering in Action 96

Where to Go from Here 97

6 Understanding Wikipedia with Latent Semantic Analysis 99

The Term-Document Matrix 100

Getting the Data 102

Parsing and Preparing the Data 102

Lemmatization 104

Computing the TF-IDFs 105

Singular Value Decomposition 107

Finding Important Concepts 109

Querying and Scoring with the Low-Dimensional Representation 112

Term-Term Relevance 113

Document-Document Relevance 115

Term-Document Relevance 116

Multiple-Term Queries 117

Where to Go from Here 119

7 Analyzing Co-occurrence Networks with GraphX 121

The MEDLINE Citation Index: A Network Analysis 122

Getting the Data 123

Parsing XML Documents with Scala's XML Library 125

Analyzing the MeSH Major Topics and Their Co-occurrences 127

Constructing a Co-occurrence Network with GraphX 129

Understanding the Structure of Networks 132

Connected Components 132

Degree Distribution 135

Filtering Out Noisy Edges 138

Processing Edge Triplets 139

Analyzing the Filtered Graph 140

Small-World Networks 142

Cliques and Clustering Coefficients 143

Computing Average Path Length with Pregel 144

Where to Go from Here 149

8 Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data 151

Getting the Data 152

Working with Temporal and Geospatial Data in Spark 153

Temporal Data with Joda Time and NScala Time 153

Geospatial Data with the Esri Geometry API and Spray 155

Exploring the Esri Geometry API 155

Intro to GeoJSON 157

Preparing the New York City Taxi Trip Data 159

Handling Invalid Records at Scale 160

Geospatial Analysis 164

Sessionization in Spark 167

Building Sessions: Secondary Sorts in Spark 168

Where to Go from Here 171

9 Estimating Financial Risk through Monte Carlo Simulation 173

Terminology 174

Methods for Calculating VaR 175

Variance-Covariance 175

Historical Simulation 175

Monte Carlo Simulation 175

Our Model 176

Getting the Data 177

Preprocessing 178

Determining the Factor Weights 181

Sampling 183

The Multivariate Normal Distribution 185

Running the Trials 186

Visualizing the Distribution of Returns 189

Evaluating Our Results 190

Where to Go from Here 192

10 Analyzing Genomics Data and the BDG Project 195

Decoupling Storage from Modeling 196

Ingesting Genomics Data with the ADAM CLI 198

Parquet Format and Columnar Storage 204

Predicting Transcription Factor Binding Sites from ENCODE Data 206

Querying Genotypes from the 1000 Genomes Project 213

Where to Go from Here 214

11 Analyzing Neuroimaging Data with PySpark and Thunder 217

Overview of PySpark 218

PySpark Internals 219

Overview and Installation of the Thunder Library 221

Loading Data with Thunder 222

Thunder Core Data Types 229

Categorizing Neuron Types with Thunder 231

Where to Go from Here 236

A Deeper into Spark 237

B Upcoming MLlib Pipelines API 247

Index 253
