Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning

by Valliappa Lakshmanan

Paperback (2nd ed.)

$79.99 

Overview

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build using Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud-native tools on GCP.

Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.

You'll learn how to:

  • Employ best practices in building highly scalable data and ML pipelines on Google Cloud
  • Automate and schedule data ingest using Cloud Run
  • Create and populate a dashboard in Data Studio
  • Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
  • Conduct interactive data exploration with BigQuery
  • Create a Bayesian model with Spark on Cloud Dataproc
  • Forecast time series and do anomaly detection with BigQuery ML
  • Aggregate within time windows with Dataflow
  • Train explainable machine learning models with Vertex AI
  • Operationalize ML with Vertex AI Pipelines

Product Details

ISBN-13: 9781098118952
Publisher: O'Reilly Media, Incorporated
Publication date: 05/03/2022
Edition description: 2nd ed.
Pages: 459
Product dimensions: 7.00(w) x 9.19(h) x 0.93(d)

About the Author

Valliappa (Lak) Lakshmanan is the director of analytics and AI solutions at Google Cloud, where he leads a team building cross-industry solutions to business problems. His mission is to democratize machine learning so that it can be done by anyone anywhere. Lak is the author or coauthor of Practical Machine Learning for Computer Vision; Machine Learning Design Patterns; Data Governance: The Definitive Guide; Google BigQuery: The Definitive Guide; and Data Science on the Google Cloud Platform.

Table of Contents

Preface xi

1 Making Better Decisions Based on Data 1

Many Similar Decisions 4

The Role of Data Scientists 5

Scrappy Environment 7

Full Stack Cloud Data Scientists 8

Collaboration 9

Best Practices 10

Simple to Complex Solutions 11

Cloud Computing 11

Serverless 12

A Probabilistic Decision 13

Probabilistic Approach 15

Probability Density Function 16

Cumulative Distribution Function 17

Choices Made 18

Choosing Cloud 19

Not a Reference Book 19

Getting Started with the Code 20

Agile Architecture for Data Science on Google Cloud 22

What Is Agile Architecture? 23

No-Code, Low-Code 23

Use Managed Services 24

Summary 25

Suggested Resources 26

2 Ingesting Data into the Cloud 29

Airline On-Time Performance Data 30

Knowability 31

Causality 31

Training-Serving Skew 32

Downloading Data 33

Hub-and-Spoke Architecture 34

Dataset Fields 35

Separation of Compute and Storage 37

Scaling Up 39

Scaling Out with Sharded Data 41

Scaling Out with Data-in-Place 43

Ingesting Data 46

Reverse Engineering a Web Form 46

Dataset Download 48

Exploration and Cleanup 50

Uploading Data to Google Cloud Storage 51

Loading Data into Google BigQuery 55

Advantages of a Serverless Columnar Database 55

Staging on Cloud Storage 57

Access Control 57

Ingesting CSV Files 61

Partitioning 62

Scheduling Monthly Downloads 63

Ingesting in Python 65

Cloud Run 71

Securing Cloud Run 72

Deploying and Invoking Cloud Run 74

Scheduling Cloud Run 75

Summary 76

Code Break 77

Suggested Resources 78

3 Creating Compelling Dashboards 81

Explain Your Model with Dashboards 83

Why Build a Dashboard First? 84

Accuracy, Honesty, and Good Design 86

Loading Data into Cloud SQL 88

Create a Google Cloud SQL Instance 89

Create Table of Data 92

Interacting with the Database 95

Querying Using BigQuery 96

Schema Exploration 96

Using Preview 97

Using Table Explorer 99

Creating BigQuery View 100

Building Our First Model 101

Contingency Table 101

Threshold Optimization 103

Building a Dashboard 106

Getting Started with Data Studio 107

Creating Charts 109

Adding End-User Controls 110

Showing Proportions with a Pie Chart 112

Explaining a Contingency Table 117

Modern Business Intelligence 119

Digitization 119

Natural Language Queries 120

Connected Sheets 122

Summary 123

Suggested Resources 123

4 Streaming Data: Publication and Ingest with Pub/Sub and Dataflow 125

Designing the Event Feed 126

Transformations Needed 127

Architecture 128

Getting Airport Information 129

Sharing Data 132

Time Correction 133

Apache Beam/Cloud Dataflow 135

Parsing Airports Data 136

Adding Time Zone Information 139

Converting Times to UTC 141

Correcting Dates 144

Creating Events 146

Reading and Writing to the Cloud 148

Running the Pipeline in the Cloud 150

Publishing an Event Stream to Cloud Pub/Sub 153

Speed-Up Factor 154

Get Records to Publish 155

How Many Topics? 156

Iterating Through Records 157

Building a Batch of Events 158

Publishing a Batch of Events 159

Real-Time Stream Processing 160

Streaming in Dataflow 160

Windowing a Pipeline 162

Streaming Aggregation 162

Using Event Timestamps 165

Executing the Stream Processing 166

Analyzing Streaming Data in BigQuery 168

Real-Time Dashboard 169

Summary 170

Suggested Resources 171

5 Interactive Data Exploration with Vertex AI Workbench 173

Exploratory Data Analysis 174

Exploration with SQL 177

Reading a Query Explanation 179

Exploratory Data Analysis in Vertex AI Workbench 184

Jupyter Notebooks 185

Creating a Notebook 186

Jupyter Commands 188

Installing Packages 188

Jupyter Magic for Google Cloud 189

Exploring Arrival Delays 190

Basic Statistics 191

Plotting Distributions 191

Quality Control 194

Arrival Delay Conditioned on Departure Delay 199

Evaluating the Model 204

Random Shuffling 204

Splitting by Date 205

Training and Testing 206

Summary 210

Suggested Resources 210

6 Bayesian Classifier with Apache Spark on Cloud Dataproc 211

MapReduce and the Hadoop Ecosystem 211

How MapReduce Works 212

Apache Hadoop 214

Google Cloud Dataproc 214

Need for Higher-Level Tools 216

Jobs, Not Clusters 217

Preinstalling Software 219

Quantization Using Spark SQL 221

JupyterLab on Cloud Dataproc 222

Independence Check Using BigQuery 223

Spark SQL in JupyterLab 225

Histogram Equalization 227

Bayesian Classification 231

Bayes in Each Bin 231

Evaluating the Model 233

Dynamically Resizing Clusters 234

Comparing to Single Threshold Model 235

Orchestration 238

Submitting a Spark Job 238

Workflow Template 238

Cloud Composer 239

Autoscaling 240

Serverless Spark 241

Summary 242

Suggested Resources 243

7 Logistic Regression Using Spark ML 245

Logistic Regression 246

How Logistic Regression Works 246

Spark ML Library 249

Getting Started with Spark Machine Learning 250

Spark Logistic Regression 251

Creating a Training Dataset 252

Training the Model 256

Predicting Using the Model 259

Evaluating a Model 260

Feature Engineering 263

Experimental Framework 263

Feature Selection 267

Feature Transformations 271

Feature Creation 274

Categorical Variables 278

Repeatable, Real Time 280

Summary 281

Suggested Resources 282

8 Machine Learning with BigQuery ML 283

Logistic Regression 283

Presplit Data 285

Interrogating the Model 286

Evaluating the Model 287

Scale and Simplicity 289

Nonlinear Machine Learning 290

XGBoost 290

Hyperparameter Tuning 292

Vertex AI AutoML Tables 294

Time Window Features 296

Taxi-Out Time 296

Compounding Delays 298

Causality 299

Time Features 300

Departure Hour 300

Transform Clause 302

Categorical Variable 303

Feature Cross 303

Summary 305

Suggested Resources 306

9 Machine Learning with TensorFlow in Vertex AI 309

Toward More Complex Models 310

Preparing BigQuery Data for TensorFlow 314

Reading Data into TensorFlow 315

Training and Evaluation in Keras 317

Model Function 317

Features 318

Inputs 320

Training the Keras Model 320

Saving and Exporting 322

Deep Neural Network 322

Wide-and-Deep Model in Keras 323

Representing Air Traffic Corridors 323

Bucketing 324

Feature Crossing 325

Wide-and-Deep Classifier 326

Deploying a Trained TensorFlow Model to Vertex AI 327

Concepts 328

Uploading Model 328

Creating Endpoint 330

Deploying Model to Endpoint 330

Invoking the Deployed Model 331

Summary 332

Suggested Resources 333

10 Getting Ready for MLOps with Vertex AI 335

Developing and Deploying Using Python 336

Writing model.py 337

Writing the Training Pipeline 338

Predefined Split 340

AutoML 341

Hyperparameter Tuning 343

Parameterize Model 344

Shorten Training Run 345

Metrics During Training 347

Hyperparameter Tuning Pipeline 347

Best Trial to Completion 349

Explaining the Model 350

Configuring Explanations Metadata 350

Creating and Deploying Model 352

Obtaining Explanations 352

Summary 354

Suggested Resources 355

11 Time-Windowed Features for Real-Time Machine Learning 357

Time Averages 357

Apache Beam and Cloud Dataflow 358

Reading and Writing 360

Time Windowing 362

Machine Learning Training 367

Machine Learning Dataset 367

Training the Model 373

Streaming Predictions 376

Reuse Transforms 377

Input and Output 379

Invoking Model 380

Reusing Endpoint 381

Batching Predictions 384

Streaming Pipeline 385

Writing to BigQuery 385

Executing Streaming Pipeline 386

Late and Out-of-Order Records 387

Possible Streaming Sinks 393

Summary 400

Suggested Resources 401

12 The Full Dataset 403

Four Years of Data 403

Creating Dataset 404

Training Model 409

Evaluation 411

Summary 417

Suggested Resources 417

Conclusion 419

Considerations for Sensitive Data Within Machine Learning Datasets 423

Index 431
