Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning
Paperback (2nd ed.)
Overview
Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.
You'll learn how to:
- Employ best practices in building highly scalable data and ML pipelines on Google Cloud
- Automate and schedule data ingest using Cloud Run
- Create and populate a dashboard in Data Studio
- Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
- Conduct interactive data exploration with BigQuery
- Create a Bayesian model with Spark on Cloud Dataproc
- Forecast time series and do anomaly detection with BigQuery ML
- Aggregate within time windows with Dataflow
- Train explainable machine learning models with Vertex AI
- Operationalize ML with Vertex AI Pipelines
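The skills above are taught through one running example: deciding whether to cancel a meeting based on the probability that a flight arrives late. The core decision rule can be sketched in a few lines of pure Python; the delay figures and the 15-minute/30% thresholds below are illustrative only (the book derives its thresholds from real US flight data):

```python
# Toy sketch of the book's running decision rule: cancel the meeting
# when the estimated probability of arriving >= 15 minutes late
# exceeds an acceptable risk of 30%. All numbers are illustrative.

def prob_late(arrival_delays, threshold_min=15):
    """Empirical probability that a flight arrives at least
    `threshold_min` minutes late, from historical arrival delays."""
    late = sum(1 for d in arrival_delays if d >= threshold_min)
    return late / len(arrival_delays)

def should_cancel(arrival_delays, risk_tolerance=0.30):
    """Cancel when the estimated lateness probability exceeds
    the acceptable risk."""
    return prob_late(arrival_delays) > risk_tolerance

# Hypothetical historical delays (minutes) for similar flights:
history = [-5, 0, 3, 8, 12, 16, 22, 45, -2, 7]
print(prob_late(history))
print(should_cancel(history))
```

In the book, this empirical probability is estimated at scale (BigQuery, Spark, Dataflow) and conditioned on features such as departure delay, rather than computed from a small in-memory list.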
Product Details
| ISBN-13: | 9781098118952 |
| --- | --- |
| Publisher: | O'Reilly Media, Incorporated |
| Publication date: | 05/03/2022 |
| Edition description: | 2nd ed. |
| Pages: | 459 |
| Product dimensions: | 7.00(w) x 9.19(h) x 0.93(d) |
Table of Contents
Preface xi
1 Making Better Decisions Based on Data 1
Many Similar Decisions 4
The Role of Data Scientists 5
Scrappy Environment 7
Full Stack Cloud Data Scientists 8
Collaboration 9
Best Practices 10
Simple to Complex Solutions 11
Cloud Computing 11
Serverless 12
A Probabilistic Decision 13
Probabilistic Approach 15
Probability Density Function 16
Cumulative Distribution Function 17
Choices Made 18
Choosing Cloud 19
Not a Reference Book 19
Getting Started with the Code 20
Agile Architecture for Data Science on Google Cloud 22
What Is Agile Architecture? 23
No-Code, Low-Code 23
Use Managed Services 24
Summary 25
Suggested Resources 26
2 Ingesting Data into the Cloud 29
Airline On-Time Performance Data 30
Knowability 31
Causality 31
Training-Serving Skew 32
Downloading Data 33
Hub-and-Spoke Architecture 34
Dataset Fields 35
Separation of Compute and Storage 37
Scaling Up 39
Scaling Out with Sharded Data 41
Scaling Out with Data-in-Place 43
Ingesting Data 46
Reverse Engineering a Web Form 46
Dataset Download 48
Exploration and Cleanup 50
Uploading Data to Google Cloud Storage 51
Loading Data into Google BigQuery 55
Advantages of a Serverless Columnar Database 55
Staging on Cloud Storage 57
Access Control 57
Ingesting CSV Files 61
Partitioning 62
Scheduling Monthly Downloads 63
Ingesting in Python 65
Cloud Run 71
Securing Cloud Run 72
Deploying and Invoking Cloud Run 74
Scheduling Cloud Run 75
Summary 76
Code Break 77
Suggested Resources 78
3 Creating Compelling Dashboards 81
Explain Your Model with Dashboards 83
Why Build a Dashboard First? 84
Accuracy, Honesty, and Good Design 86
Loading Data into Cloud SQL 88
Create a Google Cloud SQL Instance 89
Create Table of Data 92
Interacting with the Database 95
Querying Using BigQuery 96
Schema Exploration 96
Using Preview 97
Using Table Explorer 99
Creating BigQuery View 100
Building Our First Model 101
Contingency Table 101
Threshold Optimization 103
Building a Dashboard 106
Getting Started with Data Studio 107
Creating Charts 109
Adding End-User Controls 110
Showing Proportions with a Pie Chart 112
Explaining a Contingency Table 117
Modern Business Intelligence 119
Digitization 119
Natural Language Queries 120
Connected Sheets 122
Summary 123
Suggested Resources 123
4 Streaming Data: Publication and Ingest with Pub/Sub and Dataflow 125
Designing the Event Feed 126
Transformations Needed 127
Architecture 128
Getting Airport Information 129
Sharing Data 132
Time Correction 133
Apache Beam/Cloud Dataflow 135
Parsing Airports Data 136
Adding Time Zone Information 139
Converting Times to UTC 141
Correcting Dates 144
Creating Events 146
Reading and Writing to the Cloud 148
Running the Pipeline in the Cloud 150
Publishing an Event Stream to Cloud Pub/Sub 153
Speed-Up Factor 154
Get Records to Publish 155
How Many Topics? 156
Iterating Through Records 157
Building a Batch of Events 158
Publishing a Batch of Events 159
Real-Time Stream Processing 160
Streaming in Dataflow 160
Windowing a Pipeline 162
Streaming Aggregation 162
Using Event Timestamps 165
Executing the Stream Processing 166
Analyzing Streaming Data in BigQuery 168
Real-Time Dashboard 169
Summary 170
Suggested Resources 171
5 Interactive Data Exploration with Vertex AI Workbench 173
Exploratory Data Analysis 174
Exploration with SQL 177
Reading a Query Explanation 179
Exploratory Data Analysis in Vertex AI Workbench 184
Jupyter Notebooks 185
Creating a Notebook 186
Jupyter Commands 188
Installing Packages 188
Jupyter Magic for Google Cloud 189
Exploring Arrival Delays 190
Basic Statistics 191
Plotting Distributions 191
Quality Control 194
Arrival Delay Conditioned on Departure Delay 199
Evaluating the Model 204
Random Shuffling 204
Splitting by Date 205
Training and Testing 206
Summary 210
Suggested Resources 210
6 Bayesian Classifier with Apache Spark on Cloud Dataproc 211
MapReduce and the Hadoop Ecosystem 211
How MapReduce Works 212
Apache Hadoop 214
Google Cloud Dataproc 214
Need for Higher-Level Tools 216
Jobs, Not Clusters 217
Preinstalling Software 219
Quantization Using Spark SQL 221
JupyterLab on Cloud Dataproc 222
Independence Check Using BigQuery 223
Spark SQL in JupyterLab 225
Histogram Equalization 227
Bayesian Classification 231
Bayes in Each Bin 231
Evaluating the Model 233
Dynamically Resizing Clusters 234
Comparing to Single Threshold Model 235
Orchestration 238
Submitting a Spark Job 238
Workflow Template 238
Cloud Composer 239
Autoscaling 240
Serverless Spark 241
Summary 242
Suggested Resources 243
7 Logistic Regression Using Spark ML 245
Logistic Regression 246
How Logistic Regression Works 246
Spark ML Library 249
Getting Started with Spark Machine Learning 250
Spark Logistic Regression 251
Creating a Training Dataset 252
Training the Model 256
Predicting Using the Model 259
Evaluating a Model 260
Feature Engineering 263
Experimental Framework 263
Feature Selection 267
Feature Transformations 271
Feature Creation 274
Categorical Variables 278
Repeatable, Real Time 280
Summary 281
Suggested Resources 282
8 Machine Learning with BigQuery ML 283
Logistic Regression 283
Presplit Data 285
Interrogating the Model 286
Evaluating the Model 287
Scale and Simplicity 289
Nonlinear Machine Learning 290
XGBoost 290
Hyperparameter Tuning 292
Vertex AI AutoML Tables 294
Time Window Features 296
Taxi-Out Time 296
Compounding Delays 298
Causality 299
Time Features 300
Departure Hour 300
Transform Clause 302
Categorical Variable 303
Feature Cross 303
Summary 305
Suggested Resources 306
9 Machine Learning with TensorFlow in Vertex AI 309
Toward More Complex Models 310
Preparing BigQuery Data for TensorFlow 314
Reading Data into TensorFlow 315
Training and Evaluation in Keras 317
Model Function 317
Features 318
Inputs 320
Training the Keras Model 320
Saving and Exporting 322
Deep Neural Network 322
Wide-and-Deep Model in Keras 323
Representing Air Traffic Corridors 323
Bucketing 324
Feature Crossing 325
Wide-and-Deep Classifier 326
Deploying a Trained TensorFlow Model to Vertex AI 327
Concepts 328
Uploading Model 328
Creating Endpoint 330
Deploying Model to Endpoint 330
Invoking the Deployed Model 331
Summary 332
Suggested Resources 333
10 Getting Ready for MLOps with Vertex AI 335
Developing and Deploying Using Python 336
Writing model.py 337
Writing the Training Pipeline 338
Predefined Split 340
AutoML 341
Hyperparameter Tuning 343
Parameterize Model 344
Shorten Training Run 345
Metrics During Training 347
Hyperparameter Tuning Pipeline 347
Best Trial to Completion 349
Explaining the Model 350
Configuring Explanations Metadata 350
Creating and Deploying Model 352
Obtaining Explanations 352
Summary 354
Suggested Resources 355
11 Time-Windowed Features for Real-Time Machine Learning 357
Time Averages 357
Apache Beam and Cloud Dataflow 358
Reading and Writing 360
Time Windowing 362
Machine Learning Training 367
Machine Learning Dataset 367
Training the Model 373
Streaming Predictions 376
Reuse Transforms 377
Input and Output 379
Invoking Model 380
Reusing Endpoint 381
Batching Predictions 384
Streaming Pipeline 385
Writing to BigQuery 385
Executing Streaming Pipeline 386
Late and Out-of-Order Records 387
Possible Streaming Sinks 393
Summary 400
Suggested Resources 401
12 The Full Dataset 403
Four Years of Data 403
Creating Dataset 404
Training Model 409
Evaluation 411
Summary 417
Suggested Resources 417
Conclusion 419
Considerations for Sensitive Data Within Machine Learning Datasets 423
Index 431