97 Things Every SRE Should Know: Collective Wisdom from the Experts

by Emil Stolarsky, Jaime Woo

Paperback

$49.99 

Overview

Site reliability engineering (SRE) is more relevant than ever. Knowing how to keep systems reliable has become a critical skill. With this practical book, newcomers and old hats alike will explore a broad range of conversations happening in SRE. You'll get actionable advice on several topics, including how to adopt SRE, why SLOs matter, when you need to upgrade your incident response, and how monitoring and observability differ.

Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.

Some of the 97 things you should know:

  • "Test Your Disaster Plan"—Tanya Reilly
  • "Integrating Empathy into SRE Tools"—Daniella Niyonkuru
  • "The Best Advice I Can Give to Teams"—Nicole Forsgren
  • "Where to SRE"—Fatema Boxwala
  • "Facing That First Page"—Andrew Louis
  • "I Have an Error Budget, Now What?"—Alex Hidalgo
  • "Get Your Work Recognized: Write a Brag Document"—Julia Evans and Karla Burnett

Product Details

ISBN-13: 9781492081494
Publisher: O'Reilly Media, Incorporated
Publication date: 12/29/2020
Pages: 250
Product dimensions: 6.00(w) x 9.00(h) x (d)

About the Authors

Emil Stolarsky is a site reliability engineer who previously worked on caching, performance, and disaster recovery at Shopify and on the internal Kubernetes platform at DigitalOcean. He is the program co-chair for SREcon EMEA 2019 and SREcon Americas West 2020, and contributed a chapter to the O’Reilly book “Seeking SRE.”

Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He spent three years as a molecular biologist before working at DigitalOcean, Riot, and Shopify, where he launched the engineering communications function.

Table of Contents

Preface

Part I New to SRE

1 Site Reliability Engineering in Six Words by Alex Hidalgo

2 Do We Know Why We Really Want Reliability? by Niall Murphy

3 Building Self-Regulating Processes by Denise Yu

4 Four Engineers of an SRE Seder by Jacob Scott

5 The Reliability Stack by Alex Hidalgo

6 Infrastructure: It's Where the Power Is by Charity Majors

7 Thinking About Resilience by Justin Li

8 Observability in the Development Cycle by Charity Majors and Liz Fong-Jones

9 There Is No Magic by Bouke van der Bijl

10 How Wikipedia Is Served to You by Effie Mouzeli

11 Why You Should Understand (a Little) About TCP by Julia Evans

12 The Importance of a Management Interface by Salim Virji

13 When It Comes to Storage, Think Distributed by Salim Virji

14 The Role of Cardinality by Charity Majors and Liz Fong-Jones

15 Security Is like an Onion by Lucas Fontes

16 Use Your Words by Tanya Reilly

17 Where to SRE by Fatema Boxwala

18 Dear Future Team by Frances Rees

19 Sustainability and Burnout by Denise Yu

20 Don't Take Advice from Graybeards by John Looney

21 Facing That First Page by Andrew Louis

Part II Zero to One

22 SRE, at Any Size, Is Cultural by Matthew Huxtable

23 Everyone Is an SRE in a Small Organization by Matthew Huxtable

24 Auditing Your Environment for Improvements by Joan O'Callaghan

25 With Incident Response, Start Small by Thai Wood

26 Solo SRE: Effecting Large-Scale Change as a Single Individual by Ashley Poole

27 Design Goals for SLO Measurement by Ben Sigelman

28 I Have an Error Budget, Now What? by Alex Hidalgo

29 How to Change Things by Joan O'Callaghan

30 Methodological Debugging by Avishai Ish-Shalom and Nati Cohen

31 How Startups Can Build an SRE Mindset by Tamara Miner

32 Bootstrapping SRE in Enterprises by Vanessa Yiu

33 It's Okay Not to Know, and It's Okay to Be Wrong by Todd Palino

34 Storytelling Is a Superpower by Anita Clarke

35 Get Your Work Recognized: Write a Brag Document by Julia Evans and Karla Burnett

Part III One to Ten

36 Making Work Visible by Lorin Hochstein

37 An Overlooked Engineering Skill by Murali Suriar

38 Unpacking the On-Call Divide by Jason Hand

39 The Maestros of Incident Response by Andrew Louis

40 Effortless Incident Management by Suhail Patel, Miles Bryant, and Chris Evans

41 If You're Doing Runbooks, Do Them Well by Spike Lindsey

42 Why I Hate Our Playbooks by Frances Rees

43 What Machines Do Well by Michelle Brush

44 Integrating Empathy into SRE Tools by Daniella Niyonkuru

45 Using ChatOps to Implement Empathy by Daniella Niyonkuru

46 Move Fast to Unbreak Things by Michelle Brush

47 You Don't Know for Sure Until It Runs in Production by Ingrid Epure

48 Sometimes the Fix Is the Problem by Jake Pittis

49 Legendary by Elise Gale

50 Metrics Are Not SLIs (The Measure Everything Trap) by Brian Murphy

51 When SLOs Attack: Pathological SLOs and How to Fix Them by Narayan Desai

52 Holistic Approach to Product Reliability by Kristine Chen and Bart Ponurkiewicz

53 In Search of the Lost Time by Ingrid Epure

54 Unexpected Lessons from Office Hours by Tamara Miner

55 Building Tools for Internal Customers that They Actually Want to Use by Vinessa Wan

56 It's About the Individuals and Interactions by Vinessa Wan

57 The Human Baseline in SRE by Effie Mouzeli

58 Remotely Productive or Productively Remote by Avleen Vig

59 Of Margins and Individuals by Kurt Andersen

60 The Importance of Margins in Systems by Kurt Andersen

61 Fewer Spreadsheets, More Napkins by Jacob Bednarz

62 Sneaking in Your DevOps Deliciously by Vinessa Wan

63 Effecting SRE Cultural Changes in Enterprises by Vanessa Yiu

64 To All the SREs I've Loved by Felix Glaser

65 Complex: The Most Overloaded Word in Technology by Laura Nolan

Part IV Ten to Hundred

66 The Best Advice I Can Give to Teams by Nicole Forsgren

67 Create Your Supporting Artifacts by Daria Barteneva and Eva Parish

68 The Order of Operations for Getting SLO Buy-In by David K. Rensin

69 Heroes Are Necessary, but Hero Culture Is Not by Lei Lopez

70 On-Call Rotations that People Want to Join by Miles Bryant, Chris Evans, and Suhail Patel

71 Study of Human Factors and Team Culture to Improve Pager Fatigue by Daria Barteneva

72 Optimize for MTTBTB (Mean Time to Back to Bed) by Spike Lindsey

73 Mitigating and Preventing Cascading Failures by Rita Lu

74 On-Call Health: The Metric You Could Be Measuring by Caitie McCaffrey

75 Helping Leaders Prioritize On-Call Health by Caitie McCaffrey

76 The SRE as a Diplomat by Johnny Boursiquot

77 The Forward-Deployed SRE by Johnny Boursiquot

78 Test Your Disaster Plan by Tanya Reilly

79 Why Training Matters to an SRE Practice and SRE Matters to Your Training Program by Jennifer Petoff

80 The Power of Uniformity by Chris Evans, Suhail Patel, and Miles Bryant

81 Bytes per User Value by Arshia Mufti

82 Make Your Engineering Blog a Priority by Anita Clarke

83 Don't Let Anyone Run Code in Your Context by John Looney

84 Trading Places: SRE and Product by Shubheksha Jalan

85 You See Teams, I See Product by Avleen Vig

86 The Performance Emergency Fund by Dawn Parzych

87 Important but Not Urgent: Roadmaps for SREs by Laura Nolan

Part V The Future of SRE

88 That 50% Thing by Tanya Reilly

89 Following the Path of Safety-Critical Systems by Heidy Khlaaf

90 Applicable and Achievable Static Analysis by Heidy Khlaaf

91 The Importance of Formal Specification by Hillel Wayne

92 Risk and Rot in Sociotechnical Systems by Laura Nolan

93 SRE in Crisis by Niall Murphy

94 Expected Risk Limitations by Blake Bisset

95 Beyond Local Risk: Accounting for Angry Birds by Blake Bisset

96 A Word from Software Safety Nerds by J. Paul Reed

97 Incidents: A Window into Gaps by Lorin Hochstein

98 The Third Age of SRE by Björn "Beorn" Rabenstein

Contributors

Index

About the Editors
