97 Things Every SRE Should Know: Collective Wisdom from the Experts
Paperback
Overview
Editors Jaime Woo and Emil Stolarsky, co-founders of Incident Labs, have collected 97 concise and useful tips from across the industry, including trusted best practices and new approaches to knotty problems. You'll grow and refine your SRE skills through sound advice and thought-provoking questions that drive the direction of the field.
Some of the 97 things you should know:
- "Test Your Disaster Plan," Tanya Reilly
- "Integrating Empathy into SRE Tools," Daniella Niyonkuru
- "The Best Advice I Can Give to Teams," Nicole Forsgren
- "Where to SRE," Fatema Boxwala
- "Facing That First Page," Andrew Louis
- "I Have an Error Budget, Now What?" Alex Hidalgo
- "Get Your Work Recognized: Write a Brag Document," Julia Evans and Karla Burnett
Product Details
ISBN-13: | 9781492081494 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 12/29/2020 |
Pages: | 250 |
Product dimensions: | 6.00(w) x 9.00(h) x (d) |
About the Author
Jaime Woo is an award-nominated writer and a frequent speaker at SREcon EMEA, Americas West, and Americas East. He spent three years as a molecular biologist before working at DigitalOcean, Riot, and Shopify, where he launched the engineering communications function.
Table of Contents
Preface xiii
Part I New to SRE
1 Site Reliability Engineering in Six Words Alex Hidalgo 2
2 Do We Know Why We Really Want Reliability? Niall Murphy 4
3 Building Self-Regulating Processes Denise Yu 6
4 Four Engineers of an SRE Seder Jacob Scott 8
5 The Reliability Stack Alex Hidalgo 10
6 Infrastructure: It's Where the Power Is Charity Majors 12
7 Thinking About Resilience Justin Li 14
8 Observability in the Development Cycle Charity Majors and Liz Fong-Jones 16
9 There Is No Magic Bouke van der Bijl 18
10 How Wikipedia Is Served to You Effie Mouzeli 20
11 Why You Should Understand (a Little) About TCP Julia Evans 22
12 The Importance of a Management Interface Salim Virji 24
13 When It Comes to Storage, Think Distributed Salim Virji 26
14 The Role of Cardinality Charity Majors and Liz Fong-Jones 28
15 Security Is like an Onion Lucas Fontes 30
16 Use Your Words Tanya Reilly 32
17 Where to SRE Fatema Boxwala 34
18 Dear Future Team Frances Rees 36
19 Sustainability and Burnout Denise Yu 38
20 Don't Take Advice from Graybeards John Looney 40
21 Facing That First Page Andrew Louis 42
Part II Zero to One
22 SRE, at Any Size, Is Cultural Matthew Huxtable 45
23 Everyone Is an SRE in a Small Organization Matthew Huxtable 47
24 Auditing Your Environment for Improvements Joan O'Callaghan 49
25 With Incident Response, Start Small Thai Wood 51
26 Solo SRE: Effecting Large-Scale Change as a Single Individual Ashley Poole 53
27 Design Goals for SLO Measurement Ben Sigelman 55
28 I Have an Error Budget, Now What? Alex Hidalgo 57
29 How to Change Things Joan O'Callaghan 59
30 Methodological Debugging Avishai Ish-Shalom and Nati Cohen 61
31 How Startups Can Build an SRE Mindset Tamara Miner 63
32 Bootstrapping SRE in Enterprises Vanessa Yiu 65
33 It's Okay Not to Know, and It's Okay to Be Wrong Todd Palino 67
34 Storytelling Is a Superpower Anita Clarke 69
35 Get Your Work Recognized: Write a Brag Document Julia Evans and Karla Burnett 71
Part III One to Ten
36 Making Work Visible Lorin Hochstein 74
37 An Overlooked Engineering Skill Murali Suriar 76
38 Unpacking the On-Call Divide Jason Hand 78
39 The Maestros of Incident Response Andrew Louis 80
40 Effortless Incident Management Suhail Patel, Miles Bryant, and Chris Evans 82
41 If You're Doing Runbooks, Do Them Well Spike Lindsey 84
42 Why I Hate Our Playbooks Frances Rees 86
43 What Machines Do Well Michelle Brush 88
44 Integrating Empathy into SRE Tools Daniella Niyonkuru 90
45 Using ChatOps to Implement Empathy Daniella Niyonkuru 93
46 Move Fast to Unbreak Things Michelle Brush 95
47 You Don't Know for Sure Until It Runs in Production Ingrid Epure 97
48 Sometimes the Fix Is the Problem Jake Pittis 99
49 Legendary Elise Gale 101
50 Metrics Are Not SLIs (The Measure Everything Trap) Brian Murphy 103
51 When SLOs Attack: Pathological SLOs and How to Fix Them Narayan Desai 105
52 Holistic Approach to Product Reliability Kristine Chen and Bart Ponurkiewicz 107
53 In Search of the Lost Time Ingrid Epure 109
54 Unexpected Lessons from Office Hours Tamara Miner 111
55 Building Tools for Internal Customers that They Actually Want to Use Vinessa Wan 113
56 It's About the Individuals and Interactions Vinessa Wan 115
57 The Human Baseline in SRE Effie Mouzeli 117
58 Remotely Productive or Productively Remote Avleen Vig 119
59 Of Margins and Individuals Kurt Andersen 121
60 The Importance of Margins in Systems Kurt Andersen 123
61 Fewer Spreadsheets, More Napkins Jacob Bednarz 125
62 Sneaking in Your DevOps Deliciously Vinessa Wan 127
63 Effecting SRE Cultural Changes in Enterprises Vanessa Yiu 129
64 To All the SREs I've Loved Felix Glaser 131
65 Complex: The Most Overloaded Word in Technology Laura Nolan 133
Part IV Ten to Hundred
66 The Best Advice I Can Give to Teams Nicole Forsgren 136
67 Create Your Supporting Artifacts Daria Barteneva and Eva Parish 138
68 The Order of Operations for Getting SLO Buy-In David K. Rensin 140
69 Heroes Are Necessary, but Hero Culture Is Not Lei Lopez 142
70 On-Call Rotations that People Want to Join Miles Bryant, Chris Evans, and Suhail Patel 144
71 Study of Human Factors and Team Culture to Improve Pager Fatigue Daria Barteneva 146
72 Optimize for MTTBTB (Mean Time to Back to Bed) Spike Lindsey 148
73 Mitigating and Preventing Cascading Failures Rita Lu 150
74 On-Call Health: The Metric You Could Be Measuring Caitie McCaffrey 152
75 Helping Leaders Prioritize On-Call Health Caitie McCaffrey 154
76 The SRE as a Diplomat Johnny Boursiquot 156
77 The Forward-Deployed SRE Johnny Boursiquot 158
78 Test Your Disaster Plan Tanya Reilly 160
79 Why Training Matters to an SRE Practice and SRE Matters to Your Training Program Jennifer Petoff 162
80 The Power of Uniformity Chris Evans, Suhail Patel, and Miles Bryant 164
81 Bytes per User Value Arshia Mufti 166
82 Make Your Engineering Blog a Priority Anita Clarke 168
83 Don't Let Anyone Run Code in Your Context John Looney 170
84 Trading Places: SRE and Product Shubheksha Jalan 172
85 You See Teams, I See Product Avleen Vig 174
86 The Performance Emergency Fund Dawn Parzych 176
87 Important but Not Urgent: Roadmaps for SREs Laura Nolan 178
Part V The Future of SRE
88 That 50% Thing Tanya Reilly 181
89 Following the Path of Safety-Critical Systems Heidy Khlaaf 183
90 Applicable and Achievable Static Analysis Heidy Khlaaf 185
91 The Importance of Formal Specification Hillel Wayne 187
92 Risk and Rot in Sociotechnical Systems Laura Nolan 189
93 SRE in Crisis Niall Murphy 191
94 Expected Risk Limitations Blake Bisset 193
95 Beyond Local Risk: Accounting for Angry Birds Blake Bisset 195
96 A Word from Software Safety Nerds J. Paul Reed 197
97 Incidents: A Window into Gaps Lorin Hochstein 199
98 The Third Age of SRE Björn "Beorn" Rabenstein 201
Contributors 203
Index 225
About the Editors 232