SRE - Max Fang's Notes

# Site Reliability Engineering (SRE)

*Site reliability engineering (SRE) is a set of principles and practices[1] that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[2] The main goals are to create scalable and highly reliable software systems.[2] Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.[2][3]*
- [Source](https://www.wikiwand.com/en/Site_reliability_engineering): [[Wikipedia]]

## [Awesome Site Reliability Engineering](https://github.com/dastergon/awesome-sre)

*A curated list of awesome Site Reliability and Production Engineering resources.* ([[GitHub]])

### Contents

- [Culture](https://github.com/dastergon/awesome-sre#culture)
- [Education](https://github.com/dastergon/awesome-sre#education)
- [Books](https://github.com/dastergon/awesome-sre#books)
- [Hiring](https://github.com/dastergon/awesome-sre#hiring)
- [Reliability](https://github.com/dastergon/awesome-sre#reliability)
- [Monitoring & Observability & Alerting](https://github.com/dastergon/awesome-sre#monitoring--observability--alerting)
- [On-Call](https://github.com/dastergon/awesome-sre#on-call)
- [Post-Mortem](https://github.com/dastergon/awesome-sre#post-mortem)
- [Capacity Planning](https://github.com/dastergon/awesome-sre#capacity-planning)
- [Service Level Agreement](https://github.com/dastergon/awesome-sre#service-level-agreement)
- [Performance](https://github.com/dastergon/awesome-sre#performance)
- [Programming](https://github.com/dastergon/awesome-sre#programming)
- [Misc Articles](https://github.com/dastergon/awesome-sre#misc-articles)
- [Real-time Messaging](https://github.com/dastergon/awesome-sre#real-time-messaging)
- [Blogs](https://github.com/dastergon/awesome-sre#blogs)
- [Newsletters](https://github.com/dastergon/awesome-sre#newsletters)
- [Conferences & Meetups](https://github.com/dastergon/awesome-sre#conferences--meetups)
- [Twitter](https://github.com/dastergon/awesome-sre#twitter)
- [SRE Tools](https://github.com/dastergon/awesome-sre#sre-tools)

## [Amazon Builders' Library](https://aws.amazon.com/builders-library/)

Links: [[Amazon]]

### [Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/)

### [[Shuffle Sharding]]

![[Shuffle Sharding#Workload isolation using shuffle-sharding https aws amazon com builders-library workload-isolation-using-shuffle-sharding]]

## [Google SRE Books](https://sre.google/books/)

Links: [[Google]]

### Building Secure & Reliable Systems

*Best Practices for Designing, Implementing, and Maintaining Systems*
557 pages
https://sre.google/static/pdf/building_secure_and_reliable_systems.pdf

![[Screen Shot 2022-02-06 at 12.05.52 AM.png]]

![[Screen Shot 2022-02-06 at 12.08.17 AM.png]]

### The Site Reliability Workbook

*Practical Ways to Implement SRE*
https://sre.google/workbook/table-of-contents/

- [Table of Contents](https://sre.google/workbook/table-of-contents/)
- [Foreword I](https://sre.google/workbook/foreword-I/)
- [Foreword II](https://sre.google/workbook/foreword-II/)
- [Preface](https://sre.google/workbook/preface/)
- [Chapter 1 - How SRE Relates to DevOps](https://sre.google/workbook/how-sre-relates/)
- [**Part I - Foundations**](https://sre.google/workbook/part-I-foundations/)
- [Chapter 2 - Implementing SLOs](https://sre.google/workbook/implementing-slos/)
- [Chapter 3 - SLO Engineering Case Studies](https://sre.google/workbook/slo-engineering-case-studies/)
- [Chapter 4 - Monitoring](https://sre.google/workbook/monitoring/)
- [Chapter 5 - Alerting on SLOs](https://sre.google/workbook/alerting-on-slos/)
- [Chapter 6 - Eliminating Toil](https://sre.google/workbook/eliminating-toil/)
- [Chapter 7 - Simplicity](https://sre.google/workbook/simplicity/)
- [**Part II - Practices**](https://sre.google/workbook/part-II-practices/)
- [Chapter 8 - On-Call](https://sre.google/workbook/on-call/)
- [Chapter 9 - Incident Response](https://sre.google/workbook/incident-response/)
- [Chapter 10 - Postmortem Culture: Learning from Failure](https://sre.google/workbook/postmortem-culture/)
- [Chapter 11 - Managing Load](https://sre.google/workbook/managing-load/)
- [Chapter 12 - Introducing Non-Abstract Large System Design](https://sre.google/workbook/non-abstract-design/)
- [Chapter 13 - Data Processing Pipelines](https://sre.google/workbook/data-processing/)
- [Chapter 14 - Configuration Design and Best Practices](https://sre.google/workbook/configuration-design/)
- [Chapter 15 - Configuration Specifics](https://sre.google/workbook/configuration-specifics/)
- [Chapter 16 - Canarying Releases](https://sre.google/workbook/canarying-releases/)
- [**Part III - Processes**](https://sre.google/workbook/part-III-processes/)
- [Chapter 17 - Identifying and Recovering from Overload](https://sre.google/workbook/overload/)
- [Chapter 18 - SRE Engagement Model](https://sre.google/workbook/engagement-model/)
- [Chapter 19 - SRE: Reaching Beyond Your Walls](https://sre.google/workbook/reaching-beyond/)
- [Chapter 20 - SRE Team Lifecycles](https://sre.google/workbook/team-lifecycles/)
- [Chapter 21 - Organizational Change Management in SRE](https://sre.google/workbook/organizational-change/)
- [**Conclusion**](https://sre.google/workbook/conclusion/)
- [Appendix A - Example SLO Document](https://sre.google/workbook/slo-document/)
- [Appendix B - Example Error Budget Policy](https://sre.google/workbook/error-budget-policy/)
- [Appendix C - Results of Postmortem Analysis](https://sre.google/workbook/postmortem-analysis/)
- [Index](https://sre.google/workbook/index/)
- [About the Editors](https://sre.google/workbook/editors/)
- [Colophon](https://sre.google/workbook/colophon/)

### Site Reliability Engineering

*How [[Google]] Runs Production Systems*
https://sre.google/sre-book/table-of-contents/

- [Table of Contents](https://sre.google/sre-book/table-of-contents/)
- [Foreword](https://sre.google/sre-book/foreword/)
- [Preface](https://sre.google/sre-book/preface/)
- [**Part I - Introduction**](https://sre.google/sre-book/part-I-introduction/)
- [Chapter 1 - Introduction](https://sre.google/sre-book/introduction/)
- [Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE](https://sre.google/sre-book/production-environment/)
- [**Part II - Principles**](https://sre.google/sre-book/part-II-principles/)
- [Chapter 3 - Embracing Risk](https://sre.google/sre-book/embracing-risk/)
- [Chapter 4 - Service Level Objectives](https://sre.google/sre-book/service-level-objectives/)
- [Chapter 5 - Eliminating Toil](https://sre.google/sre-book/eliminating-toil/)
- [Chapter 6 - Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Chapter 7 - The Evolution of Automation at Google](https://sre.google/sre-book/automation-at-google/)
- [Chapter 8 - Release Engineering](https://sre.google/sre-book/release-engineering/)
- [Chapter 9 - Simplicity](https://sre.google/sre-book/simplicity/)
- [**Part III - Practices**](https://sre.google/sre-book/part-III-practices/)
- [Chapter 10 - Practical Alerting](https://sre.google/sre-book/practical-alerting/)
- [Chapter 11 - Being On-Call](https://sre.google/sre-book/being-on-call/)
- [Chapter 12 - Effective Troubleshooting](https://sre.google/sre-book/effective-troubleshooting/)
- [Chapter 13 - Emergency Response](https://sre.google/sre-book/emergency-response/)
- [Chapter 14 - Managing Incidents](https://sre.google/sre-book/managing-incidents/)
- [Chapter 15 - Postmortem Culture: Learning from Failure](https://sre.google/sre-book/postmortem-culture/)
- [Chapter 16 - Tracking Outages](https://sre.google/sre-book/tracking-outages/)
- [Chapter 17 - Testing for Reliability](https://sre.google/sre-book/testing-reliability/)
- [Chapter 18 - Software Engineering in SRE](https://sre.google/sre-book/software-engineering-in-sre/)
- [Chapter 19 - Load Balancing at the Frontend](https://sre.google/sre-book/load-balancing-frontend/)
- [Chapter 20 - Load Balancing in the Datacenter](https://sre.google/sre-book/load-balancing-datacenter/)
- [Chapter 21 - Handling Overload](https://sre.google/sre-book/handling-overload/)
- [Chapter 22 - Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/)
- [Chapter 23 - Managing Critical State: Distributed Consensus for Reliability](https://sre.google/sre-book/managing-critical-state/)
- [Chapter 24 - Distributed Periodic Scheduling with Cron](https://sre.google/sre-book/distributed-periodic-scheduling/)
- [Chapter 25 - Data Processing Pipelines](https://sre.google/sre-book/data-processing-pipelines/)
- [Chapter 26 - Data Integrity: What You Read Is What You Wrote](https://sre.google/sre-book/data-integrity/)
- [Chapter 27 - Reliable Product Launches at Scale](https://sre.google/sre-book/reliable-product-launches/)
- [**Part IV - Management**](https://sre.google/sre-book/part-IV-management/)
- [Chapter 28 - Accelerating SREs to On-Call and Beyond](https://sre.google/sre-book/accelerating-sre-on-call/)
- [Chapter 29 - Dealing with Interrupts](https://sre.google/sre-book/dealing-with-interrupts/)
- [Chapter 30 - Embedding an SRE to Recover from Operational Overload](https://sre.google/sre-book/operational-overload/)
- [Chapter 31 - Communication and Collaboration in SRE](https://sre.google/sre-book/communication-and-collaboration/)
- [Chapter 32 - The Evolving SRE Engagement Model](https://sre.google/sre-book/evolving-sre-engagement-model/)
- [**Part V - Conclusions**](https://sre.google/sre-book/part-V-conclusions/)
- [Chapter 33 - Lessons Learned from Other Industries](https://sre.google/sre-book/lessons-learned/)
- [Chapter 34 - Conclusion](https://sre.google/sre-book/conclusion/)
- [Appendix A - Availability Table](https://sre.google/sre-book/availability-table/)
- [Appendix B - A Collection of Best Practices for Production Services](https://sre.google/sre-book/service-best-practices/)
- [Appendix C - Example Incident State Document](https://sre.google/sre-book/incident-document/)
- [Appendix D - Example Postmortem](https://sre.google/sre-book/example-postmortem/)
- [Appendix E - Launch Coordination Checklist](https://sre.google/sre-book/launch-checklist/)
- [Appendix F - Example Production Meeting Minutes](https://sre.google/sre-book/production-meeting/)
- [Bibliography](https://sre.google/sre-book/bibliography/)