# Site Reliability Engineering (SRE)
*Site reliability engineering (SRE) is a set of principles and practices[1] that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[2] The main goals are to create scalable and highly reliable software systems.[2] Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.[2][3]* - [Source](https://www.wikiwand.com/en/Site_reliability_engineering): [[Wikipedia]]
## [Awesome Site Reliability Engineering](https://github.com/dastergon/awesome-sre)
*A curated list of awesome Site Reliability and Production Engineering resources.* ([[GitHub]])
### Contents
- [Culture](https://github.com/dastergon/awesome-sre#culture)
- [Education](https://github.com/dastergon/awesome-sre#education)
- [Books](https://github.com/dastergon/awesome-sre#books)
- [Hiring](https://github.com/dastergon/awesome-sre#hiring)
- [Reliability](https://github.com/dastergon/awesome-sre#reliability)
- [Monitoring & Observability & Alerting](https://github.com/dastergon/awesome-sre#monitoring--observability--alerting)
- [On-Call](https://github.com/dastergon/awesome-sre#on-call)
- [Post-Mortem](https://github.com/dastergon/awesome-sre#post-mortem)
- [Capacity Planning](https://github.com/dastergon/awesome-sre#capacity-planning)
- [Service Level Agreement](https://github.com/dastergon/awesome-sre#service-level-agreement)
- [Performance](https://github.com/dastergon/awesome-sre#performance)
- [Programming](https://github.com/dastergon/awesome-sre#programming)
- [Misc Articles](https://github.com/dastergon/awesome-sre#misc-articles)
- [Real-time Messaging](https://github.com/dastergon/awesome-sre#real-time-messaging)
- [Blogs](https://github.com/dastergon/awesome-sre#blogs)
- [Newsletters](https://github.com/dastergon/awesome-sre#newsletters)
- [Conferences & Meetups](https://github.com/dastergon/awesome-sre#conferences--meetups)
- [Twitter](https://github.com/dastergon/awesome-sre#twitter)
- [SRE Tools](https://github.com/dastergon/awesome-sre#sre-tools)
## [Amazon Builders' Library](https://aws.amazon.com/builders-library/)
Links: [[Amazon]]
### [Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/)
### [[Shuffle Sharding]]
![[Shuffle Sharding#Workload isolation using shuffle-sharding https aws amazon com builders-library workload-isolation-using-shuffle-sharding]]
## [Google SRE Books](https://sre.google/books/)
Links: [[Google]]
### Building Secure & Reliable Systems
*Best Practices for Designing, Implementing, and Maintaining Systems*
557 pages
https://sre.google/static/pdf/building_secure_and_reliable_systems.pdf
![[Screen Shot 2022-02-06 at 12.05.52 AM.png]]
![[Screen Shot 2022-02-06 at 12.08.17 AM.png]]
### The Site Reliability Workbook
*Practical Ways to Implement SRE*
https://sre.google/workbook/table-of-contents/
- [Table of Contents](https://sre.google/workbook/table-of-contents/)
- [Foreword I](https://sre.google/workbook/foreword-I/)
- [Foreword II](https://sre.google/workbook/foreword-II/)
- [Preface](https://sre.google/workbook/preface/)
- [Chapter 1 - How SRE Relates to DevOps](https://sre.google/workbook/how-sre-relates/)
- [**Part I - Foundations**](https://sre.google/workbook/part-I-foundations/)
- [Chapter 2 - Implementing SLOs](https://sre.google/workbook/implementing-slos/)
- [Chapter 3 - SLO Engineering Case Studies](https://sre.google/workbook/slo-engineering-case-studies/)
- [Chapter 4 - Monitoring](https://sre.google/workbook/monitoring/)
- [Chapter 5 - Alerting on SLOs](https://sre.google/workbook/alerting-on-slos/)
- [Chapter 6 - Eliminating Toil](https://sre.google/workbook/eliminating-toil/)
- [Chapter 7 - Simplicity](https://sre.google/workbook/simplicity/)
- [**Part II - Practices**](https://sre.google/workbook/part-II-practices/)
- [Chapter 8 - On-Call](https://sre.google/workbook/on-call/)
- [Chapter 9 - Incident Response](https://sre.google/workbook/incident-response/)
- [Chapter 10 - Postmortem Culture: Learning from Failure](https://sre.google/workbook/postmortem-culture/)
- [Chapter 11 - Managing Load](https://sre.google/workbook/managing-load/)
- [Chapter 12 - Introducing Non-Abstract Large System Design](https://sre.google/workbook/non-abstract-design/)
- [Chapter 13 - Data Processing Pipelines](https://sre.google/workbook/data-processing/)
- [Chapter 14 - Configuration Design and Best Practices](https://sre.google/workbook/configuration-design/)
- [Chapter 15 - Configuration Specifics](https://sre.google/workbook/configuration-specifics/)
- [Chapter 16 - Canarying Releases](https://sre.google/workbook/canarying-releases/)
- [**Part III - Processes**](https://sre.google/workbook/part-III-processes/)
- [Chapter 17 - Identifying and Recovering from Overload](https://sre.google/workbook/overload/)
- [Chapter 18 - SRE Engagement Model](https://sre.google/workbook/engagement-model/)
- [Chapter 19 - SRE: Reaching Beyond Your Walls](https://sre.google/workbook/reaching-beyond/)
- [Chapter 20 - SRE Team Lifecycles](https://sre.google/workbook/team-lifecycles/)
- [Chapter 21 - Organizational Change Management in SRE](https://sre.google/workbook/organizational-change/)
- [**Conclusion**](https://sre.google/workbook/conclusion/)
- [Appendix A - Example SLO Document](https://sre.google/workbook/slo-document/)
- [Appendix B - Example Error Budget Policy](https://sre.google/workbook/error-budget-policy/)
- [Appendix C - Results of Postmortem Analysis](https://sre.google/workbook/postmortem-analysis/)
- [Index](https://sre.google/workbook/index/)
- [About the Editors](https://sre.google/workbook/editors/)
- [Colophon](https://sre.google/workbook/colophon/)
### Site Reliability Engineering
*How [[Google]] Runs Production Systems*
https://sre.google/sre-book/table-of-contents/
- [Table of Contents](https://sre.google/sre-book/table-of-contents/)
- [Foreword](https://sre.google/sre-book/foreword/)
- [Preface](https://sre.google/sre-book/preface/)
- [**Part I - Introduction**](https://sre.google/sre-book/part-I-introduction/)
- [Chapter 1 - Introduction](https://sre.google/sre-book/introduction/)
- [Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE](https://sre.google/sre-book/production-environment/)
- [**Part II - Principles**](https://sre.google/sre-book/part-II-principles/)
- [Chapter 3 - Embracing Risk](https://sre.google/sre-book/embracing-risk/)
- [Chapter 4 - Service Level Objectives](https://sre.google/sre-book/service-level-objectives/)
- [Chapter 5 - Eliminating Toil](https://sre.google/sre-book/eliminating-toil/)
- [Chapter 6 - Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)
- [Chapter 7 - The Evolution of Automation at Google](https://sre.google/sre-book/automation-at-google/)
- [Chapter 8 - Release Engineering](https://sre.google/sre-book/release-engineering/)
- [Chapter 9 - Simplicity](https://sre.google/sre-book/simplicity/)
- [**Part III - Practices**](https://sre.google/sre-book/part-III-practices/)
- [Chapter 10 - Practical Alerting](https://sre.google/sre-book/practical-alerting/)
- [Chapter 11 - Being On-Call](https://sre.google/sre-book/being-on-call/)
- [Chapter 12 - Effective Troubleshooting](https://sre.google/sre-book/effective-troubleshooting/)
- [Chapter 13 - Emergency Response](https://sre.google/sre-book/emergency-response/)
- [Chapter 14 - Managing Incidents](https://sre.google/sre-book/managing-incidents/)
- [Chapter 15 - Postmortem Culture: Learning from Failure](https://sre.google/sre-book/postmortem-culture/)
- [Chapter 16 - Tracking Outages](https://sre.google/sre-book/tracking-outages/)
- [Chapter 17 - Testing for Reliability](https://sre.google/sre-book/testing-reliability/)
- [Chapter 18 - Software Engineering in SRE](https://sre.google/sre-book/software-engineering-in-sre/)
- [Chapter 19 - Load Balancing at the Frontend](https://sre.google/sre-book/load-balancing-frontend/)
- [Chapter 20 - Load Balancing in the Datacenter](https://sre.google/sre-book/load-balancing-datacenter/)
- [Chapter 21 - Handling Overload](https://sre.google/sre-book/handling-overload/)
- [Chapter 22 - Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/)
- [Chapter 23 - Managing Critical State: Distributed Consensus for Reliability](https://sre.google/sre-book/managing-critical-state/)
- [Chapter 24 - Distributed Periodic Scheduling with Cron](https://sre.google/sre-book/distributed-periodic-scheduling/)
- [Chapter 25 - Data Processing Pipelines](https://sre.google/sre-book/data-processing-pipelines/)
- [Chapter 26 - Data Integrity: What You Read Is What You Wrote](https://sre.google/sre-book/data-integrity/)
- [Chapter 27 - Reliable Product Launches at Scale](https://sre.google/sre-book/reliable-product-launches/)
- [**Part IV - Management**](https://sre.google/sre-book/part-IV-management/)
- [Chapter 28 - Accelerating SREs to On-Call and Beyond](https://sre.google/sre-book/accelerating-sre-on-call/)
- [Chapter 29 - Dealing with Interrupts](https://sre.google/sre-book/dealing-with-interrupts/)
- [Chapter 30 - Embedding an SRE to Recover from Operational Overload](https://sre.google/sre-book/operational-overload/)
- [Chapter 31 - Communication and Collaboration in SRE](https://sre.google/sre-book/communication-and-collaboration/)
- [Chapter 32 - The Evolving SRE Engagement Model](https://sre.google/sre-book/evolving-sre-engagement-model/)
- [**Part V - Conclusions**](https://sre.google/sre-book/part-V-conclusions/)
- [Chapter 33 - Lessons Learned from Other Industries](https://sre.google/sre-book/lessons-learned/)
- [Chapter 34 - Conclusion](https://sre.google/sre-book/conclusion/)
- [Appendix A - Availability Table](https://sre.google/sre-book/availability-table/)
- [Appendix B - A Collection of Best Practices for Production Services](https://sre.google/sre-book/service-best-practices/)
- [Appendix C - Example Incident State Document](https://sre.google/sre-book/incident-document/)
- [Appendix D - Example Postmortem](https://sre.google/sre-book/example-postmortem/)
- [Appendix E - Launch Coordination Checklist](https://sre.google/sre-book/launch-checklist/)
- [Appendix F - Example Production Meeting Minutes](https://sre.google/sre-book/production-meeting/)
- [Bibliography](https://sre.google/sre-book/bibliography/)