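Description: This book is a go-to survival guide for software developers facing a catastrophic website failure. As businesses strive to maximize uptime, Site Reliability Engineering (SRE) has moved to the front line. When your site is down and the fix is urgent, this book serves as a step-by-step framework to follow. Nat Welch's extensive hands-on experience in reliability engineering comes from some of the largest companies on the Internet, companies that are extremely sensitive to outages. His methods for monitoring modern web services, setting up alerts, and evaluating incident response have all been tested in practice, and learning them will serve you well.

Table of contents:

Preface

Chapter 1: Introduction
  A brief history
  What is SRE?
  What is in the book?
  SRE as a framework for new projects
  Summary
  References

Chapter 2: Monitoring
  Why monitoring?
  Instrumenting an application
  What should we measure?
  A short introduction to SLIs, SLOs, and error budgets
  Service levels
  Error budgets
  Collecting and saving monitoring data
  Polling applications
  Nagios
  Prometheus
  Cacti
  Sensu
  Push applications
  StatsD
  Telegraf
  ELK
  Displaying monitoring information
  Arbitrary queries
  Graphs
  Dashboards
  Chatbots
  Managing and maintaining monitoring data
  Communicating about monitoring
  Do they even know t ...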