Sometimes a software application behaves “normally” by the strict definition of its HTTP status codes while, under the surface, something is terribly wrong. In 2017, Checkr’s most important API endpoint went down for 12 hours without detection. This talk describes the incident and how it was handled (what went well and what could have gone better), and explores how you can harden your systems today with simple monitoring patterns.
Conference organizer: https://railsconf.com/