#SiteReliability

Thomas Byern thomas_byern@c.im
2026-01-09

Every system works perfectly until it meets DNS, timezones, certificates, or humans.
Usually at the same time.
In production.
On a Friday.

Experience is just pattern recognition with better alerts.

#Production #DevOps #SiteReliability #EngineeringHumor #IncidentResponse #OnCall #TechReality #ByernNotes

Olivia Madison apmtool
2025-10-29

🚀 Tired of slow applications and rising bounce rates?

Even milliseconds matter when it comes to user experience. Our latest guide covers 10 proven APM best practices to reduce latency and improve response time across your entire stack.

Faster apps = happier users = better business outcomes.

📖 Read the full post here: atatus.com/blog/apm-best-pract

2025-10-20

Today's AWS outage was a stark reminder: what happens when the tools you rely on to manage incidents... are part of the incident?

When Slack, Zoom, PagerDuty, and even Statuspage are impacted, how do you get your response team re-connected to solve the underlying problem? Once they're talking to each other, they can improvise a response, but that first step of re-establishing contact is critical.

This isn't just a hypothetical. It's a real-world scenario that can paralyze even the most prepared organizations. Relying on a plan that's tucked away in a long-forgotten document is a recipe for disaster.

Here's what I recommend to the leaders I advise:

🔹 Have a "Rally Point" Plan: Don't just have a backup concept; have a pre-defined, communicated, and accessible fallback plan. Every second counts in an incident, and you can't waste time figuring out where to communicate. If you normally use Slack and Zoom, then think Google Meet or Microsoft Teams for your backup, and vice versa. Maybe even an old-fashioned conference call bridge. The key is that everyone knows where to go, when the normal places aren't working.

🔹 Make it Accessible: Your plan is useless if it's on a server that nobody can get to at the moment. Laminated wallet cards, a shared password vault with offline access, or a regularly updated file on every employee's laptop are all viable options.

🔹 Practice, Practice, Practice: Fire drills aren't just for fires. Run drills for your fallback communication plan. This ensures everyone remembers it exists and that the mechanisms still work.

🔹 Don't Forget Security: Assume that your fallback channel is compromised, and that outsiders are listening in. Use it just as a rendezvous point to direct responders to more secure, authenticated channels, where you can validate every participant. Don't discuss sensitive information in the open.
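
As a rough illustration of the "rally point" idea above, the sketch below walks an ordered list of channels and returns the first one a health probe reports as reachable. Everything in it is hypothetical: the `Channel` type, the channel names, the join hints, and the `is_up` probe are placeholders, not anything taken from a real tool. The chosen channel is meant only as the rendezvous point; sensitive discussion still moves to an authenticated channel afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Channel:
    """One place responders can meet (all values are illustrative)."""
    name: str
    join_hint: str  # how to join, e.g. a dial-in number printed on a wallet card


def pick_rally_point(
    channels: Sequence[Channel],
    is_up: Callable[[Channel], bool],
) -> Optional[Channel]:
    """Return the first channel, in priority order, that the probe reports as up."""
    for channel in channels:
        if is_up(channel):
            return channel
    return None  # nothing reachable: fall back to the offline copy of the plan


# Hypothetical priority order: normal tool first, then the pre-agreed fallbacks.
RALLY_POINTS = [
    Channel("Slack", "usual incident channel"),
    Channel("Google Meet", "standing fallback meeting link"),
    Channel("Conference bridge", "dial-in number from the wallet card"),
]
```

The ordered list matters more than the code: it has to be agreed in advance, available offline, and exercised in drills so that nobody needs to run anything to know where to go.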

Incidents are costly, not just in revenue, but in reputation and team morale. Proactive preparation isn't a luxury; it's a necessity.

What's your team's communication fallback plan? Share your thoughts in the comments below. 👇

#IncidentManagement #BusinessContinuity #SiteReliability #DevOps #AWSOutage

🚀 We recently helped a client stuck on a slow host migrate their Umbraco site to UmbHost: faster, safer, zero downtime.

✅ Free migration assistance
✅ Daily backups with 7-day retention
✅ DDoS protection & Cloudflare CDN
✅ 99.9% uptime guarantee
✅ UK-based expert support

Need hosting that cares? Drop us a message!

umbhost.net/hosting/cloud-umbr

#Umbraco #Migration #WebHosting #DevOps #SiteReliability

⏳ Downtime costs more than you think: lost sales, frustrated users, damaged reputation.

UmbHost offers a 99.9% uptime SLA with UK-based support and certified Umbraco experts.
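
For a sense of scale, an availability percentage can be turned into a downtime allowance with simple arithmetic. The helper below is a generic sketch, not anything from UmbHost; the function name is made up, and how a particular SLA actually measures uptime (measurement window, planned maintenance, exclusions) is an assumption that varies by provider.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime permitted over a period at a given availability.

    Assumes the whole period counts toward the measurement, with no
    exclusions for planned maintenance.
    """
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


# 99.9% over a 30-day month allows roughly 43.2 minutes of downtime;
# over a 365-day year it allows roughly 525.6 minutes (about 8.76 hours).
monthly = round(downtime_budget_minutes(99.9, 30), 1)
yearly = round(downtime_budget_minutes(99.9, 365), 1)
```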

Typical ticket resolution under 20 minutes.

Want reliable hosting that has your back?

umbhost.net/hosting/cloud-umbr

#WebHosting #Umbraco #SiteReliability #TechSupport

Ismail Kovvuru ismailkovvuru
2025-08-10

Strengthen your cloud systems with the top Chaos Engineering tools for DR: AWS FIS, Gremlin, Chaos Mesh, and Steadybit. Learn how to simulate failures, boost uptime, and improve resilience.
📖 medium.com/@ismailkovvuru/chao

Ismail Kovvuru ismailkovvuru
2025-08-03

DevOps friends 🚀, here’s a compact guide every AWS engineer needs:
🔍 Learn the real-world impact of HTTP status codes in CI/CD, monitoring, and production troubleshooting.
📚 Must-read: medium.com/@ismailkovvuru/http

2024-11-18

Hannaford's recent weeklong outage has me wondering: Do companies truly understand the cost of cutting corners on engineering talent?
These unacceptably long outages, which are occurring more frequently at major retailers, highlight a common problem I'm seeing in tech: undervaluing highly experienced & knowledgeable engineers. It's way past time for companies to rethink their hiring priorities... stop cheaping out on your Ops and Sec talent, it's going to cost you far more in the end!
I'm exceptionally good at building reliable & resilient systems & teams, so it's super frustrating to be unemployed while witnessing preventable outages for which I could have made a difference. Yes, it's true, 30+ years of engineering experience doesn't come cheap, but I'm damn sure my price is far less than the loss in revenue from a weeklong eComm outage at a major business!
Anyway, if yer looking for a decent engineer/leader, please reach out...
#open_to_work #engineering #siteReliability #Technology

mainepublic.org/business-and-e

2024-09-23

No, I did not want to have a system-wide outage this morning, thankyouverymuch 😰

(but we recovered, although not without some sweating. Aren't new and different failure modes fun?)

(no, I'm not an SRE but we're a small shop)

#onCall #siteReliability #SRE

Dotan Horovits #CNCFAmbassador horovits@fosstodon.org
2024-05-29

"What should I monitor? Am I tracking the right metrics?" 📈📊
Common industry metrics frameworks provide useful monitoring guidance for #DevOps and #SRE.
Here's a good overview of the different methods:
logz.io/blog/evops-sre-metrics
#monitoring #observability #sitereliability

2024-02-13

No one ever complains about #steam going down or being slow, despite tens of millions of concurrent users at all times. I'd like to know more about how Valve manages that. The service itself is practically transparent. #sitereliability #devops #cloud #CloudComputing #videogames

Dotan Horovits #CNCFAmbassador horovits@fosstodon.org
2024-02-07

Life of an SRE. I love this pic by @attachmentgenie @cfgmgmtcamp.
It only shows how unsustainable this screen-gazing approach is with today's #microservices #cloudnative systems.
Time to revisit your #siteReliability practices
medium.com/@horovits/sre-revis
#CfgMgmtCamp #SRE #DevOps

2023-11-10

⚠️ Massive outage hits Australia's second-largest telecom provider, leaving millions stranded without mobile and internet services. Imagine that happening to you! Here's what happened and how to avoid it:
relianoid.com/blog/australian-
#TelecomOutage #SiteReliability #RELIANOID #TelecomDisruption #NetworkOutage #TechDowntime #ServiceRestoration #SiteReliabilityEngineering #HighAvailability #TelecomResilience #TechFailures #NetworkReliability #Australia #Australiaattack #outage #vulnerabilities

Mohammed S. Al Sahaf MohammedSahaf@hachyderm.io
2023-04-12

Here are the steps to enable #http3/#quic in #caddy:
....

It takes 0, zero, nil lines to enable and configure #http3/#quic in #CaddyServer! You don't need to do anything special to keep up with the industry standard and progress. Caddy takes care of keeping your services up-to-date.

#systemadministration #sysadmin #devops #sre #web #linux #unix #windows #sitereliability

On-Call Me Maybe Podcast oncallmemaybe
2023-02-28

Catch OCMM co-hosts @adrianamvillela and @anamedina at SLOconf this year!

Pam ✌️ ongpamtaro@hachyderm.io
2022-12-17

#Introduction 👋 Hello World!

I’m a proud #dogMom that loves to overshare photos of my #rescue #dog (Cassie).

Bringing #diversityEquitiyInclusion to #tech motivates me.

Professionally, I’ve had a long career in #softwareEngineering, but am now on a journey in the world of #siteReliability #engineering.

Sometimes I’ll also post things about #food, #coffee, #whiskey / #whisky, #wine, #travel, #nba #basketball, and #snowboarding.

#introductions #dei #womenwhocode #sre #developer #dogs

dog dressed as a taco
