[The Hub] Better control of downtime detection


I have several sites that consistently show up as ‘down’ in the hub. There’s a setting available to keep me from getting notified unless the sites stay down for a while, but there is no setting to determine what constitutes a ‘down’ site.

WPMU support tells me the issue with these sites is slow response. There is probably nothing I can do about that without investing in a faster server and that’s not possible right now. So what I need is to be able to adjust the timeout threshold, not the threshold for reporting.

This is critical for me to effectively manage my sites. At the moment I can’t rely at all on the uptime detection in the hub.

  • Kris Tomczyk
    • Ex Staff

    Hi Chuck

    I hope you are doing well today.

    If I understand this correctly, when it comes to the timeout threshold, it could have a negative impact on the site. As you possibly know, Uptime pings your site’s homepage every 2 minutes
    https://wpmudev.com/docs/hub-2-0/uptime/#response-time

    That 2-minute interval is a safe one: it keeps the server from being flooded with pings from Uptime. The current notification threshold is the safer option in this case. If we implemented a timeout threshold, it would probably require more frequent pings. Please correct me if I’m wrong. Are we on the same page about your query?

    When it comes to errors on the site, Uptime always points to the main error, and we have that covered in the docs as well: https://wpmudev.com/docs/hub-2-0/uptime/#downtime-issues. That can help to narrow down what the main reason could be.

    Could you give us more details on how this timeout threshold should work, so that we can be on the same page and consult the HUB developers about it?

    Kind Regards,
    Kris

  • Tony G
    • Mr. LetsFixTheWorld

    Please pardon my jumping in here with notes on this topic that’s “near and dear to my heart”.

    1. Slow doesn’t mean Down.

    2. No longer as-slow doesn’t mean “back Up”.

    3. I would really like a REST mechanism that allows my server to ping the WPMU DEV server with messages for “Scheduled Restart” and “Restart complete”, or a simple “stop”/”start”. If the ping pauses for some time period and/or until the “start” message is received, we would avoid a separate email, hub notifications, and other alerts for every site on a single server that is doing normal maintenance.

    4. Rather than a Down/Up notification for slow responses, I’d prefer a single email for “consistently slow”. Distinguish between event and pattern. ‘Down’ and ‘Up’ alert emails don’t help if we can’t associate them with cause and effect, or follow-up with testing to know better or worse.
    The following Response Time screenshot shows an average of 1257ms for a period of time, followed by a significant drop. Yay. I’d like to be able to use that data to hopefully correlate events to determine what made the response time better or worse.

    [attachment omitted]

    5. I’d like the ability to graph data across multiple sites on a single server (or any grouping of sites) so that I can easily see when many sites experience an event. This can help to identify periods of high activity on one site which are not on another – which is good information about individual site durability compared to a simple down/up metric. Or, if many sites in a group have a similar response-time change, that’s another metric we can use to understand an issue with a server, network, datacenter, geo-location, etc.

    6. I’d like the uptime processes to hit my servers from different global locations. If response time is consistent (not “the same”) from multiple locations, we know the issue is on our side. If the response time is inconsistent, we know there is some internet-related issue that is causing a response delay. This tells us the problem isn’t on the site or server … and should preclude a slew of email notifications.

    7. The following pattern tells me that response time might be related to the time of day (somewhere in the world). Is that on my side in the USA? Is that because of international telecomms from Australia or wherever the servers are that run Uptime? We don’t know because we don’t have enough information about where the servers are. (And it’s never clear to me whether time is in UTC, my admin time, server local time… The Alert info is in UTC but that doesn’t seem to correlate to the Response time chart. There should be a timezone toggle.)

    [attachment omitted]

    8. And this is fascinating : For a few days this month response time for one site was much better than for the rest of the month … and yet that doesn’t correlate to another site on the same server. Why do I need to connect these dots?

    [two attachments omitted]

    9. I would really Really REALLY like WPMU DEV to provide push or pull REST access to the response time data. We could use that for cross-site analyses, server log correlation, and other diagnostics which WPMU DEV will never implement. This would eliminate the need for the company to have to guess at what people want, and would allow us to be creative about using the data for intelligent management … and share our tooling and insight.

    10. Allow us to add notes to datapoints in the Response Time graph. This might allow us to note when we take some action in response or in remediation, and see performance changes that might be related to specific actions.

    11. Keep data for a year. One month isn’t enough to recognize trends when we’re working with many sites.
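
    As a rough illustration of item 6, the comparison across probe locations could look something like the sketch below. Everything in it is an assumption for the sake of the example: the function name, the location labels, and the 25% tolerance are made up, and nothing here reflects how Uptime actually works.

```python
from statistics import mean


def diagnose(response_ms_by_location, tolerance=0.25):
    """Guess where a slowdown originates from per-location response times.

    response_ms_by_location: hypothetical mapping of probe location -> ms.
    tolerance: assumed relative spread below which locations count as consistent.
    """
    times = list(response_ms_by_location.values())
    if len(times) < 2:
        return "unknown"  # one vantage point cannot separate site from network
    avg = mean(times)
    if avg <= 0:
        return "unknown"
    spread = (max(times) - min(times)) / avg
    # Similar timings everywhere point at the site or server itself;
    # diverging timings point at something on the network path.
    return "site-or-server" if spread <= tolerance else "network-path"
```

    With readings like `{"us": 1200, "de": 1300}` this reports `site-or-server`, while `{"us": 300, "de": 1500}` reports `network-path`.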

    If you can’t provide tools, provide data and foster an industry of tooling based on your platform! I’ve been saying this for years and we’re still here with …

    “Give us more details so we can talk with HUB developers”. C’mon. If they can’t figure out all of these things and pro-actively implement them … over a period of years … there’s no chance that one-off discussions are suddenly going to spin off useful enhancements.

    By providing better access to data via REST and hooks, significant burden is taken off of the company to make decisions about features. Be our partner. Stop trying to be the one-stop software/saas provider – it’s not working.

    I used to work at a company where developers with no real world experience became managers and decision makers, simply because they moved up with time and promotion. But they never actually used the products in the field and didn’t really understand what the existing and prospect audiences needed. I always have a similar impression when I look at some of the WPMU DEV offerings. Please – get people in there who represent the target audience. And trust the ecosystem to build on the core. These are concepts already central to the WordPress industry itself. Learn from that.

  • Patrick Freitas
    • FLS

    Hi Tony G

    I hope you are doing well.

    1. Slow doesn’t mean Down.

    That’s correct, but as you know, the system has to time out and then try again. Slow performance can cause false positives: we ping the site a few times, and if all of those attempts time out, Uptime may treat it as a down event.

    In any case, a slow response is always something to be investigated.

    2. No longer as-slow doesn’t mean “back Up”.

    Same as above: Uptime mainly looks at the response to a HEAD request. A timeout is considered down, a 200 is an up event, and any non-200 is down.

    3. I would really like a REST mechanism that allows my server to ping the WPMU DEV

    I understand it may help, but even during a maintenance period it would be good to know about the down and up events, so you can, for example, monitor whether the site is up for 3rd party systems as well and calculate how long the restart took. I know it depends on the approach and varies from company to company. I found a similar request already filed with our developers, and I added the REST API suggestion to that task as well.

    4. Rather than a Down/Up notification for slow responses, I’d prefer a single email for “consistently slow”.

    That would make sense, but as I mentioned, if we don’t receive a response back we can’t really say the site is slow.

    What could work here is a threshold: for example, if your response time rises to 1000ms, it notifies you of a “slow response”.
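
    To make the distinction concrete, here is a minimal sketch of how a single check could be classified, following the rules described in this reply (timeout = down, 200 = up, any non-200 = down) plus the suggested slow threshold. The function name and parameters are hypothetical, and the 1000ms default is just the example figure above.

```python
def classify_check(status_code, elapsed_ms, timed_out, slow_threshold_ms=1000):
    """Classify one check as 'up', 'slow', or 'down' (illustrative only)."""
    if timed_out:
        return "down"  # no response at all within the timeout
    if status_code != 200:
        return "down"  # the thread describes any non-200 as down
    if elapsed_ms >= slow_threshold_ms:
        return "slow"  # answered with a 200, just not quickly
    return "up"
```

    Under these assumptions a 200 that takes 1257ms is reported as `slow` rather than `down`, which is the separation being asked for in this thread.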

    5. I’d like the ability to graph data across multiple sites on a single server (or any grouping of sites) so that I can easily see when many sites experience an event

    Something similar to security center maybe? https://wpmudev.com/hub2/security-center/antibot-global-firewall where you can control all sites in a single place.

    I believe that could be a nice idea to have some data comparing the sites. I’ve created a feature request to our product team.

    6. I’d like the uptime processes to hit my servers from different global locations.

    We ping the site from the US and Germany

    https://whatismyipaddress.com/ip/34.196.51.17
    https://whatismyipaddress.com/ip/35.157.144.199

    We do have plans to enhance the Uptime experience, which will potentially bring more locations. We can’t give an estimated time yet; currently the HUB team is working on Standalone client management https://wpmudev.com/roadmap/#upcoming-the-hub-2-0

    7. The following pattern tells me that response time might be related to the time of day

    That is the average of our pings from both locations.

    8. And this is fascinating : For a few days this month response time for one site was much better than for the rest of the month … and yet that doesn’t correlate to another site on the same server. Why do I need to connect these dots?

    The response time gives you a picture of what is going on in the website. The server is only one variable that could affect it; we also need to consider plugins, caching, background processes, etc. Uptime is not able to determine the cause, but the graphs can help you narrow down a specific timeframe. I usually rely on that along with the slow log and access logs.

    9. I would really Really REALLY like WPMU DEV to provide push or pull REST access to the response time data.

    I understand your wish to create tools and enhance your workflow, and I agree an API would be interesting for technical people like you. We can’t say we “never will implement” it, but unfortunately it is not in our upcoming plans.

    10. Allow us to add notes to datapoints in the Response Time graph

    That would be great, but it is not possible with the current Uptime setup. We save / fetch the data from lambda / API server, so for the notes we would also need to map an extra table, which at this moment we don’t have.

    Internally, HUB is just an interface for all the data, unlike some other services, for which we have specific tables / databases.

    However, I found an old feature request for that and added this as an extra vote for our product team.

    11. Keep data for a year. One month isn’t enough to recognize trends when we’re working with many sites

    I couldn’t find similar existing requests but I forwarded it to our product team as a fresh feature request.

    C’mon. If they can’t figure out all of these things and pro-actively implement them … over a period of years … there’s no chance that one-off discussions are suddenly going to spin off useful enhancements.

    All requests are escalated internally, and we keep track of them: whenever a similar request is created, we add it as an extra vote for our product team. We ask for more information on some requests to make sure we escalate the right one.

    Best Regards
    Patrick Freitas

  • Tony G
    • Mr. LetsFixTheWorld

    Patrick – I’m almost always completely satisfied with your responses – whether the answer is yes or no, the answer is always well-researched and considered. That’s all I ask, and I thank you sincerely for your diligence.

    I’ll add a related note to the pile. A site in maintenance mode is not “down”, it’s “not available to the public”.

    This site is in maintenance mode as I’m developing plugins. It hasn’t been “down”. It’s not returning the response expected, but that doesn’t mean it’s not responding. These are different concepts.

    [attachment omitted]

    And despite this motley crew of response codes, the site is always up and consistently in maintenance mode.

    [attachment omitted]

    I think I can summarize that Uptime was a great v1.0 and has potential for a new round of improvements for the real world. The world is not simply “down” or “up”. Response time may be slow. A system may not respond to every request with a 200. There are predictable reasons for delays and unavailability which preclude streams of emails and other notifications. WPMU DEV has a tendency to publish MVP features to check boxes for sales/marketing, and then not go back to them. Sometimes that v1.0 is good enough, and there are some awesome exceptions to the pattern. Uptime is a mature component that I believe needs a fresh round of imagination for a v2.0.

  • Luis Soriano
    • Staff

    Hi Tony G

    Firstly, I really appreciate you taking the time to share this feedback. By the way, I’ve noticed you’ve made quite a few valuable contributions around here, and it’s clear you really care about improving our products and services!

    I’ll add a related note to the pile. A site in maintenance mode is not “down”, it’s “not available to the public”

    You make a great point that a site in maintenance mode is not technically “down” in the sense of being completely unresponsive. When a site is in maintenance mode, it is actively returning a response, often a 503 status code, to indicate that the site is temporarily unavailable. This is intentional behavior designed to let both users and search engines know that the site is undergoing updates or maintenance, not that it has failed entirely. From a technical perspective, the server is fully functional, just configured to restrict access temporarily.

    That said, using the term “down” is a simplified way to communicate to non-technical users that the site cannot currently be accessed. Most people associate “down” with any situation where they cannot reach a site, regardless of whether the server is running, in maintenance mode, or facing temporary performance issues. The shorthand avoids confusion for general users who might not be familiar with HTTP status codes or nuanced server behaviors, but on the other hand I understand the technical accuracy you are expecting.

    In cases where a site is returning a 503 error, the root cause could be more than just maintenance mode, just to mention a few:
    – Server overload.
    – Application crashes
    – Resource limitations
    – Backend service failures, etc.

    All these scenarios can also trigger the 503 response. Similarly, misconfigured security rules or firewalls can create the same effect, making a site appear “down” even though the server is online and sending responses. Maintenance mode simply uses this same status code to clearly indicate an intentional, temporary restriction, which is the reason the term “Website Down” is used.

    Ultimately, both perspectives are valid depending on the audience. From a developer or technical standpoint, saying “not available to the public” is more precise and reflects that the server is operating as expected. From a user-facing perspective, however, “down” is a quick, accessible way to describe the experience of not being able to reach a site, which allows communicating effectively with those less familiar with server operations.

    I think I can summarize that Uptime was a great v1.0 and has potential for a new round of improvements for the real world. The world is not simply “down” or “up”. Response time may be slow. A system may not respond to every request with a 200.

    You’ve summarized it well—uptime monitoring absolutely has room to evolve beyond the binary idea of “up” or “down.” Real-world performance isn’t always that simple, as slow response times, intermittent failures, or unexpected status codes can create a poor user experience even if a site is technically “up.”

    It would be great if you could share more details about the types of improvements or insights you’d like to see in future versions in this area. Keep in mind, though, that Uptime already adds status codes to its reports to provide more specific detail, as you noticed. But since the tool cannot go deeper or access an unavailable site, it may be quite challenging to report the specific reason for a 503 error, for instance.

    WPMU DEV has a tendency to publish MVP features to check boxes for sales/marketing, and then not go back to them. Sometimes that v1.0 is good enough, and there are some awesome exceptions to the pattern. Uptime is a mature component that I believe needs a fresh round of imagination for a v2.0.

    We really appreciate your honest feedback, and we completely understand where you’re coming from. We do launch new “MVP” versions to get features into members’ hands quickly; this approach is meant to start a conversation and gather valuable feedback from our members. Our goal is always to iterate and improve based on real-world usage rather than leaving features stagnant. Uptime, like all of our tools, remains a priority for continued development, and your comments are exactly the kind of insight that helps us shape its future direction.

    In fact, much of our roadmap and many of the updates we’ve released over the years (for instance moving to HUB 2.0, Client & Billing, SmartCrawl/Hummingbird/Smush in 3.+, and Defender growing to 5.+!) have been directly driven by member suggestions and feedback. (Just in this ticket, I confirmed Patrick added several feature requests for our team to be aware of.)

    We’re committed to refining existing features, not just releasing new ones, and you can be certain that tools like Uptime will continue to evolve and remain useful.

    Thanks again!

    Kind regards

    Luis

  • Tony G
    • Mr. LetsFixTheWorld

    Luis Soriano I thank you sincerely for your time and consideration.

    I’ve been a client of WPMU DEV for 10 years. I’ve gone from being a cheerleader to being rather grouchy here, because despite the warm welcoming of change requests, some products like Uptime do go for many years with the same obvious issues unaddressed. But I encourage you and the team to pay more attention to the message than the messenger – I just growl, I don’t bite. :dog: :slight_smile:

    With regard to user friendliness, I see two aspects to this.

    1 – About the UX for site visitors: Uptime doesn’t have to hit the front-page of a site to determine site status. It would be trivial for Uptime to hit some page other than the homepage, like domain.tld/uptime-uri. That takes UX out of the equation.

    “But if we hit a different page the response time will be different”

    Yes, but “UP” “Time” is not a performance tool, by definition. And it’s erroneously reporting slowness/performance as up/down … which is exactly the problem being discussed. If “UP” “Time” is a performance tool, please rename it to something like “Hummingbird” so we don’t get confused. If “UP” “Time” gets a 200 after a long period of time, send us Marketing material for how to fix that problem with Hummingbird, but don’t report the system as being down.
    If you get an HTTP timeout, OK, that’s a status 50x-type error, we’re not just slow, we’re not “UP”. That’s worthy of reporting or other processing by a tool called Uptime.

    2 – And UX for site admins looking at Uptime reports: I think we can be trusted to deal with the difference between Up/Down and Fast/Slow. Just tell us what it is, accurately.
    What other UX is there to be concerned about? We’re admins here, not site visitors.

    About detection of the nature of a service interruption…

    When we put a site into maintenance mode or restart the server or perform some other operation, we can script that endpoint to return different content with a 200 code (exactly what Branda does for maintenance or “coming soon” mode), or to redirect with a 302, 303, or 307.

    A 503 response may include a Retry-After header (3600 = 1 hour) to indicate when the server should be available again. Retry-After is an RFC-supported standard, respected by common crawlers. While it would be cool if we could set that from The Hub or individual accounts, that’s not necessary. Get the Uptime clients to check for the header when they get a 503 and disable further queries for the specified time period.
    “I’ve never heard of that!” Well lookie here, Yoast provided the WP code four years ago:
    https://yoast.com/http-503-site-maintenance-seo
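
    A minimal sketch of the client side of that idea, assuming the checker has the status code and a plain dict of response headers in hand. The function name is made up, and header lookup here is case-sensitive for brevity (a real HTTP client would normalize header names). Retry-After can be either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def pause_seconds(status_code, headers, now=None):
    """Return how long to stop checking after a response, in seconds (0 = no pause)."""
    if status_code != 503:
        return 0
    value = headers.get("Retry-After", "").strip()
    if not value:
        return 0
    if value.isdigit():
        return int(value)  # delay-seconds form, e.g. "3600"
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return 0  # unparseable value: do not pause
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((retry_at - now).total_seconds()))
```

    For example, `pause_seconds(503, {"Retry-After": "3600"})` gives 3600, and any non-503 response gives 0.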

    Not sure why you’re getting a 503? Check the X-App-Reason header (… or any other header you define) :
    X-App-Reason: planned-maintenance
    …planned-system-restart
    …wordpress-updates
    …out-to-lunch
    It doesn’t matter what the reason is. But if there is a recognized reason, like planned-system-restart, Uptime should know that a restart “should” take a few minutes, so it should try back soon. Or whether or not the reason is recognized, fall back on the Retry-After header.
    It shouldn’t need to be said: Report the X-App-Reason in Uptime reporting and allow for filtering and sorting on the tag.
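
    Roughly, the lookup described here could work like the sketch below. X-App-Reason is a header I’m proposing, not an existing standard, and the pause lengths, the function name, and the 120-second default (the 2-minute ping interval mentioned earlier in this thread) are placeholder assumptions.

```python
# Hypothetical reason tags mapped to assumed pause lengths in seconds.
EXPECTED_PAUSE = {
    "planned-system-restart": 300,
    "wordpress-updates": 600,
}


def recheck_delay(headers, default=120):
    """Return (reason, seconds to wait before the next check) for a 503 response."""
    reason = headers.get("X-App-Reason")
    if reason in EXPECTED_PAUSE:
        return reason, EXPECTED_PAUSE[reason]  # recognized reason wins
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        return reason, int(retry_after)  # otherwise fall back on Retry-After
    return reason, default  # nothing usable: keep the normal interval
```

    The reason is returned alongside the delay so that it can be stored with the event and used for the filtering and sorting mentioned above.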

    “But if the server is down we won’t even get a 503”

    That’s why I proposed this above…

    > 3. I would really like a REST mechanism that allows my server to ping the WPMU DEV server with messages for “Scheduled Restart” and “Restart complete”, or a simple “stop”/”start”. If the ping pauses for some time period and/or until the “start” message is received, we would avoid a separate email, hub notifications, and other alerts for every site on a single server that is doing normal maintenance.

    Again, this can be sent from The Hub or an individual site, or anywhere else, to advise Uptime that the site won’t be available.
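
    The bookkeeping behind such “stop”/“start” messages is small. The sketch below leaves out the REST layer entirely and only shows the state a monitor would need to keep; the class and method names are invented for illustration, and times are passed in as plain epoch seconds.

```python
class PingPauses:
    """Tracks which servers asked monitoring to pause (hypothetical stop/start messages)."""

    def __init__(self):
        # server id -> resume time in epoch seconds; None means "until start arrives"
        self._resume_at = {}

    def stop(self, server, now, duration=None):
        """Record a 'stop' message, optionally with a maximum pause in seconds."""
        self._resume_at[server] = None if duration is None else now + duration

    def start(self, server):
        """Record a 'start' message and resume checks immediately."""
        self._resume_at.pop(server, None)

    def should_ping(self, server, now):
        if server not in self._resume_at:
            return True
        resume_at = self._resume_at[server]
        if resume_at is not None and now >= resume_at:
            del self._resume_at[server]  # pause expired on its own
            return True
        return False
```

    One “stop” for a server would then silence checks for every site mapped to it until either the “start” arrives or the time limit runs out.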

    “We don’t want to expose an inbound HTTP endpoint”

    Sigh. OK, here’s another option, not completely effective but better than nothing…. The special “uptime-uri” web page can return a 302 to redirect to another page that includes an X-header that is a message to Uptime, like:
    X-For-Uptime: scheduled-maintenance-0400-0500
    (The redirect isn’t necessary but it avoids having to check headers on every code 200.)
    Now Uptime can process that header and schedule for no pings between the specified time range.
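
    Parsing a value like that is straightforward. This sketch assumes the exact `scheduled-maintenance-HHMM-HHMM` shape from the example above; the header itself is my invention, and the value says nothing about timezone, so that would have to be agreed on separately.

```python
import re
from datetime import time

WINDOW = re.compile(r"^scheduled-maintenance-(\d{2})(\d{2})-(\d{2})(\d{2})$")


def parse_window(header_value):
    """Return (start, end) as datetime.time objects, or None if the value does not match."""
    match = WINDOW.match(header_value.strip())
    if not match:
        return None
    h1, m1, h2, m2 = (int(group) for group in match.groups())
    try:
        return time(h1, m1), time(h2, m2)
    except ValueError:
        return None  # e.g. hour 25 or minute 61


def in_window(window, moment):
    """True if moment (a datetime.time) falls inside the window, end excluded."""
    start, end = window
    if start <= end:
        return start <= moment < end
    return moment >= start or moment < end  # window wraps past midnight
```

    `parse_window("scheduled-maintenance-0400-0500")` gives 04:00 to 05:00, and `in_window` tells the checker whether to skip a ping at a given time of day.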

    “But But But … we don’t want to follow a 302 and we don’t want to parse a 503 and we don’t want to expose inbound endpoints and we don’t want to …”

    Oy vey! OK, try this… Any HTTP server can return any status code and any HTTP client can process any status code. So when you ping a page, you’re going to check for a code 200 or a 50x … check for 561, 562, or some other code that isn’t already defined, we can send you that code and let you know why applications are down or why they will be down. Or check for a 2xx code that says “we’re up now but not for long” or a 3xx code that says “you don’t need to redirect now, but check this other URI later”…
    Yes, it’s a very bad idea to use custom HTTP status codes, but if you guys don’t want to do any of the normal things, I have to put the Abby Normal jar on the shelf. (Anyone get that?) :eyes:

    Part of the issue here is not just detecting Up/Down-time, but notifications. We get emails for every event. With 50 sites on a server, one reboot = at least 50 emails for Down and maybe 50 emails for Up (sometimes Ups don’t follow Downs) … Or (cube root of 6… square the remainder… 3xPi … account for wind speed … and the answer is roughly…) about 100 emails for a single (scheduled?) event.
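
    For the arithmetic-averse, here is a sketch of collapsing those per-site events into one message per server. It assumes each event already carries a server label, which is something we would have to tell Uptime ourselves; the dictionary keys and the function name are hypothetical.

```python
from collections import defaultdict


def group_alerts(events):
    """Collapse per-site events into one summary line per (server, status)."""
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["server"], event["status"])].append(event["site"])
    return [
        f"{server}: {len(sites)} site(s) {status} ({', '.join(sorted(sites))})"
        for (server, status), sites in sorted(grouped.items())
    ]
```

    Fifty “down” events for one server become a single line, and the matching “up” events become a second one, so a reboot costs two emails instead of a hundred.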

    Before sending an email, try checking a different location. Ask us in The Hub to provide a secondary URI for Uptime to check when it detects a 503 or long response time or something similar. At this other URI on a different server, we can return a simple .txt file or (not again!) an HTTP header, that verifies “yeah, we know that server/site is down/slow, don’t email us”.

    Summary: There are many small efforts that can improve Uptime significantly. Try them one at a time. Get some feedback … move forward … iterate. There doesn’t need to be “one” solution that fits all scenarios … and IMO it’s a bad policy to continually try to do that. Try multiple solutions that allow us options to choose. Because in this case of events and notifications, everyone wants something different – you’ll never please everyone with a single solution. Whatever you do, I’ll ask that you allow for user-defined hooks, overrides, cut-offs, because some people won’t like some changes, so they need to be able to change it or disable it.

  • Patrick Freitas
    • FLS

    Hi Tony G

    Thank you for the further details. As always, these are all great points, and having seen you around for all these years, I know you always bring good points to the discussion.

    I do agree we have plenty of room for improvement. For example, “adding notes to downtime” is something I personally brought to our developers in the past as well. It is not yet possible due to how we structured the service, but it is an example of what we have been considering.

    There are some other background discussions on how we can improve Uptime, and I made sure to share this ticket in all related tasks.

    In case you have any further suggestions, feel free to share them with us in this ticket or by creating a specific feature request. Member votes are not the only metric, but they are an important one we track when deciding what’s next, and each request having its own thread helps to collect those votes.

    Best Regards
    Patrick Freitas