[Uptime Monitor] Bring back the old Uptime

The new version of Uptime gets triggered very easily and I’m getting tired of the false alarms for sites that have supposedly “gone down”.

Every single day since the launch of the new version I get alerts that a site is down. Meanwhile, a third party monitoring service reports nothing – because the sites are in fact online.

I’ve discussed this issue with support in a private ticket but we couldn’t find a solution. I know that “you get better results” etc. but as an end user I’m getting frustrated by the number of these false alarms and I’m considering replacing Uptime.

Please acknowledge the problem and find a solution or bring back the old version that worked fine… at least from my point of view.

  • Adam
    • Support Gorilla

    Hi Isidoros Rigas

    I hope you’re well today!

    The “old version” cannot be brought back and there are multiple reasons for this, including some infrastructure-related changes. I’d also say that it’s highly “debatable” whether it really was better. I mean, I fully understand that it “might have appeared to be working better from your point of view” or “for your needs” but in the long run and from a “broader perspective” – it really wasn’t.

    But I do understand how some of the changes may be “annoying”. One thing that you can do right out of the box is set the threshold for e-mail notifications, e.g. to 5 minutes or more, and that alone would eliminate most of the e-mail notifications for such short incidents. They’ll still be logged in the Uptime log but at least they wouldn’t “bombard” your e-mail.
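
    To illustrate the effect of such a threshold, here is a minimal Python sketch. It is not Uptime's code: the `Incident` class and the `should_email` function are hypothetical names made up for this example, and it only assumes that an incident is e-mailed once it has lasted at least as long as the configured threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One logged downtime incident (hypothetical structure)."""
    site: str
    started: datetime
    ended: datetime  # when the site was seen up again


def should_email(incident: Incident,
                 threshold: timedelta = timedelta(minutes=5)) -> bool:
    """Return True only if the outage lasted at least `threshold`.

    Shorter incidents would still be logged, just not e-mailed.
    """
    return incident.ended - incident.started >= threshold
```

    With a 5-minute threshold, a 2-minute blip stays in the log only, while a 7-minute outage would still trigger an e-mail.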

    There is also ongoing work on further improvements – related to Uptime directly and indirectly – that should help in future, so things should get better for you and other Members who might be seeing things the same way as you. We are working constantly to improve Uptime accuracy and the overall experience.

    Kind regards,
    Adam

  • Tony G
    • Mr. LetsFixTheWorld

    Um, “new” uptime? I didn’t know there were any changes. I haven’t seen any recent notices, and I did used to get bombarded by notices every day. So, from my perspective it seems something changed for the better.

    Can you point us to announcements on this, along with any references to new switches?

    TY!

    • Adam
      • Support Gorilla

      Hi Tony G

      It wasn’t announced because it wasn’t any “major rewrite” – it’s not like “Uptime 2.0”.

      Just like with many updates to the Hub, there is a process of updates (smaller or bigger) happening “in the background” on our end. We’ll surely announce changes if a change is planned that is expected to significantly alter how it all works – by this I mean e.g. new features, other testing points, other/additional settings or options etc.

      This is more like “maintenance releases” with some “tech-level” improvements and code compatibility changes. Isidoros Rigas referred to it as “new Uptime” and I can understand it – I “followed that” in order not to complicate things more. But in reality it isn’t really a “new Uptime” (like e.g. Snapshot 3 vs Snapshot 4) but a set of “behind the scenes” improvements. But if a real “new Uptime” comes up at some point in future, we’ll surely announce it publicly.

      Kind regards,
      Adam

  • Isidoros
    • Eeriee

    Hi Adam and Tony G,

    It wasn’t announced because it wasn’t any “major rewrite” – it’s not like “Uptime 2.0”.

    Well, its developer, Aaron Edwards, literally called it “a major rewrite of our Uptime monitoring”. If that doesn’t classify as a “new version” I don’t know what does. Here’s his tweet explaining how it’s so much better.

    But I do understand how some of the changes may be “annoying”. One thing that you can do right out of the box is set the threshold for e-mail notifications, e.g. to 5 minutes or more, and that alone would eliminate most of the e-mail notifications for such short incidents. They’ll still be logged in the Uptime log but at least they wouldn’t “bombard” your e-mail.

    How can I do that?

  • Isidoros
    • Eeriee

    I’ve complained about the Uptime service before and I’ll complain again because I simply can’t trust it anymore. I have received so many false alarms that I’ve started to ignore them and I actually missed a website that went down.

    Now I changed the domain registrar for nestoraterra.com and Uptime doesn’t stop sending me alerts that it went down. I’ve received 71 emails overnight regarding nestoraterra.com! The server is the same, the nameservers are the same and I haven’t made any changes to the website. I know the server is fine because I own it. And every time I visit the site when I get an alert it loads without any errors or delay.

    I also get at least once a day an alert that geekers.gr went down. Even though it didn’t. I simply ignore it but I have a real issue with the nonstop emails for nestoraterra.com.

    Uptime now sucks and I will stop using it. Is there a quick way to disable it for all sites?

  • Adam
    • Support Gorilla

    Hi Isidoros Rigas

    I understand your point but there is an important thing to note here:

    I checked Uptime for the nestoraterra.com site and all those recent events are due to the “Error: getaddrinfo ENOTFOUND nestoraterra.com” error. It means that Uptime tried to check the site but the domain name could not be resolved.

    That’s usually a DNS issue, and a check here

    https://www.whatsmydns.net/#A/nestoraterra.com

    confirms that at this very moment DNS is not yet fully propagated. This means that the site would be available from some regions and may still be unavailable from others.

    Checking the propagation again, I can see it’s progressing, and there is a chance that by the time you read this it will already be fully propagated, but it was not at the time of writing.

    Some US regions in particular were not yet propagated, and our main checks run from the US. If DNS doesn’t resolve, Uptime is not able to test the site and, getting the error mentioned above, it naturally treats the site as down (and that’s correct).
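
    The “getaddrinfo ENOTFOUND” message means the hostname lookup itself failed, before any request reached the web server. As a rough illustration only (this is not Uptime's code), the same kind of failure can be reproduced in Python, where a failed lookup raises `socket.gaierror`. The `resolve_host` function and its `resolver` parameter are hypothetical names used for this sketch.

```python
import socket


def resolve_host(hostname, resolver=socket.getaddrinfo):
    """Try to resolve `hostname`.

    Returns (True, [ip, ...]) on success or (False, message) when the
    lookup fails. `resolver` is injectable so the function can be
    exercised without network access.
    """
    try:
        infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # No address came back, so there is nothing to connect to and
        # a monitor can only report the site as unreachable.
        return False, f"DNS lookup failed: {exc}"
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

    During a propagation window the same call can succeed from one resolver and fail from another, which would match alerts that appear only from some regions.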

    While I totally understand your frustration over the issue and I’m also well aware that Uptime has its “quirks” and issues too, it really isn’t as simple as “it doesn’t work” – in many cases the problem is at a completely different level.

    But that’s just general information. I understand your point and I’m really sorry that Uptime doesn’t meet your requirements.

    As for disabling it – it needs to be disabled individually for sites, there’s no “bulk disable”, I’m afraid.

    Kind regards,
    Adam

  • Adam
    • Support Gorilla

    Hi Isidoros Rigas

    Yes, of course we can investigate it but since this is a Feature Request ticket, it’s a bit out of the “support procedures” and it would be way easier for us to continue in a regular support ticket.

    I’m sorry for causing additional delay this way but could you, please, simply open a regular support ticket about that?

    You can do it here (don’t start chat, just open a ticket on forum, please):

    https://wpmudev.com/hub2/support/#ask-question

    and we’ll check both these sites – uptime logs and errors on our end etc – to find out what’s happening with those alerts.

    It would be very helpful if you would also grant support access to both of the sites in question when opening a ticket.

    To grant access to any given site:
    – log in to that site and go to the “WPMU DEV -> Support -> Support Access” page in that site’s back-end
    – click on the “Grant support access” button there.

    I’d appreciate it a lot as it would help us investigate it and help you better.

    Kind regards,
    Adam

  • Tony G
    • Mr. LetsFixTheWorld

    Most of my sites report a DOWN alert every day, and only once per day. I have an idea…

    This happens at the same time for all sites. I have the threshold set at 5 minutes to avoid “transient” issues so the frequency is much less than it used to be. UP alerts may or may not follow the DOWN alerts. It’s really just a lot of bogus reporting.

    I’ve recently done a major cleanup of email filters, standardizing admin addresses for all sites and reports, and other housekeeping that allows us to easily see when anything is happening. Now that I can see the pattern I think I have a theory.

    As Adam Czajczyk noted, it’s possible that this is being triggered by our secondary DNS which is sometimes out of sync with the primary. I need to migrate the functionality to a new server as a part of normal maintenance to get it back in sync. I think each time Uptime and other WPMU DEV services poll sites, there is a new DNS check and it’s random whether the primary or secondary server is queried. IF the query happens to hit the secondary/ns2, and IF that server happens to be out of sync at that moment, then DNS won’t resolve and DEV reports it as a functional/application error – not as a problem with reaching the site.

    I know this is an issue because we’ve had the same issue with email. Our NS2 is the same as MX2, and email bounces back to a sender if their MTA randomly resolves to MX2.

    Of course, if you’re checking for an UP system and DNS doesn’t resolve then, yeah, for all practical purposes the “site” is DOWN – mea culpa. But that issue with not noting that it is back UP is something I can’t address here. I have no idea how much time passes between DOWN/UP checks, so The Hub might not report that a site is UP, since it hasn’t been DOWN over some period of time.

    OK, to keep this relevant, it’s now feature request time:

    1) Please separate the processes of DNS resolution from actual site contact so that you can tell the difference between a bunch of Sites being down and a single DNS being down.

    2) Please cache IP addresses in a local DNS and use that as the Primary DNS before reaching out to other internet nameservers. Yes, this can be problematic: if we’re changing DNS then The Hub may be resolving to an old IP where the site is still up, rather than resolving to a new IP where the site might be down. I think the chance of this happening is so tiny that it’s not worth consideration until it’s actually noted as a real problem.

    3) In addition to caching the IP address of individual sites, cache the IP address of authoritative name servers. So for example, if my NS2 leads to a down server report, try NS1 from the local cache – don’t just do another DNS query and assume it will use a different name server.

    HOWEVER – please don’t cache an IP address in a local DNS and then avoid checking for an authoritative name server if your cached IP results in a failure. We have had this issue with mail – where MTAs (Microsoft mail servers in particular) cache bad IP addresses and then don’t refresh them! This results in their server assuming that a one-time failure will always be a failure, so even if our NS2 or MX2 is down for just a few seconds, they never go back to checking NS1 or MX1; they bounce all mail because they have it on record that they bounced other mail some time in recent months.
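
    The three requests and the caveat above can be sketched together in Python. This is only an illustration of the idea, not how The Hub works: `CachedChecker`, `Lookup` and `Probe` are hypothetical names, each “name server” is modelled as a plain function that returns an IP or `None`, and the actual site contact is an injected function.

```python
from typing import Callable, Dict, List, Optional

Lookup = Callable[[str], Optional[str]]  # hostname -> IP, or None if unresolved
Probe = Callable[[str, str], bool]       # (hostname, ip) -> did the site answer?


class CachedChecker:
    """Keeps DNS failures apart from site failures and never trusts a stale cache."""

    def __init__(self, nameservers: List[Lookup], probe: Probe):
        self.nameservers = nameservers
        self.probe = probe
        self.cache: Dict[str, str] = {}

    def _resolve_fresh(self, hostname: str) -> Optional[str]:
        # Request 3: walk the known name servers in order instead of
        # repeating one query and hoping it lands on a different server.
        for lookup in self.nameservers:
            ip = lookup(hostname)
            if ip is not None:
                return ip
        return None

    def check(self, hostname: str) -> str:
        # Request 2: try the cached address first.
        cached = self.cache.get(hostname)
        if cached is not None and self.probe(hostname, cached):
            return "up"
        # Caveat: a failed or missing cache entry is re-resolved, never reused.
        fresh = self._resolve_fresh(hostname)
        if fresh is None:
            self.cache.pop(hostname, None)
            return "dns_failure"  # Request 1: reported apart from "site_down"
        self.cache[hostname] = fresh
        return "up" if self.probe(hostname, fresh) else "site_down"
```

    A checker built with an out-of-sync NS2 listed before a healthy NS1 would still report “up”, and a site whose name cannot be resolved anywhere would be reported as “dns_failure” rather than as a site outage.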

    Sorry for all of the detail here but I hope it helps someone who may be seeing the same kind of issues without knowing about how some of these dots can connect…

    • Adam
      • Support Gorilla

      Hi Tony G

      Thanks for the suggestions!

      I’m not sure how the code behind Uptime works exactly (as I have no access to it) but one thing I can tell you – it’s not as complex as separately checking DNS and other services. It’s more similar to simply doing a cURL request for HTTP headers only.

      It’s similar (not exactly the same, but close) to running the “curl -I https://yoursite.com” command from the command line on your machine. The DNS resolution and all other connection-related stuff happens at the OS level. There quite likely is some DNS caching happening (like on pretty much every system/machine). But whether we can separately and purposefully control DNS checks and IP/DNS caching in the process – without totally rebuilding the code – that I’m not quite sure of.
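
      For comparison, a headers-only request along the lines of “curl -I” might look like this in Python. This is a sketch under assumptions, not the Uptime implementation: `head_check`, its status labels and the injectable `opener` parameter are hypothetical, and how a real monitor treats each outcome is not stated in this thread.

```python
import socket
import urllib.error
import urllib.request


def head_check(url, timeout=10.0, opener=urllib.request.urlopen):
    """Send one HEAD request and return (status_label, http_code_or_None)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # DNS lookup, TCP connect and TLS all happen inside this one call.
        with opener(request, timeout=timeout) as response:
            return ("up", response.status)
    except urllib.error.HTTPError as exc:
        # The server was reached but answered with an error status.
        return ("http_error", exc.code)
    except urllib.error.URLError as exc:
        # A failed hostname lookup surfaces here, wrapped as the reason.
        if isinstance(exc.reason, socket.gaierror):
            return ("dns_failure", None)
        return ("unreachable", None)
    except OSError:
        # Timeouts and other socket-level errors.
        return ("unreachable", None)
```

      Because name resolution happens inside the single request call, the only place to tell a DNS failure from other failures is the exception that comes back, which is why separating the two would need extra code.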

      But of course I’m passing this feedback to our Uptime developers.

      Best regards,
      Adam

      • Tony G
        • Mr. LetsFixTheWorld

        I was thinking about my suggestions for checking DNS. I might be wrong…

        On one hand, I still think it’s all valid, that we lose debug information by not checking the DNS separately, and distinguishing that from “site” uptime.

        On the other hand, I respect that the service which WPMU DEV is providing is simply to tell us if our site is accessible or not. When DNS is flaky, sites are not accessible – full stop. I can’t argue with that. And I don’t think anyone can claim that DNS errors are so chronic that this deserves much attention.

        So please do consider this as a geeky enhancement, but I don’t think it should be given anything higher than a low priority.

        Gotta be fair. :smirk_cat: