Show errors related to managed backup

I have two sites set up for Automate where plugin updates failed because managed backups “supposedly” weren’t set up. OK, so I set them up (again?) and ran a first backup from the Hub for one of the sites. The managed backup fails. There is no error message. It just shows that I need to create a backup.

I have not yet started to look deeper, and I have only done this for one of the sites. I’ll do my job to discover and remedy the issue. Yes, I will certainly look at server logs, I’ll check the list of WPMU DEV IPs approved through the firewall. But it would be helpful if the managed backup process provided some kind of messaging from the Hub so that I don’t need to go to the site to figure out what happened. I mean, did the backup fail? Did Dev fail to connect in to retrieve the file? Did it connect in and not find the file?

As a part of this, I need to understand what Hub apps do when failures like this occur. It looks like Automate will stop updates if backups fail, and backups will stop running if there is a backup failure. We get a single email when one of these functions is toggled, but it’s easy to lose that in the mass of emails we get from Dev.

Is there a Hub RSS feed to which we can subscribe so that we can automate the parsing of notifications and do our own reporting, warnings, etc.? I mean, if the Hub turns off Automate or Backups, I’d prefer to get a warning in Slack or Skype or SMS. I can handle that part of it, but I need the raw data to parse.

Thanks.

  • Predrag Dubajic
    • Support

    Hey Tony G ,

    I have some questions about your request if you don’t mind, so I could understand this better.

    Are you looking to see the actual errors that occurred during the backup (for example, if the server timed out, you want to see why), or just the stage at which the backup failed: backup process, upload, etc.?
    If it’s the former, I believe this is quite limited by the server itself, as the Snapshot logs depend on the information the server returns when it fails. Unfortunately, in most cases that information doesn’t tell us much about what exactly happened or where the process failed.

    At the moment we don’t have any Hub RSS that I’m aware of but I passed all of the suggestions and explanations to our devs so they can check this out and see if something could be done about this in the future.

    Just a suggestion in the meantime: have you considered using tags or folders in your email (depending on the client), with a specific section for emails that include errors, so you can filter them easily from other reports?

    Best regards,
    Predrag

  • Tony G
    • Mr. LetsFixTheWorld

    Predrag Dubajic – Sincere thanks as always for your kind and informed responses!

    Let’s keep this simple. I want whatever information is immediately available from any event. So if there is an error message to be obtained, I want it. If the backup got an integer return code, I want it. Just give me something other than “failed” from the process that was running the operation, because processes will rarely die with no messaging.

    I don’t know exactly how The Hub does its various operations. It’s a black box. One step toward better reporting is to explain exactly what each process does so that we can learn what we need to do to intercept error messages, until Dev builds in better handling.

    For example:
    – Exactly how does Automate initiate an update of one or all plugins? Is this done by the WPMU DEV plugin? If so, can someone point to the code lines? ( The idea here is “Please don’t make me work too hard in the dark” ) Is there a CLI operation? I might be able to add a capturing of STDERR or STDOUT … and then I’ll wonder why that’s not being done already. Is this being done by wp-cron?
    – Similarly, exactly how does a Managed Backup get triggered in a site from the Hub? Or does the Hub just set scheduling and leave it to the site to upload? Exactly what mechanism is used to retrieve the managed backup file(s)? Is a push initiated from the site server? Is a pull performed from the Hub? Again, is there any capture of STDOUT or STDERR?

    How about blogging some of that, or (OMG!) documenting it so that we have some idea of how things work?

    When The Hub tries to do something on the site and it fails, why isn’t the first action to do a ‘tail’ command on the site errors file? Why can’t the Hub put the site into Debug mode for us so that we can get data from the wp-debug file?

    Why isn’t a hook created for every failure event so that Dev and or the site admin can write code that captures and handles the events? We get nice emails on failure. Is that hardcoded or a hook? Let’s hook more handlers in there. I would dedicate one of my (thank you!) free Hosting accounts to do nothing but process hooks from Hub events … success and failure results. I just need the data.

    Or forget hooks. Just save all available info in a text log somewhere. How tough is that? Rotate logs every 30 days. RSS was just a suggestion as a mechanism to retrieve such data. If I were using one of my sites for error reporting, I’d generate a post for each event … voilà! RSS!
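
    As an illustration only, the “save everything in a text log and rotate it” idea could look like the minimal Python sketch below. Nothing here reflects how the Hub, Automate, or Snapshot actually work; the logger name, field names, file name, and the sample event are all hypothetical, and “rotate every 30 days” is read as “start a new file every 30 days and keep a year of old ones”.

```python
import json
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler


def make_event_logger(path):
    """Build a logger that appends one JSON object per line to `path`.

    Call once per process; the logger name is a hypothetical placeholder.
    """
    logger = logging.getLogger("hub_events_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Assumption: roll over to a new file every 30 days, keep 12 old files.
    handler = TimedRotatingFileHandler(
        path, when="D", interval=30, backupCount=12, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, site, operation, status, detail=""):
    """Record whatever is known about an event, even if it is very little."""
    record = {
        "site": site,
        "operation": operation,
        "status": status,
        "detail": detail,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


# Demo with made-up data: one failed backup event written to a temp file.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
events = make_event_logger(log_path)
log_event(events, "example.test", "managed_backup", "failed",
          "upload stage: connection timed out")
```

    A file in this one-object-per-line shape is easy to tail, to parse, or to turn into a feed, which is all the “raw data” request above really asks for.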

    I always feel like I need to offer obvious daily scenarios for Dev to take back to those wise managers and developers for their multi-year consideration. Look, as a developer I get it: We write code that should never fail. So error handling is always done last (like documentation) and it’s always responsive …. Just handle the stuff that we know goes wrong (when we get field reports) so that stuff that actually does work all the time isn’t burdened with error handling code. But we’re into The Hub 2.0 now and we’re not getting information from The Hub 1.0 that was just as obvious on the day it went production.

    As to filtering inbound email. Yes, certainly this is something that needs to be done. Unfortunately this is a reactive solution that is being forced upon site admins as data consumers because the data provider (Dev) still insists on using email as the only reporting channel. I’m sure you see the issue here – Dev made poor development choices that now compel us to create client/server filters to manage. I would much rather that Dev start thinking more 21st Century. Rather than formatting data into emails for noobs who like such things ( “Strewth mate!” ), offer data in a repository format which we can retrieve and format to our liking … and you can pull from that too to send email to the noobs.

    As you can see, I’m frustrated that the Main audience for WPMU DEV is the entry-level newcomer to site management and WordPress, the tier-1 person who wants and needs all of the basics. The rest of us at tier-2 are way beyond that, and the simplistic solutions (or lack thereof) are inadequate. With the WPMU DEV offering anyone can easily create and maintain a single site, maybe two, five… But with this offering a significant amount of attention must be dedicated to pick up where Dev leaves off. Someone with a Lot of sites can easily find themselves dedicated to chasing down issues with Dev software and services, not on our primary businesses. For those of us with other things to do, where site management isn’t the goal, the sites just support other things we do in the world, the maintenance burden that remains is significant.

    As the most basic example in the context of notes here, something didn’t work. What happened? Just tell us. Why is “I need to know what went wrong” such a revolutionary concept? The result of having no information from Dev automation is that each of us is compelled to figure out our own pattern of problem resolution: The tools designed by Dev and purchased by us client/members, to help save us time and effort, create a new need for procedures to deal with problems when those tools don’t work. We’ve traded one problem for another. That’s not progress.

    I love WPMU DEV. I’ve said before and I’ll say again, our yearly membership is a line item as a cost of business that is never questioned. But when Dev processes fail and my time is required to intervene, now it’s costing a lot more time=money. PLEASE do not be a cost center. Please be less foreground in my life and strive to be more in the background. You’re doing a better job when I need to be here less to explain at length why I need the Hub to “Show errors related to managed backup”.

    Sincere thanks as always and best regards to the fine team at Dev!

  • Predrag Dubajic
    • Support

    Hi Tony G ,

    After my previous message, I discussed this briefly with the devs, and we’ve added notification improvements to the to-do list so they can look into it further and work out the best approach.

    Regarding the questions about Automate and Snapshot, while I do appreciate the details provided, I must be honest and say that I can’t help much with what they do in the background, so I’ll ping our devs and see about having someone savvier provide you with further info :slight_smile:

    Best regards,
    Predrag

  • Leonidas
    • Dev Stuff

    Hi there Tony G, and thanks for the feedback :thumbsup:

    As the Snapshot lead dev, let me focus on just the Snapshot part of the Automate integration since, from your response, I feel like it’s the main (and most undocumented) part of your issues :slight_smile:

    “Why isn’t a hook created for every failure event so that Dev and or the site admin can write code that captures and handles the events?”

    “As the most basic example in the context of notes here, something didn’t work. What happened? Just tell us. Why is ‘I need to know what went wrong’ such a revolutionary concept? The result of having no information from Dev automation is that each of us is compelled to figure out our own pattern of problem resolution”

    Unfortunately, when backups fail with Snapshot V3, the most common culprit is server resources being exhausted. When that happens, the handling we’re able to do post-mortem is minimal, especially considering the fact that V3 backups are plugin-driven.

    That being said, the brand new Snapshot V4 is an API-driven piece of software, which essentially means that we are now better suited to handle failures and I think you’ll be pleased to know that we have quite the plans to enhance our reporting, not only for Automate, but for general Snapshot use too.

    For example, we have already started work on enhancing the email reports sent when Automate fails because of a failed backup attempt, so that they state the exact stage where the backup failed, along with better instructions on what the user can do to make it work the next time.

    Now, because we’re talking about creating backups at the PHP level with a WP plugin (that is, when the site is not on our hosting; for sites on our hosting, Automate triggers our powerful and most-reliable hosting backups), there’s always a chance that a failure will need manual inspection (either by the user or by our support team). But, with v4:
    a. we employ a more sophisticated backup solution that should have significantly greater chances of not facing such issues.
    b. in the event that backups do fail, we are now able to catch and report that in a better and safer way (like mentioned above) and our plans for the future sure do cover that.

    The v4<->Automate integration will come with the v4.0.1 release that’s in the last stage of our QA testing and should be out really really soon.

    Hope that covers some of your concerns :slight_smile:

    Kind regards,
    Leonidas

  • Tony G
    • Mr. LetsFixTheWorld

    Friends – you have done everything I ask – You have listened to the concerns and you consider them in your ongoing development. Thanks.

    I do understand that server-side issues affect the processes. If a backup consumes excessive CPU/RAM on a shared host, the host might have a background process that does a kill -9. The process that is killed doesn’t know what happened – it’s just gone.

    However, knowing that this is how it works, breadcrumbs can be left in the file system or database. Let the backup set a breadcrumb before it does something, then clear the flag when done. When all operations “should” be done, follow-up to see if any of the flags were left un-cleared.
    Example: (Ignore syntax errors for pseudo-code)

    In Snapshot:

    do_backup() {
      clear_all_flags();
      save('backup_running',true);
      save('backup_database',true);
      backupDb();
      save('backup_database',false);
      save('backup_files',true);
      backupFiles();
      save('backup_files',false);
      save('backup_running',false);
      clear_all_flags();
      schedule_follow_up(now);
    }

    Now in Automate cron:

    schedule_backup(time);
    schedule_follow_up(time+30);

    This is scheduled as a checkup after the backup should have completed.

    follow_up() {
      if(follow_up_already_done) { return; }
      if(!get("backup_running")) { return; } // last backup was successful
      // last backup flag was not cleared
      if(get("backup_database")) {
        handle_failed_backup_database(); return;
      }
      if(get("backup_files")) {
        handle_failed_backup_files(); return;
      }
      //... what other flags are still present to indicate something failed?
    }

    So after a backup is run it’s easier to know where it failed … though of course not why it failed. With try/catch blocks around individual operations we can get info about issues that are not catastrophic. The above only applies to the catastrophic issues where the backup fails, the Hub returns from Snapshot back to the list of sites, and we have no idea what happened.
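
    For anyone who wants to try the breadcrumb idea outside of pseudo-code, here is a small runnable Python sketch of it. It is not Snapshot or Automate code: the JSON flag file, the class and function names, and the two stage names are assumptions made for illustration. A `BaseException` subclass stands in for a `kill -9`, since a real kill cannot be caught by the dying process at all.

```python
import json
import os
import tempfile

STAGES = ("database", "files")  # the same two stages as the pseudo-code


class Breadcrumbs:
    """Persist stage flags in a JSON file so they outlive a killed process."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)

    def save(self, key, value):
        flags = self.load()
        flags[key] = value
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(flags, fh)
        os.replace(tmp, self.path)  # swap in the new file in one step

    def clear(self):
        if os.path.exists(self.path):
            os.remove(self.path)


def do_backup(crumbs, steps):
    """Set a flag before each stage and clear it only when the stage is done."""
    crumbs.clear()
    crumbs.save("running", True)
    for stage in STAGES:
        crumbs.save(stage, True)
        try:
            steps[stage]()
        except Exception as exc:
            # Non-catastrophic failure: here we can also record why it failed.
            crumbs.save("error", f"{stage}: {exc}")
            return
        crumbs.save(stage, False)
    crumbs.clear()


def follow_up(crumbs):
    """Run later (e.g. from cron) to see which flags were left behind."""
    flags = crumbs.load()
    if not flags.get("running"):
        return "ok"
    if "error" in flags:
        return f"failed: {flags['error']}"
    for stage in STAGES:
        if flags.get(stage):
            return f"died during {stage} stage"
    return "died between stages"


class SimulatedKill(BaseException):
    """Stand-in for kill -9: not an Exception, so do_backup cannot catch it."""


def killed():
    raise SimulatedKill()


crumbs = Breadcrumbs(os.path.join(tempfile.mkdtemp(), "flags.json"))
try:
    do_backup(crumbs, {"database": lambda: None, "files": killed})
except SimulatedKill:
    pass
result = follow_up(crumbs)  # the "files" flag was never cleared
```

    The follow-up only tells us where the run stopped. The `except` branch is the one place where a reason is available, which matches the distinction above between catastrophic and non-catastrophic failures.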

    I’ll leave further details to you guys…

    Thanks for your time.

  • Leonidas
    • Dev Stuff

    Cool thinking and we have implemented similar tactics in the past for logging stuff (not always doable as there are lots of other factors at play in some cases).

    Now, with the API-driven v4, there won’t be a need to do so, as we’re able to catch the exact stage where servers are failing during backup and report that, but cool thinking nevertheless :wink: