The case of the disappearing code

At Creatuity, we encourage our team members to work through an after action review process not just at the end of projects and goals, but also any time anything unexpected happens. Today we’re sharing a narrative from our Dev Team Lead, Tomasz Szmyt, of one such unexpected event that happened a couple of years ago. In a departure from our usual content, this story is really focused on the developers in our audience - see if you can guess what happened before you get to the “How did all the code disappear?” section. It’s a great example of an extreme edge case that it takes years of experience and work on hundreds of Magento projects to discover and then learn from.

Here’s the story in Tomasz’s own words…

On a beautiful late summer day, I got a message on Slack from our deployment manager:

“– Hey Tomasz, do you have a minute?”

I feared that something terrible had happened, because that's how we message each other at Creatuity when something is going on that we need help with. And this time, I wasn't mistaken.

“– Something strange happened, I synchronized changed files to the new release directory, but files are missing in the current release directory. They’re placed one level up in the filesystem tree.”

I nearly had a heart attack, thinking that all of the code had vanished from the production server. What's more, we may have lost media - product images, etc. - which would be even worse. We strictly use Git to track code at Creatuity, so I knew the code itself was safe, but media backups aren't done very often (it depends on how frequently a particular website is updated by our clients - usually once or twice per week, sometimes more frequently).

I instantly logged into the client's production server and confirmed that all of the production code was indeed missing.

“This client utilizes a multi-server cluster - maybe web2 will have the lost files”, I thought. I instantly jumped to web2, but the code wasn't there either. “We're screwed” - I was almost sure that we'd need to use our media backup from 2 or 3 days earlier at best. But then I remembered that media and crucial configuration files should be available through NFS - and bingo! There they were.

What rescued us from total panic and an extended site outage? rsync's --links option copies symlinks as symlinks and never follows them, so it never touches the files or directories a symlink points to. So while the symlinks our deployments depend on were destroyed, the media and configuration files located on the external NFS shares were untouched. Whew!

How did all the code disappear?

While the deployment scripts were executing, just before syncing new files to the new release directory, our VPN reconnected (probably because of network problems) and the SSH connection to the production server was re-established. This disconnection and reconnection caused all session variables, such as the deployment version (e.g. “1.2.3”), to go missing. So rsync used “/var/www/releases/” as its target instead of “/var/www/releases/1.2.3” as it should have, causing all releases to vanish.
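A hedged reconstruction of the failure mode (the variable and path names are illustrative, not our actual deployment script): when the session variable holding the release version disappears, the shell expands it to nothing, and the target path silently collapses to the parent releases directory.

```shell
#!/bin/sh
# Illustrative only: how a lost session variable silently changes an rsync target.
RELEASE_VERSION="1.2.3"
echo "target: /var/www/releases/${RELEASE_VERSION}"
# -> target: /var/www/releases/1.2.3

# After the SSH session was re-established, the variable was simply gone:
unset RELEASE_VERSION
echo "target: /var/www/releases/${RELEASE_VERSION}"
# -> target: /var/www/releases/   (the parent directory - no error, no warning)
```

Nothing fails here: the command is still syntactically valid, which is what makes this failure mode so dangerous.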

Someone might ask - why aren't you using dry runs?

But we are using them! That didn't prevent this issue because the --dry-run happened just before the VPN reconnected - at the worst possible time.

Additionally, this web cluster automatically syncs files in “/var/www” to all web nodes, so all of this was instantly applied to the web2 server too.

How did we quickly recover from this?

We recreated a proper release directory and applied all of the code from our Git repository there (we normally apply changes on top of the production files so we don't lose any hotfixes or changes made by external teams).

After that, we recreated the proper symlinks for media and configuration files.

Once that was done, the deployment went smoothly. We were absolutely stressed when visiting the site afterwards to perform basic QA, but luckily (or, better said, thanks to our infrastructure) everything was working as intended.

How will we mitigate such problems in the future?

Having the VPN reconnect itself automatically is neat. We don’t want to cut off that functionality.

Instead, we’re storing full paths to directories in variables. 

This guarantees that rsync (and the rest of the commands used during deployment) will fail with visible errors printed directly to the screen instead of silently using the wrong target directory during a deployment.
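A sketch of how that fail-loudly behavior can be achieved in a POSIX shell script (an assumption on our part - the variable name and error message are hypothetical, and `set -u` / `${VAR:?}` are one common way to enforce it, not necessarily the exact mechanism in our scripts): an unset or empty path variable aborts the command with a visible error instead of expanding to the parent directory.

```shell
#!/bin/sh
# Sketch: make a missing path variable fail loudly instead of silently.
set -u   # referencing an unset variable is now a fatal, visible error

# Full path stored up front in a variable (illustrative value):
RELEASE_DIR="/var/www/releases/1.2.3"

# ${VAR:?msg} additionally rejects empty values, not just unset ones,
# so the deployment stops before rsync ever runs with a bad target.
deploy_target="${RELEASE_DIR:?RELEASE_DIR is empty - aborting deployment}"
echo "deploying to $deploy_target"
```

If `RELEASE_DIR` were lost the way our session variables were, the script would print the error message and exit non-zero instead of syncing into “/var/www/releases/”.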
