Social network: Disaster recovery case study

This case study describes a real event that took place in Q1 2017.
Some names and data points have been redacted for privacy and security.

We received a phone call early in the morning asking if we could help move a site to a different hosting provider, as the current one was struggling with capacity. We set up a meeting for later in the week, but three hours later the client told us they were in urgent trouble and needed our help ASAP.

Publicity: Not always a good thing!

This social network had just received a massive PR boost: a celebrity endorsement had sent thousands of people to the website wanting to sign up. Unfortunately, this influx of people in such a short timeframe crashed the server and caused data loss, and of course there were no backups anywhere. The web hosts had done as much as they could, but the site was still offline with a corrupt database.

Rescue stations

Our engineers got access to the server and quickly identified the broken systems. We started low-level repairs of the database and began to look at the software architecture. From the server logs and the software code, it was clear that the social network was a victim of its own success. New users signing up had overwhelmed the site's monolithic, single-threaded codebase, which wasn't built to scale with an increase in load.

We began patching the software so that new user registrations would work again, and started deploying replacement servers on a new cloud platform. We deployed the newly fixed site onto the new hardware and managed to recover 100% of user data from the old system, all in a little over 48 hours.

Debriefing

Once everything was back up and running, we set up a debrief meeting so the client could understand exactly what went wrong and how to avoid the same problem at the re-launch. We had set up monitoring services so that we could evaluate the performance of the replacement servers. They handled the re-launch of the site without crashing, which gave the client peace of mind for the future.

How Can Switch Systems Help You?

Pinpointing problems, diagnosing bottlenecks and scaling your project to new heights are just some of the things our engineers love to do. Why not give us a call?