CartonCloud's auto-scaling to meet demand

Posted by:

CartonCloud

When client demand on web applications increases at peak times such as the lead-in to Christmas or Easter, many web services become slow, or in extreme cases, inoperable (think Ticketek when Adele tickets are released).

In the logistics industry, our peak periods are Christmas, Easter, Black Friday, Singles Day and other high-sales periods. At these times volumes of orders, both in the warehouse and on the road, go up significantly, sometimes by 100% or more - this means significantly more traffic and load on the servers handling everything.

Auto-scaling allows systems to deal with this additional traffic automatically, by increasing the number of servers (or power of servers) as demand increases.
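To make the idea concrete, here is a minimal sketch of the kind of rule an auto-scaler can apply. It is a simplified proportional rule for illustration only, not necessarily the policy CartonCloud runs; the function name, the 50% CPU target and the minimum and maximum server counts are all assumptions.

```python
import math


def desired_server_count(current_servers, avg_cpu_percent,
                         target_cpu_percent=50.0,
                         min_servers=2, max_servers=10):
    """Return how many servers are needed to bring average CPU back to target.

    All thresholds here are hypothetical values chosen for the example.
    """
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    # Scale the fleet in proportion to how far the load is from the target.
    raw = current_servers * avg_cpu_percent / target_cpu_percent
    # Round up so we never under-provision, then clamp to the allowed range.
    return max(min_servers, min(max_servers, math.ceil(raw)))


if __name__ == "__main__":
    # Two servers running flat out at 100% CPU: the rule asks for four.
    print(desired_server_count(2, 100.0))
    # Four servers idling at 10% CPU: scale in, but never below the minimum.
    print(desired_server_count(4, 10.0))
```

The clamp matters in both directions: the minimum keeps some redundancy when traffic is quiet, and the maximum caps cost if a metric misbehaves.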

Over the last 6 months we've made major changes under the hood to allow CartonCloud to auto-scale, and we're proud to announce that in the last 2 weeks, our team have set up auto-scaling for both our web-facing and background processing servers (servers doing background tasks like indexing for search, and sending POD emails).

This achieves three things:

Maintains speed under heavy load (traffic), automatically. Previously an engineer had to increase server resources manually, which could take an hour or more: too long to handle a traffic surge.
Increases redundancy. Servers are now disposable: if anything fails on one, it is simply terminated and another is started automatically. No time is spent trying to figure out "what's wrong?" with a particular server; it is destroyed and a replacement is fired up.
Increases reliability. Rather than the application slowing down and then becoming non-responsive, capacity is added to meet demand, preventing end-user issues. Without this, a single server dying (which will normally happen if it gets overloaded) leaves the other servers to fend for themselves, worsening the problem and potentially leading to a complete outage.
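The cascading failure described in the last point can be shown with a toy model. This is only a sketch under simple assumptions (load is spread evenly, and a server fails as soon as its share exceeds its capacity); the function name and the numbers are made up for the example and are not CartonCloud's real figures.

```python
def surviving_servers(total_load, servers, capacity_per_server,
                      replace_failed=False):
    """Return how many servers are running once the load has settled.

    Toy model: load is split evenly, and any overload changes the fleet size.
    """
    alive = servers
    while alive > 0 and total_load / alive > capacity_per_server:
        if replace_failed:
            # Auto-scaling: capacity is added until each server's share fits.
            alive += 1
        else:
            # Fixed fleet: an overloaded server dies and the rest absorb its load.
            alive -= 1
    return alive


if __name__ == "__main__":
    # 450 units of load over 4 servers that can each handle 100.
    print(surviving_servers(450, 4, 100))                       # fixed fleet
    print(surviving_servers(450, 4, 100, replace_failed=True))  # auto-scaled
```

In the fixed-fleet case each failure pushes more load onto the survivors, so the loop only stops when nothing is left. With capacity being added instead, the same surge settles at five servers.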

Above all else, we're focused heavily on application speed and reliability. We know that any slowdown on a page (even just 0.2 of a second) makes a huge difference to how quickly our users can get their work done (this is why we built Rapid Packing, for example!). We continuously monitor our servers and end-user load times, and watch closely for both reported and unreported errors to ensure CartonCloud is up, running and very quick to load.

Technical Information for those interested:

Back in early 2015, when CartonCloud was just an in-house system used by Roving Logistics, everything ran on a single server: web requests, background tasks, the lot. This worked fine as request volumes were easily predictable, and the server was "over-powered" for the task at hand.

In late 2015, with several clients using the system, we shifted to AWS and at the same time spread the infrastructure across two web servers, a background (processing) server, a file server, plus an RDS database. It was a fairly heavy step up in terms of cost (5 small servers rather than 1 medium-sized server), but it gave us independent control over the resources for each of these pieces. Capistrano was used as the deployment tool and Puppet for server management.

The structure then largely remained the same (web server numbers and sizes were continually increased to meet demand) until the last 6 months, in which the following changes were made:

Deployment is now performed directly from Bamboo.
Servers are no longer configured by Puppet but simply cloned from an AMI base.
Background processing jobs are now triggered via API rather than by an in-built cron schedule.
Both web-facing and background-processing servers have been converted to auto-scaling groups.
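The post does not say why the cron schedule was dropped, but one common reason in a cloned, auto-scaled fleet is that every copy of a server carries the same crontab, so a scheduled job fires once per server. The sketch below illustrates that assumption only; the Server and LoadBalancer classes, the round-robin routing and the job name are hypothetical and do not describe CartonCloud's actual implementation.

```python
import itertools


class Server:
    """A stand-in for one cloned background-processing server."""

    def __init__(self, name):
        self.name = name
        self.runs = []

    def run_job(self, job):
        self.runs.append(job)


def cron_tick(servers, job):
    """Built-in cron: each clone has the same schedule, so all of them fire."""
    for server in servers:
        server.run_job(job)


class LoadBalancer:
    """Routes each API request to exactly one server (simple round robin)."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def post_job(self, job):
        server = next(self._cycle)
        server.run_job(job)
        return server.name


if __name__ == "__main__":
    fleet = [Server(f"bg-{i}") for i in range(3)]
    cron_tick(fleet, "send-pod-emails")
    print(sum(len(s.runs) for s in fleet))  # one run per server

    fleet = [Server(f"bg-{i}") for i in range(3)]
    LoadBalancer(fleet).post_job("send-pod-emails")
    print(sum(len(s.runs) for s in fleet))  # a single run in total
```

Under this model the number of job runs no longer depends on how many servers the auto-scaling group happens to be running at the time.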
