Hi I’m Russell, Software Development Team Lead at AO.com.
In this post, I’m going to share a recent rags-to-riches story (on an application-performance-theme) that my team were responsible for. I’ll talk a little about the application we started with, the challenges we had, how we improved things (a 40-fold performance boost!), and some next steps that we could take to further optimise things.
Where We Started
Like many IT organisations, we’re heavily invested in service-oriented (and to some extent micro-service) architecture. When a user visits our sales website, they’re not dealing with one monolithic application, instead our architecture is something like this:
We have a caching tier, we have a security and stability tier, we have a handful of thin UI-oriented applications, and a whole bunch of services underneath that drive discrete parts of the various UIs, for example product reviews, media associated with each product, faceted catalogue navigation, etc.
Now one of these services, specifically our reviews service, had been giving us some performance headaches. When I say headaches, I’m being a little kind – we would ideally like a service to have response times around the 30-60 millisecond mark, our reviews service was frequently hitting an eye-watering 1,000 milliseconds! When we need to determine how performant something needs to be, of course we’d consider what load the component is expected to handle. In this case, the load historically wasn’t especially high, low single-figure percentages of requests to our website went anywhere near our reviews service. (Of course it’s not great that a particular part of our site is dramatically slower than the core journey, but it was somewhat under the radar.)
Then one day folks started scraping our reviews data. And not in nice easy-to-block ways, but in bursts lasting up to an hour. Reviews service traffic would increase ten-fold, certainly enough that it was throwing the average performance across our entire site, and tripping all sorts of alert conditions. No longer under the radar, the time had come to do something.
Big, Normalised SQL Models With Heavy Load
Our reviews service was built many moons ago, on what was then an entirely standard architecture for us. We’d at once build an internal UI to manage the data, then we’d build a service to expose the data to our any front-end-applications that needed it. Some folks reading this might recognise straight away we’ve done something suspicious, we’ve build two different use-cases on top of a single service and single data store, but more on that in a moment.
Our internal management app is all about a flexible view of our data. Multiple ways of displaying things, sorting things, querying them, displaying slices of data for UI widgets such as drop-down boxes. This flexibility tends to fit well with a traditional fully-normalised SQL schema. It’s worth noting our internal app is also very low load (compared to our external products).
Our website-facing app however is almost the complete opposite – it’s absolutely not about flexibility, and instead is all about high load. This is where a fully-normalised SQL schema is a problem. It’s slow, REALLY slow, sticking together all the data from a dozen different tables every time we’re generating the data for a view.
With such different characteristics between our two reviews use cases, we clearly need two different ways of accessing the data. So here’s what we did.
De-Normalised View Stores to the Rescue!
First, we decided that what we had for our internal management app worked just fine. The app is low traffic, it’s internal-only, it needs that flexible normalised view, it can cope with just plain-ol’ mediocre performance.
Our customer-facing component on the other hand definitely needed something better suited. So our next choice was driven by the fact that product reviews are modified a heck of a lot less than they’re displayed. In our current architecture, we’re de-normalising our review model in our SQL, for every single review, every single one of the 100s of times we display it. Why not instead de-normalise each review just the one or two times when the review is approved or updated?
Of course we moved the de-normalising of data to our internal app. Every time something happens in the internal app that affects the visibility or content of a customer-facing review, the internal app makes the necessary changes to the SQL normalised model, and crucially now has the extra work of generating a flattened, de-normalised version of that same review for consumption by our customer-facing service. We traded customer experience improvements (hard to understate how important this is for us) for additional storage space (cheap tending towards free)
Our internal app is now generating 1,000s of these models, and throwing them all into a big, flat view store. Right now this is just a big ol’ SQL table with a bunch of JSON blobs stored in it.
The Outcome
This one’s short and sweet. Our average performance for serving reviews on our site improved almost 40-fold, going from 1,000ms to an average of 30ms. SQL server load has seen a corresponding drop off, and the additional load on our internal moderation system has gone completely unnoticed. The service response before and after was something like the following:
What’s Next?
Like all good software architects, we’ve deferred as much effort to solving the problem at hand as possible. We’ve not rolled out any new physical or virtual infrastructure, we’ve not invested in any new tooling or technologies, we’ve not rolled out any new software components.
We have solved the problem at hand of course, but there are some pinch points that could come and bite us. For example, what if we need to start generating multiple different views for multiple consumers from our normalised SQL model? In this case we could look at using messaging-based architecture, and creating separate components that are responsible for maintaining the view stores and serving the views for each consumer.
Or what if we need to start offering more advanced filtering and sorting of review data? It would be tempting to see how far our single table-of-BLOBs could take us by adding additional columns just for sorting and filtering, but in truth we’d be quickly looking at more flexible search-and-faceting tools such as Lucene, ElasticSearch, etc.