Razor sharp view performance through New Relic and Crate

EDWIN VAN LEEUWEN

Our Price Check product gives farmers detailed benchmarking information for their purchasing. An individual farmer would have over 100 direct products to review, with reference to pricing on another 4000 alternative products; and the system presents the results from tens of thousands of underlying data points (and very rapidly growing). These results can then be filtered, searched and sorted.

All this was making for some pretty complex queries - and performance issues were starting to appear. We always knew that as the service had grown, some of the API calls could potentially become problematic. A single slow or non-responsive call on the platform can affect the overall response time, leading to bad customer experience. It also prevented us improving our feature set for the product, which would have compounded the performance issues.

Traditionally, the engineering team would have spent significant time investigating and troubleshooting these issues - and perhaps never really find all the loose ends to optimise the service.

At Yagro we looked for tools instead. Last month we adopted New Relic for monitoring the performance of our infrastructure and applications. The service gives us direct visibility of what's going through the system, where the bottlenecks are, and allows us to drill down into individual queries and calls to isolate problem areas. This has provided some pretty interesting insights into what is happening on our platform.

Our goal is to get all API calls on the service below 200ms. We could now see individual queries taking more that 20 seconds. Pretty alarming. The following charts show two examples of views using the same query, showing peaks in demand as the calls are made. The different colours are the different stages the view goes though to retrieve and present the data.

Having rapidly identified the source of slow performance, we investigated the underlying query and determined that:

The query needed to be complex.
It was not going to get much quicker.
Caching would not solve the initial case.
The response of the API was also quite complex and large.
It was working on data that is periodically uploaded.

The last point was very interesting and is how we came up with a solution. Rather than query and generate the data every time a user opens the view, we can process it a single time when data is uploaded on the system. We can also extract key fields into a single cached table, for rapid search, sorting, and querying.

To do this, we use Crate database alongside our Postgres database. PostgreSQL is great for storing our dynamic data, but we are doing more and more with time series data, where transactions are less important - so we had already implemented CrateDB for other projects. It was easy to setup and run in our Kubernetes cluster, and with its SQL dialect, we can talk to in ways that we are used to. But it also has many features of NoSQL databases such dynamic schemas - and will scale very well. Finally for this project, Crate has the important ability to index and store arrays and objects, and do full text searches of fields. And it's fast. A lot of this could have been done only with Postgres, but we prefer the agility and performance through Crate and think over time it will scale better.

So we now use Crate to store the view content, generated from the data in Postgres. For the user, the view can therefore simply pull from pre-generated content.

The last part of in our optimisation was to cache the response of the view within Redis.

Did we manage it? Let's go back to our friend New Relic:

The vertical bar near 20:00 shows the release to the cluster. After this, the view barely registers on this chart thanks to our improvements. We have to zoom in to see it doing things.

The average response time for this view since release is 104 ms! And once cached on Redis, we have it down to 50ms.

In summary, we have carried out a dramatic performance improvement to a key part of our web service, aided by tools such as New Relic and Crate. We are very happy with the results and will continue this rapid diagnostic approach to make further improvements. We hope our users are also pleased with the results.