Recent Performance/Stability Issues

zachbuttram Platform Engineering

Hello from the CrystalCommerce engineering team! We know there have been some stability and performance issues lately, and we want you to know that we hear you. We’re hard at work on improvements that should help, and I’m going to detail them here.

There are two major issues we’ve identified that are affecting us right now. The first involves AWS autoscaling and load balancing. The default AWS autoscaling policies are not sufficient to anticipate the type of demand our systems receive. To combat this, we already use additional custom autoscaling rules to accommodate high-traffic occasions and anticipate demand more effectively. Unfortunately, we’ve reached a breaking point where we need to reevaluate and tune those custom rules to better accommodate unanticipated increases in demand, and we’ve already started that process.

In addition, the default AWS load balancer configuration is not properly detecting misbehaving instances of some of our apps, so some fraction of requests is sent to a server that will no longer respond properly. We’ve already identified a way to improve this and are working on implementing it in our AWS configuration.
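To make the load balancer problem concrete, here is a minimal sketch of how consecutive-result health checking generally works. It is an illustration only, not our actual AWS configuration: the `HealthTracker` class, the `deep_check` helper, and the threshold values are all hypothetical names and numbers chosen for the example. The idea is that a check which only confirms the server answers can miss an app that is up but unable to do real work, so the check should also cover what the app depends on.

```python
from dataclasses import dataclass


def deep_check(status_code: int, dependencies_ok: bool) -> bool:
    """Pass only if the app answered 200 AND its dependencies respond.

    Hypothetical helper: a shallow check would look at status_code alone.
    """
    return status_code == 200 and dependencies_ok


@dataclass
class HealthTracker:
    """Tracks consecutive health-check results for a single instance."""
    unhealthy_threshold: int = 2  # assumed: failures in a row before removal
    healthy_threshold: int = 3    # assumed: successes in a row before re-adding
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return whether the instance gets traffic."""
        if check_passed:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

With these assumed thresholds, an instance is pulled out of rotation after two failed checks in a row and returns only after three passing checks in a row, which avoids flapping on a single slow response.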

Separately, we’ve identified another issue with database performance. Our current infrastructure already includes multiple database servers, with client databases spread across them to reduce load. Even so, factors including increased demand and background job processing have caused performance issues to emerge around certain database queries. To improve stability and let us bring you new features without compromising on performance, we’ve started down the road of adding read-only replica databases to our infrastructure. This will allow us to separate database reads from writes: for each client, all writes will go to a master database, and most reads will go to a connected read-only slave database that is replicated from the master, spreading our database load out even further. I say most reads because some reads rely on newly inserted data on the master server and must still be sent there, but these will be only a fraction of the total reads we currently ask of our primary database servers.
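The read/write split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the `ReadWriteRouter` class, the `pin_seconds` window, and the `"master"`/`"replica"` labels are hypothetical. One common way to handle reads that depend on newly inserted data is to keep a client's reads on the master for a short window after that client writes, which is the approach assumed here.

```python
import time


class ReadWriteRouter:
    """Decides which database a query should go to for a given client."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        # Assumed value: how long after a write a client's reads stay on the master.
        self.pin_seconds = pin_seconds
        self._clock = clock
        self._last_write = {}  # client_id -> time of that client's most recent write

    def route(self, client_id, is_write):
        now = self._clock()
        if is_write:
            # All writes go to the master.
            self._last_write[client_id] = now
            return "master"
        last = self._last_write.get(client_id)
        if last is not None and now - last < self.pin_seconds:
            # The replica may not have this client's newest rows yet.
            return "master"
        # Everything else can be served by the read-only replica.
        return "replica"
```

For example, a client that has not written recently is routed to the replica, while a client that just saved a change reads from the master until the window expires. A time window is only one option; an application could instead mark specific queries as needing the master.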

We’re looking forward to bringing you these improvements as soon as we’re able to. We’re also adding more operations expertise to our team to help with issues like this now and in the future. As engineers, we’re never happy to see downtime, because we know how it can affect you and your business, so we’re going to keep working to improve our platform’s stability until we can all feel better about it.
