Engineering State of the Union: October 13, 2018


It would take a long time to go into detail about all of the poor decisions that got us into the position we’re in today. Instead, we want to highlight the major technology problems that prevent us from fixing issues quickly, and explain how we’re working to improve those areas.

These problems fall into three major areas: DevOps, databases, and Rails code.

What went wrong?

DevOps: Server configuration hurting us, not helping us.

With regard to DevOps, our current hosting provider was chosen for emotional reasons and not for its strengths. The platform is so old that the provider is shutting it down by the end of May 2019. We’re unable to quickly make changes to servers, diagnose weird networking problems, or use any of the basic functionality commonly available from other cloud providers. All servers were built by hand with no documentation, leaving our current engineers to copy drive images. This means that any configuration change has to be made manually on each server. (We have well over 100.)

For instance, if we want to find an error that happened during a specific request, we have to log into each server that could have handled the request and search its logs. If we want to add log aggregation, we have to manually install all the required dependencies on each server and hope they were all built the same way. (Spoiler alert: they weren’t.) What seems like a simple bug investigation usually leads down a rabbit hole created by all the missing DevOps best practices.

Databases: Poor design impacting performance

The database design worked great when 50 stores used CrystalCommerce. Many decisions were made to add new features quickly, without understanding how those features would behave with millions of requests relying on them. One example is the way we identify whether a product is in stock. The original code simply summed the quantities of all variants of a product, which prevents the database from using indexes. We recently refactored this, and it made this one count 1000x faster. Unfortunately, not all design choices are this easy to fix, which leads into the issues with the Rails code.
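To make the in-stock example concrete, here is a minimal sketch in plain Ruby. It is an in-memory analogy, not our production code or the actual refactor: `Variant`, `RowCounter`, and both method names are hypothetical, and the real work happens inside the database. It only illustrates why asking "does any variant have stock?" can be cheaper than adding up every quantity: the sum has to read every row, while an existence check can stop at the first match.

```ruby
# Hypothetical stand-in for a row in a variants table.
Variant = Struct.new(:product_id, :quantity)

# Wraps a list of rows and counts how many of them get read.
class RowCounter
  include Enumerable

  attr_reader :rows_read

  def initialize(rows)
    @rows = rows
    @rows_read = 0
  end

  # Yields each row and records that it was read.
  def each
    @rows.each do |row|
      @rows_read += 1
      yield row
    end
  end
end

# The approach described above: add up every variant's quantity, then compare.
def in_stock_by_sum?(variants)
  variants.sum(&:quantity) > 0
end

# An alternative: stop as soon as one variant with stock is found.
def in_stock_by_existence?(variants)
  variants.any? { |variant| variant.quantity > 0 }
end

# 1,000 illustrative variants of one product; only the first has stock.
rows = Array.new(1_000) { |i| Variant.new(1, i.zero? ? 5 : 0) }

sum_scan = RowCounter.new(rows)
exists_scan = RowCounter.new(rows)

puts "sum:       in stock = #{in_stock_by_sum?(sum_scan)}, rows read = #{sum_scan.rows_read}"
puts "existence: in stock = #{in_stock_by_existence?(exists_scan)}, rows read = #{exists_scan.rows_read}"
```

Both methods give the same answer, but the summing version always reads every variant. How much a database gains from the second form depends on its indexes and query planner, so treat the row counts here as an illustration rather than a benchmark.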

Rails Code: Cutting corners and not updating old code

The CrystalCommerce core application has grown from the same code base since September 12th, 2008. Since that day, new product development has been prioritized over refactoring and cleaning up old code. On top of that, many mistakes in the basics of building an application were made and then copied to over 1000 different places in the code. For instance, view and template code should never directly access database methods, but there are more than 1000 places where it does. If we wanted to change how categories are handled in the database, we would have to fix every one of those places so that it gets that information through a controller instead. There are many great tools for finding errors and bottlenecks, but none of them work properly once someone has altered the core structure of Rails. This multiplies the time it takes to diagnose an issue or error by 10. Almost every action on the site hits the database hundreds if not thousands of times when it should be fewer than ten, and fixing that would often break the 1000 other places in the code that rely on it working the old way.
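The difference between a template that queries the database itself and one that is handed its data by a controller can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions, not code from the core application: `FakeDatabase` is a hypothetical stand-in that only counts queries, and the product and category data are made up.

```ruby
# Hypothetical stand-in for a database that counts every query it receives.
class FakeDatabase
  attr_reader :query_count

  def initialize(categories, products)
    @categories = categories # { category_id => name }
    @products = products     # [{ name: ..., category_id: ... }, ...]
    @query_count = 0
  end

  # One query that returns every product.
  def all_products
    @query_count += 1
    @products
  end

  # One query per category lookup.
  def category_name(id)
    @query_count += 1
    @categories.fetch(id)
  end

  # One query that returns the names for a whole list of category ids.
  def category_names(ids)
    @query_count += 1
    @categories.slice(*ids)
  end
end

# Anti-pattern: the "template" looks up each category itself, one query per product.
def render_with_queries_in_view(db)
  db.all_products.map do |product|
    "#{product[:name]} (#{db.category_name(product[:category_id])})"
  end
end

# Controller-style: fetch everything up front in a fixed number of queries.
def load_page_data(db)
  products = db.all_products
  names = db.category_names(products.map { |product| product[:category_id] }.uniq)
  { products: products, category_names: names }
end

# The "template" only formats data it was given and never touches the database.
def render_from_data(data)
  data[:products].map do |product|
    "#{product[:name]} (#{data[:category_names].fetch(product[:category_id])})"
  end
end

categories = { 1 => "Category A", 2 => "Category B" }
products = Array.new(100) { |i| { name: "Product #{i}", category_id: (i % 2) + 1 } }

slow_db = FakeDatabase.new(categories, products)
render_with_queries_in_view(slow_db)
puts "queries from the view:       #{slow_db.query_count}" # 101

fast_db = FakeDatabase.new(categories, products)
render_from_data(load_page_data(fast_db))
puts "queries from the controller: #{fast_db.query_count}" # 2
```

Both versions render the same lines. The second also keeps all database access in one place, so a change to how categories are stored would only have to be made in `load_page_data` rather than in every template.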

What we’re doing:

We have a lot of activity underway to address these issues, from our migration to AWS to key hires for our engineering department. Once the AWS move is complete, we will fix those 1000 places where the code directly accesses the database so that we can make permanent, long-term fixes to the underlying data structure. Part of that will involve moving functionality out of the ancient ‘core’ project and into new services (we’ve already begun that process with market prices).

We’re essentially changing the engine while flying the airplane, and in a perfect world we would have those key hires already made so that emergency fixes don’t interrupt the work to solve our underlying problems.

Thank you for your patience and for your patronage of CrystalCommerce.

For future updates:
System Status
Real-time information on system status. Here you’ll find live and historical data on system performance. If there are any interruptions in service, a note will be posted here.

Product News
Find CrystalCommerce emergency news and product updates here: which fixes have been released, which features we’re rolling out, which features are in development, and what is planned for future releases.