I recently visited a company to talk with them about how I might be able to help them with their IT and development needs. They’re an international operation and they produce components used in industrial applications.
Their website is a rich e-commerce platform and it provides informational resources to their clients. They also have a substantial set of applications that folks within the company use to provide content, product information, etc. for the public-facing site. So I was surprised to find that they employed just a handful of developers.
The developers appear to be very disciplined. They’ve adopted a version control system and a development-staging-production model of deployment. The production server is hosted in a top-tier, actively managed data-center and the hosting company actively maintains a duplicate server that they can quickly spin-up should the main server ever have a problem that prevents it from operating properly. What I’ve seen of the various pieces of the system seems relatively responsive.
They admit that the system is overly complex and that they’re not actively monitoring it. They also admit that their chief mechanism for determining when the system is in trouble is based on the complaints they receive from their customers.
Are there some red flags? Yes. Does it look terribly different from most of the other companies I’ve come across? No, not at all.
First off, I agree with them on the problems they know they have. The developers agree that the system has a lot of “moving parts” and that they’re relying on their users to let them know when it’s having a problem.
Their understanding of these problems will likely move them toward an active monitoring system and, possibly, some under-the-hood reorganization that will tend to incrementally simplify the system.
My goal is to help this company move toward a less uncertain future. I see a number of potential issues, not least of which is the backlog of wishlist items for which they’ve been attempting to locate additional developer resources. I’ll be making every effort to help them see the shortcomings of their current system and address them.
Their development platform is pretty old-school. I’m not trying to start a flame war and, honestly, the system is reasonably responsive, at least on the user-facing side, so I’ll refrain from naming it. That said, I firmly believe that Ruby and/or Python become force multipliers in the hands of talented developers when compared with most of the popular web-development languages that came before them. A move in this direction would greatly reduce the developer time required for the more complex tasks and make it easier to hire developers as the need arises.
I’ll backpedal a bit on this and concede that they can and should make strategic shifts in this direction and not attempt a ground-up rewrite.
Automated testing is a powerful tool for taming the problem of system complexity. While the developers are doing some manual, checklist-style testing that centers around recent development efforts, their users have effectively become their de facto testing apparatus.
There is no question about the room for improvement here. The developers conceded that their choice of platform makes automated testing more difficult than it might be for others, and I absolutely agree. But it’s far from impossible, and as users come to expect high levels of availability from the applications they use, it will only grow in importance.
Currently, the whole system is front-end code tied directly to database-access APIs. There are a number of tools (Selenium, Capybara, Splinter, etc.) for testing the front-end code, and an effective minimal strategy might be to simply author new tests as problems arise. Another approach to improving the testability of the system would be to separate the front end from the data layer via simple APIs of its own. Those APIs are easy to test directly and, because they are far easier to mock in a testing environment than direct database access, decoupling the front-end code from the database instantly makes the front end more testable as well.
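As a minimal sketch of what that decoupling buys you: the front end depends on a small store interface rather than on the database, and a drop-in fake stands in for the database under test. All the names here (ProductStore, render_product_row, the sample SKU) are hypothetical, not the company’s actual code.

```python
from dataclasses import dataclass

@dataclass
class Product:
    sku: str
    name: str
    price_cents: int

class ProductStore:
    """Thin data-layer API; the front end talks only to this interface."""
    def get(self, sku: str) -> Product:
        raise NotImplementedError  # the real version would query the database

class InMemoryProductStore(ProductStore):
    """Drop-in fake for tests -- no database required."""
    def __init__(self, products):
        self._products = {p.sku: p for p in products}

    def get(self, sku: str) -> Product:
        return self._products[sku]

def render_product_row(store: ProductStore, sku: str) -> str:
    """Front-end code depends on the store interface, not on SQL."""
    p = store.get(sku)
    return f"{p.name} (${p.price_cents / 100:.2f})"

# In a test, the fake stands in for the whole data layer:
store = InMemoryProductStore([Product("W-100", "Widget", 1999)])
assert render_product_row(store, "W-100") == "Widget ($19.99)"
```

The same pattern applies whether the “store” wraps SQL, a REST service, or a legacy system: once the front end depends on an interface, the expensive part is mockable.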
This is where the potential to go off the rails is tremendous. Without active monitoring, the developers (who double as administrators) must first work to properly identify the issue when users report problems. But that is really just the tip of what really is a massive iceberg of risk.
The rest of the story centers around recovery and the ghoulish what-ifs that keep people like me up at night. The biggest problems facing this company when things go really wrong are its dependencies: on 3rd-party hosting companies, on the knowledge held by a small number of key personnel, and on properly maintained documentation.
The measures mitigating the risk of significant downtime and data loss during disaster recovery are:
- a backup of the production server which is maintained in parity by the 3rd-party hosting company
- a staging server which is kept in the same state as the production server
- backups of data and code
- developers with end-to-end knowledge of the entire system
The problem with backup servers (both production and staging) is that they’re rarely tested. In the case of the hot spare maintained at the hosting company, it is only rumored to exist. The staging server has never seen a hit from a user outside the company’s internal network. That it could be made to stand in for the production server should the need arise is arguably correct, but the amount of time and effort required to make it function and perform to the expectations of the production system’s users is an open question.
The problem with documentation is that it is never up-to-date. There are a variety of reasons for this, but chief among them are: it’s rarely needed, it’s updated infrequently, it represents the state of a changing system at a single point in time, and it is rarely so complete as to be an authoritative reference for a ground-up restoration of a broken system.
The problem with depending on developers with end-to-end knowledge of a system is that, from time to time, they leave. They get new jobs, get sick, die, and go on vacations. They also forget. Even the smart ones. And when they do, they rely on the documentation.
The problem with backups in systems such as these is that they require manual intervention to be of any use. And that means developers and documentation.
The “long tail” scenario here is this: the efforts this company has made to prevent catastrophe are effective, and they’ve almost certainly worked in the past, but they are by no means infallible. When things go really wrong, full recovery can easily require weeks or even months. The probability of disaster has been managed to some extent, but, in terms of lost sales, tied-up IT resources, and hamstrung staff who depend on the proper functioning of the system, the potential for loss in the event of a disaster remains high.
A targeted attack by ne’er-do-wells, inadvertently destructive code, and plain old negligence are just three of a long list of problems that could cripple this company’s ability to do business. Add to that the possibility of staff unavailability due to illness, vacation, or simply poaching by a competitor, and a simple outage can become a serious problem for company shareholders.
As I mentioned above, this company could get a lot of mileage out of some modest investments in active monitoring and automated testing.
Monitoring. For monitoring, we like Nagios. There are several other options in this area too, but the main idea is to get as complete a picture of the system’s problems as quickly as possible. In the best of circumstances, a good monitoring solution will alert responsible parties to an issue before the users have had an occasion to notice. This is low-hanging fruit.
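To make the idea concrete, here is a sketch of a Nagios-style check. Nagios plugins communicate through exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); the URL and the thresholds below are placeholders, not a recommendation for this company’s setup.

```python
import time
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(elapsed_s: float, warn_s: float = 2.0, crit_s: float = 5.0) -> int:
    """Map a measured response time onto a Nagios status code."""
    if elapsed_s >= crit_s:
        return CRITICAL
    if elapsed_s >= warn_s:
        return WARNING
    return OK

def check_url(url: str):
    """Fetch a URL; return (status_code, human-readable message)."""
    try:
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            if resp.status != 200:
                return CRITICAL, f"HTTP {resp.status}"
        elapsed = time.monotonic() - start
        return classify(elapsed), f"responded in {elapsed:.2f}s"
    except Exception as exc:
        return CRITICAL, str(exc)

# Invoked by the scheduler, a plugin would print the message and exit
# with the status code, e.g.:
#   status, message = check_url("https://www.example.com/")
#   print(message); raise SystemExit(status)
```

A handful of checks like this, pointed at the e-commerce front end and the content applications, would replace customer complaints as the primary alerting mechanism.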
Automated testing. This would be of tremendous help to the developers, both in the day-to-day course of their development cycle and as a recovery tool to ensure that the systems are functioning as they should in a post-incident scenario.
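One cheap starting point is to translate the team’s existing manual checklist directly into code: each checklist item becomes a tiny function with an assertion, and the whole suite runs after every deploy and after any recovery. The check names and logic below are hypothetical stand-ins.

```python
def run_smoke_suite(checks):
    """Run each named check; return the names of the checks that failed."""
    failures = []
    for name, check in checks:
        try:
            check()
        except AssertionError:
            failures.append(name)
    return failures

# Each manual checklist item becomes a small function with an assertion.
def check_price_math():
    # Stand-in for a real business rule the checklist would verify.
    assert round(1999 / 100, 2) == 19.99

def check_catalog_nonempty():
    catalog = ["W-100"]  # the real version would hit the catalog API
    assert len(catalog) > 0

failures = run_smoke_suite([
    ("price math", check_price_math),
    ("catalog non-empty", check_catalog_nonempty),
])
assert failures == []  # an empty list means the system looks healthy
```

The same suite doubles as the post-incident verification step: instead of asking “does everything seem okay?”, the team runs the suite and reads the list of failures.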
Platform. From my soapbox, I’ll sing the praises of open source as loudly as I can. I think they could benefit greatly from a switch of operating system, database, and development platform (on the back-end, for a highly testable data layer, at the very least).
Automated Deployment. This is the A-1, prime solution that this company most urgently needs. Tools such as Ansible, Puppet and Chef allow for scripted configuration of servers, and they handily convert 12-hour installation procedures into just minutes of hands-off automated provisioning goodness. For one of our clients, I’ve used these tools to create server-creation procedures that involve a single command. They can similarly be used to develop tools to snapshot, archive and restore entire warehouses of back-end data. Put more simply, using automated deployment tools, it is possible to create systems which can be built from the ground up in minutes by developers who are brand new to the environment. We’re in the process of handing over just such a system right now, and I can say unequivocally that, without such tools, the acclimation of the new developer would be far more expensive for our client and fraught with many more problems than we’re seeing now.
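For flavor, here is the shape of an Ansible playbook that provisions a web server from scratch. Everything in it — the host group, package names, repository URL, and paths — is illustrative, not this company’s actual configuration.

```yaml
# Illustrative sketch only: hosts, packages, and paths are placeholders.
- hosts: webservers
  become: true
  tasks:
    - name: Install the web stack
      apt:
        name: [nginx, postgresql, git]
        state: present

    - name: Deploy the application from version control
      git:
        repo: "https://example.com/app.git"   # placeholder repository
        dest: /srv/app
        version: main
      notify: restart nginx

  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
```

Because the playbook is itself version-controlled, it doubles as documentation that cannot drift out of date the way a wiki page can: the “documentation” is what actually builds the server.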
As companies become more dependent on their IT infrastructure to conduct their day-to-day operations and as users become more used to a highly available networked world, it is increasingly important that IT departments stay abreast of current tools and technologies available to mitigate risks associated with failures in their IT systems, be they failures of hardware, software, vendors or IT personnel.
If this company’s situation sounds like your own, give us a call. We live for this stuff.