Legacy System Decommissioning - A Case Study

Giulio Cesare Solaroli Case Studies May 27, 2024

Intro

Imagine a company whose main operational system still relies on functionality provided by a COBOL application running on an aging AS/400 system. For years, plans had been made to replace this ancient component, but no one seemed willing to finally pull the plug.
This is a brief account of how we helped this client fulfill this long overdue task.

Challenges

  • A legacy AS/400 application, in charge of computing prices for all bookings,
  • a single developer able to update the code, already planning for his retirement,
  • old hardware, with some components already failing and replacements no longer available from the vendor.

Scenario

The AS/400 application had long been in the process of being replaced: a new application had been developed, revamping the whole UI (the new application being web-based instead of terminal-based like the old one); many of the features had already been moved to the new application, and all of the data had already been mirrored from the AS/400 system to the new SQL Server backend.
But the old AS/400 system was still the "source of truth" with regard to booking prices and inventory availability.
A new "price engine" had been built, and it was already running on the side of the "official" (legacy) AS/400 code to compute the price of all bookings; some evidence of the differences between the data computed by the two engines were collected, but without any details useful for a deeper analysis. The only data point available was that the two engines were computing the same price around 70% of the time.

Risks

Passive Risks

The legacy AS/400 system was labelled as a "ticking bomb" for many different reasons:

  • it was running on old hardware no longer serviceable, nor covered by any effective support agreement,
  • a minor hardware component (disk cache card onboard battery) had been failing for some time, and the vendor was not able to source a replacement unit, regardless of the price,
  • only one developer was able to apply the regular changes to the legacy codebase needed to guarantee business operation continuity (the creation of some inventory required intervention on the code itself, on top of some "regular" configuration activities),
  • the single developer able to modify and deploy the COBOL codebase of the legacy application was planning to retire.

Active Risks

Decommissioning the legacy system would have required a few changes at the very core of the application architecture. All these changes were very sensitive, but each carried a different level of risk.

Source of truth

The "source of truth" would have to be moved from the legacy application to the new one; as the legacy data was already synchronized very frequently (there was a job, labelled as "near real-time", running every few seconds) and most of the further data processing was already done using the new database, this concern was labelled as "low risk".

Decommissioning the integration process

In order to keep the legacy application database synchronized with the new application, a set of intermediate structures and processes (stored procedures) had been developed and were constantly running.
To limit the impact of the changes to the system, these structures and processes were mostly retained, changing only the minimum amount of procedure code needed to rewire the affected processes to use the new components (mostly the price engine) instead of the legacy ones.
This means that the legacy flow of information was retained even though it was no longer needed in the configuration we were aiming to implement; we decided to keep some extra complexity in the system (aka "technical debt") in order to limit the amount of changes needed to the procedures involved in the affected processes.
We labelled this solution as "medium risk" because it affected the final state of the transition to the new application, leaving around vestigial structures that were no longer needed; more moving parts meant more opportunities for issues to emerge, and harder times investigating and resolving them.
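To make the idea of "rewiring" more concrete, here is a hypothetical sketch of the kind of minimal change applied to an existing integration procedure; the procedure, schema, and table names are illustrative and not taken from the actual system.

```sql
-- Hypothetical example of the minimal "rewiring" applied to an existing
-- integration procedure: only the source of the price changes; the rest of
-- the procedure and the structures it feeds stay untouched.
ALTER PROCEDURE dbo.usp_RefreshBookingPrice
    @BookingId INT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE b
    SET    b.TotalPrice = p.TotalPrice
    FROM   dbo.Bookings AS b
    -- Before: prices came from the structures mirrored off the AS/400 system.
    -- JOIN as400.BOOKING_PRICES AS p ON p.BOOKING_ID = b.BookingId
    -- After: prices come from the output of the new price engine.
    JOIN   dbo.PriceEngineResults AS p ON p.BookingId = b.BookingId
    WHERE  b.BookingId = @BookingId;
END;
```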

Price Engine replacement

This was the core component being replaced, with a direct impact on the prices computed for users. The biggest criticality we identified was related to changes to already-booked bookings: the system would recompute the price even when marginal changes were applied (e.g. adding your passport number to the booking information). This extensive re-computation of prices could have exposed differences between the prices computed by the two "price engines" to the final users, with the added possibility of marking some fully-paid bookings as having amounts due (if the new price was higher than what had already been settled).
To limit this risk, we worked to improve the effectiveness of the data collected during the "parallel run" of the two price engines, so that we could understand which components of the final price actually differed between the two systems.
When we started planning the decommissioning of the legacy application, the system was only "counting" how many price computations produced different results between the two systems, without tracking any details; this information gave us no way to observe the differences and identify the core issues that would need fixing before the final cut-off date.
One of the very first steps we took in this regard was to greatly increase the level of detail recorded about the prices computed by the two engines whenever they produced a different answer.
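A minimal sketch of the kind of structure that can support this follows; it assumes the comparison is recorded per price component, and all names and column choices are hypothetical rather than taken from the client's system.

```sql
-- Hypothetical structure used to record, per price component, the values
-- computed by the two engines whenever they disagree.
CREATE TABLE dbo.PriceEngineDiff (
    DiffId       INT IDENTITY(1,1) PRIMARY KEY,
    BookingId    INT            NOT NULL,
    ComputedAt   DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME(),
    Component    NVARCHAR(50)   NOT NULL,   -- e.g. base fare, taxes, discounts
    LegacyAmount DECIMAL(12, 2) NOT NULL,   -- value computed by the AS/400 engine
    NewAmount    DECIMAL(12, 2) NOT NULL,   -- value computed by the new price engine
    Delta AS (NewAmount - LegacyAmount) PERSISTED
);
```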
After collecting these differences for a few days, we could analyze the data and identify a few scenarios worth investigating. This proved to be quite effective in fixing a few problems, which we could then validate as "fixed" once they stopped showing up in the subsequent reports built from the collected comparison data.
We iterated this activity a few times, until all the remaining differences were marginal, both in terms of economic value and in terms of the scenarios triggering them.
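A report along these lines, built on the hypothetical PriceEngineDiff table sketched above, could be as simple as an aggregation by price component, sorted so that the biggest contributors surface first:

```sql
-- Hypothetical report: aggregate the last week of recorded differences by
-- price component, so the biggest contributors can be investigated first.
SELECT   Component,
         COUNT(*)        AS Occurrences,
         SUM(ABS(Delta)) AS TotalAbsoluteDelta,
         MAX(ABS(Delta)) AS WorstCaseDelta
FROM     dbo.PriceEngineDiff
WHERE    ComputedAt >= DATEADD(DAY, -7, SYSUTCDATETIME())
GROUP BY Component
ORDER BY TotalAbsoluteDelta DESC;
```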

Interesting finding

A very interesting finding of this activity was that some of the discrepancies between the two price engines were caused by "errors" in the legacy system, and not by "bugs" in the new application.
Indeed, there were reports of issues with prices computed by the legacy application that the single developer had never been able to address, and that business people were handling by adding manual adjustments to the final price of the booking; the new application, having been built by providing developers with the "business rules" (and not the legacy code), was able to correctly handle some scenarios that previously had to be amended by hand.

Risk final considerations

After having identified all the major sources of risk, we planned and acted to measure and actively mitigate the major issues for which we had some leverage.
For the risks over which we had no leverage (e.g. hardware issues), we regularly monitored the situation to get constant feedback and inform the planning of the remaining activities.

Actions

Feature freeze

We agreed with the business on a three-month "feature freeze" on the system, in order to focus all of the developers' and system administrators' energy and attention on the activities needed to decommission the legacy application.
We had great help from the support team, which handled all the business requests and expectations on our behalf, allowing us to focus on the daunting task at hand.
This caused a significant pile-up of activities to be addressed after the switch-over was completed, but we were eventually able to handle all the requests, just with some extra latency.

Disabling users' direct access to the legacy application

Some users were still using the legacy application directly (as shown by evidence collected from the logs), and we had to investigate why this was still the case. As the reasons were mostly old habits rather than actual missing features in the new application, we disabled all direct access to the legacy application a few weeks before the cut-off date, giving users some time, while working only with the new version of the application, to report any remaining issues they might have forgotten to mention. After some initial minor complaints about old habits being "decommissioned", all users were able to accomplish their tasks with the new application.

Aliasing all the legacy system structures

In order to preserve as much as possible of the integration infrastructure, and to avoid breaking all the stored procedures involved, we created a clone of the database structures managed by the legacy system, and we moved all references from the old structures to the new ones.
This allowed us to avoid having a massive number of stored procedures failing to compile, with the risk of the whole system grinding to a halt.
Preserving these now-useless structures instead of fully removing them meant giving up the opportunity to streamline the system and tackle some long-overdue maintenance tasks, but it allowed us to avoid the headache of having to fix all compilation errors off the bat, and instead to monitor who would still write to the legacy structures, in order to spot issues early on.
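One possible way to implement this kind of aliasing in SQL Server is sketched below. It assumes (this is not stated in the article) that the stored procedures reach the legacy structures through synonyms in a dedicated schema (here "as400"), so that re-pointing a synonym is enough to redirect them to a local clone, and that a simple trigger records anything that still writes to the cloned structure; all object names are hypothetical.

```sql
-- 1. Clone structure and data of a legacy table into the new database
--    (the "as400" synonym still points at the legacy mirror at this stage).
SELECT *
INTO   dbo.BOOKING_PRICES_CLONE
FROM   as400.BOOKING_PRICES;

-- 2. Re-point the name used by the existing stored procedures at the clone.
DROP SYNONYM IF EXISTS as400.BOOKING_PRICES;
CREATE SYNONYM as400.BOOKING_PRICES FOR dbo.BOOKING_PRICES_CLONE;
GO

-- 3. Record anything that still writes to the legacy structure, to spot issues early.
CREATE TABLE dbo.LegacyWriteAudit (
    AuditId     INT IDENTITY(1,1) PRIMARY KEY,
    TableName   SYSNAME       NOT NULL,
    WrittenAt   DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME(),
    HostName    NVARCHAR(128) NULL,
    ProgramName NVARCHAR(128) NULL
);
GO

CREATE TRIGGER dbo.trg_BookingPricesClone_WriteAudit
ON dbo.BOOKING_PRICES_CLONE
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.LegacyWriteAudit (TableName, HostName, ProgramName)
    VALUES ('dbo.BOOKING_PRICES_CLONE', HOST_NAME(), APP_NAME());
END;
```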

Cut-off date

The activities involved in the actual decommissioning were pretty intense, so we allocated a full weekend to perform all the tasks minimizing the impact on the business operations.
As most of the activity on the system was generated by the US offices, we selected the weekend before July 4th (which fell on a Monday in 2022), as this gave us an extra day of low business activity during which we could monitor the new configuration of the system and validate the success of the operation.

Outcomes

When operations started on Monday morning (July 4th, 2022) everything was apparently working as usual, so much so that some users (aware of the massive operation happening during the weekend) asked whether the decommissioning had actually happened or had been rolled back.
The major issue reported was a missing report (an important one, though) among those produced every morning; identifying the issue and generating the missing report took a few hours, and the users eventually received it before the end of the business day, instead of in the early morning as they were used to. Once the missing report was restored, no other issues were reported.
When US operations started on Tuesday morning (July 5th, 2022, EST), the system was in working order.
Overall performance was slightly improved, but not with the benefits we were hoping to achieve; keeping a lot of the legacy structures in place did not allow us to fully streamline the core operations, so the performance gains were limited.

Celebrations

The decommissioning of this system was a massive success, allowing the company to neutralize significant risks and remove constraints that had hindered its autonomy in evolving its business system. Unfortunately, the demands of daily operations and of processing the queued-up tasks quickly took priority. As no major issues emerged in the following weeks, the achievement was soon archived, and we never took the time to celebrate it as it deserved.

Conclusions

Decommissioning legacy systems is usually a complex task, especially when it involves extensive change management activities across the company due to the many processes affected by the switch. These activities carry inherent technical risks, but their "blast radius" is much wider when considering business operations. Most of the time, it is not possible to fully address and defuse all potential risks. The best approach is to identify the main risk issues and then constantly monitor, measure, and assess their status in preparation for the migration. This helps inform and adapt necessary actions to achieve the desired outcome. Such clarity is also extremely helpful in managing communications with stakeholders, as keeping everyone informed is paramount to the project's success.


About the Author

Giulio Cesare Solaroli aids teams and organizations in:

  • Analyzing business domains through techniques like EventStorming
  • Identifying domains, subdomains, and value streams
  • Coaching and mentoring business, product, and technical teams
  • Designing and developing software solutions

His goal is to steer companies towards a more efficient and competitive digital future, offering expertise and support every step of the way.

More about Avanscoperta

Avanscoperta is an ecosystem of professionals with a great passion for learning: we love exploring new territories, exchanging experiences and ideas in the field of software in its broadest possible sense.

Check out the full list of our upcoming training courses: Avanscoperta Workshops.

Stay in touch!

Do you want to keep reading our articles? Subscribe to our Newsletter 📩.

Giulio Cesare Solaroli

Giulio Cesare brings extensive experience in managing information systems across diverse sectors including banking, cruise tourism, insurance, telecommunications, publishing, research, and start-ups.
