Posted on

Application rewrites?

Legacy applications…

… the regular pain in everyone’s back.

Do you have a legacy application that needs updating or even a rewrite?

Application rewrites usually fail through a combination of factors. The common culprits listed below are the primary reasons for failure, not the language the application is written in, the conversion from one language to another, or one language being unable to do the work of the other.

Any Turing-complete language could literally do the job, even such a hellish language as brainf*ck.

This shows that the language in itself is not the problem. More often than not, it comes down to one question: what is the right tool for the job?

I prefer working with Go, as it is a modern language that works equally well on almost any platform, and it is fast to develop and get working results in.
It also has the benefit of being close enough to many other languages that developers coming from those can understand it without any issues.

Also, Go typically does not have the inherent “legacy library hell” issue that many other languages like Python, Java or C++ suffer from, as it has a modern take on this, with efficient ways of keeping dependencies up to date through various mechanisms.

So what ARE the big issues then?

Let’s start with a classic “problem child” of yesteryear, COBOL, as a symbol of the landscape of more “mature” languages. I use it as an example because the very same core issues apply to pretty much any language.

It really doesn’t matter what the language is; the underlying problems are commonly the same for all legacy applications.
Let’s look at a few key points.

1. Lack of Documentation and Knowledge

  • Legacy systems like COBOL often lack detailed, up-to-date documentation. Over time, original developers may leave, and institutional knowledge is lost.
    This often enough comes from the claim that the code is the documentation in itself. This has never been true, nor will it ever be true.
    The code is merely the implementation of the specification, a record of what you did, and never the documentation itself, specifically because the code never describes the original intent of what you set out to do. I will accept documentation in the code on the condition that the comment preceding the code block explains your intent, what you plan to achieve, what it returns and so on, and that it is written before you actually write the code. This helps maintainability, as you can now check the code against this comment to see if it actually does what you said it would do.
  • Business rules, logic, and workflows are often “baked into the code” without external references, making it hard to replicate functionality correctly.
  • Another common example in many legacy applications is the use of “magic values”: poorly documented or undocumented numbers that have specific meanings but are used throughout the code. Changing such a value can have catastrophic effects, especially where you do not expect it.
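
To make the two points about intent comments and magic values a bit more concrete, here is a minimal sketch in Go. The status codes, their meanings and the function name are all hypothetical, made up purely for illustration; the point is only that the bare numbers get names, and that the intent is written down before the implementation.

```go
package main

import "fmt"

// Hypothetical status codes, of the kind that tend to show up as bare
// numbers scattered through a legacy code base. The values are invented
// for this example.
const (
	StatusActive    = 1 // account is open and may transact
	StatusSuspended = 7 // account is frozen pending review
	StatusClosed    = 9 // account is closed, no transactions allowed
)

// CanTransact reports whether an account may post a transaction.
//
// Intent: only active accounts may transact. Suspended and closed
// accounts, and any status code we do not recognise, must be rejected,
// so that an unknown value fails safe.
// Returns: true if the transaction is allowed, false otherwise.
func CanTransact(status int) bool {
	return status == StatusActive
}

func main() {
	for _, s := range []int{StatusActive, StatusSuspended, StatusClosed, 42} {
		fmt.Printf("status %d -> allowed: %v\n", s, CanTransact(s))
	}
}
```

With the intent written down, a reviewer can compare the one-line implementation against what was promised, instead of guessing what the number 7 was supposed to mean.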

2. Underestimating System Complexity

  • These systems have grown organically over decades, often integrating with other systems and processes in undocumented or implicit ways, sometimes using protocols that no longer exist or are themselves poorly documented. This is often the case with proprietary protocols, and even more so when custom hardware is involved. Never mind the eternal curse of undocumented storage formats, especially binary ones.
  • Dependencies are not always well understood, leading to gaps in the new implementation.
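
As an illustration of what taking the curse off a binary storage format can look like, here is a small sketch that writes the layout down once, in the code and in the comment, and decodes against it. The 12-byte record, its field names and its byte order are assumptions invented for this example and do not describe any real system.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// LegacyRecord documents a hypothetical 12-byte, big-endian record.
//
//	offset 0, 4 bytes: account number (unsigned)
//	offset 4, 2 bytes: status code (unsigned)
//	offset 6, 2 bytes: reserved, expected to be zero
//	offset 8, 4 bytes: balance in minor units (signed)
type LegacyRecord struct {
	Account  uint32
	Status   uint16
	Reserved uint16
	Balance  int32
}

// DecodeRecord parses one record from raw bytes.
//
// Intent: turn exactly one fixed-size record into a typed value and
// return an error, rather than garbage, if the input is too short.
func DecodeRecord(raw []byte) (LegacyRecord, error) {
	var rec LegacyRecord
	err := binary.Read(bytes.NewReader(raw), binary.BigEndian, &rec)
	return rec, err
}

func main() {
	// Sample bytes: account 12345, status 1, reserved 0, balance -100.
	raw := []byte{0x00, 0x00, 0x30, 0x39, 0x00, 0x01, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0x9C}
	rec, err := DecodeRecord(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", rec)
}
```

The struct and its comment then double as the format documentation that the next developer would otherwise have to reverse engineer from a hex dump.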

3. Scope Creep and Poor Requirements Gathering

  • Stakeholders might not fully articulate all requirements or fail to prioritize them.
  • The rewrite team might inadvertently “over-simplify” or “over-engineer” the replacement, causing mismatches with actual needs.
  • While there may have been initial documentation, new additions and rewrites rarely or never get captured in it, as many developers still think “code is the documentation”.
    I have never, even until today, seen documentation being a priority to any greater extent in any commercial product, with the exception of mission-critical systems such as aerospace, the oil industry, nuclear, and to some degree medical or similar, and even then it is often by no other means than force of regulation. From what I have seen over my years so far, anything in the financial industry looks more like a joke than anything else.

4. Mismatch Between New and Existing Systems

  • COBOL systems often interact with old, niche hardware and protocols that are difficult to replicate or interface with modern platforms.
    See my previous point.
  • Rewrites might inadvertently introduce performance bottlenecks or fail to handle edge cases that the legacy system managed. Again, this is often down to poorly understood original requirements and specifications that may not even be available anymore, together with the fact that many developers simply will not sit down and read such documentation to actually understand what the code originally does.
    On the commercial side, there is rarely time allocated for any of this anyway, and you end up paying for it over and over again, often at costs exceeding what it would have taken to allocate the time initially and do it as right as you can from the start.

5. Cultural and Organizational Resistance

  • Organizations often resist change, especially when it involves mission-critical systems.
  • Lack of buy-in from stakeholders or fear of disrupting operations hampers the process.

6. Testing Challenges

  • Legacy systems often run for years without interruption, with real-time updates and transactions.
    This often means that you have no clear understanding of what is really running in the machine, especially when hot patches have been applied.
    While it can be simple enough for small systems, with bigger systems the complexity often grows exponentially.
  • Rewriting introduces risks, and testing environments struggle to replicate the production workload, leading to missed issues.

7. Skill Gaps

  • Teams tasked with rewriting may lack knowledge of legacy systems and their quirks.
  • Similarly, COBOL developers might not be part of the rewrite team, leading to a disconnect between old and new paradigms and a lack of knowledge transfer, especially regarding the original intent and meaning of certain things.

8. Cost and Time Overruns

  • Rewrites frequently underestimate the effort required, both in terms of budget and time.
    This is often down to poor pre-analysis and understanding of the complexity of the task.
    A rewrite is almost always more complex than writing a new application from start, because of all the hidden complexities.
  • Incremental delays add up, and as costs mount, projects are abandoned or deemed infeasible.

9. Failure to Preserve Legacy Business Logic

  • Legacy systems encode decades of evolving business logic.
  • Translating this logic accurately to new systems without introducing errors is extremely challenging.
  • As a consultant on such matters, I often come to the point where the recommendation is to simply start over, writing the functionality from a clean start, based on the current understanding of, and requirements for, the business logic.
    For such projects documentation (specification, documentation, and intent comments in the code) is always a high priority for future maintainability.

Key Point: Lack Of Documentation Amplifies All Other Problems

When documentation is lacking, every other issue is compounded:

  • Reverse engineering logic consumes enormous time and resources, and the risk of missing quirks, hidden behaviors etc. grows with the size and complexity of the application.
  • Testing becomes harder because edge cases are unknown.
  • Training new developers is significantly more difficult.

Solution Approaches?

  • Incremental Modernization:
    Instead of a full rewrite, gradually modernize and refactor specific components.
    Where possible, break out the individual piece and serve it in a new setting.
    One issue at a time.
  • Automated Code Analysis:
    Use tools to extract business logic and system dependencies.
    This is one of the things where AI tools can actually make sense: capturing very complex logic and behaviors and putting them into simpler words and shortened versions that are easier to understand, giving the developer a head start in understanding what they are looking at.
  • Collaborative Teams:
    Combine legacy system experts with modern tech specialists.
    Do not leave the “legacy teams” behind. They can be the absolute key to the success of your rewrite!
  • Prioritize Documentation:
    All documentation should focus on practical maintainability. 

    Document as much as possible before starting the rewrite and throughout the process.
    The documentation and specification are, in the end, the benchmark and test specification against which you measure “are we there yet?”
    Do we have the correct and expected behavior?
    Also, for the documentation, don’t overdo it.
    There is a very valuable balance between detail and general overview.
    The specifications should be the absolute, non-negotiable requirements.
    The maintenance documentation should be practical how-tos with the necessary examples and details.
    The protocols and data items should be explained in detail, as this is the basis of all logic.
    The data flows should be made clear between the components.
    NB!! The code is the explicit implementation – NOT documentation.
    I am sorry, but if you claim that the code IS the documentation, you are wrong, for the very reason that anyone can read what you did, but not what you intended to do. Because of this, you should always write “intent documentation” as a block before the actual code, and before you write the actual code. This way, you or anyone else has a fighting chance of correcting mistakes made.
    This intent documentation often happens to be the same as the specification, and it then becomes relatively easy to compare the intent to the implementation to see where the bug is.
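
The idea of the specification as the benchmark can be sketched roughly like this: the non-negotiable requirements are written down as data, and any implementation, legacy or rewritten, is measured against them. The fee rule, the type and the function names below are hypothetical and only there to show the shape of the approach.

```go
package main

import "fmt"

// SpecCase is one line of the specification: an input and the result
// the business requires for it.
type SpecCase struct {
	Name         string
	AmountCents  int64
	WantFeeCents int64
}

// Spec holds a hypothetical fee rule: 1% of the amount, rounded down,
// with a minimum fee of 50 cents, and no fee for a zero amount.
var Spec = []SpecCase{
	{"zero amount is free", 0, 0},
	{"minimum fee applies", 1000, 50},
	{"one percent above minimum", 10000, 100},
	{"fee rounds down", 10099, 100},
}

// Fee calculates the transaction fee in cents.
//
// Intent: implement the rule described on Spec, nothing more.
func Fee(amountCents int64) int64 {
	if amountCents <= 0 {
		return 0
	}
	fee := amountCents / 100 // integer division rounds down
	if fee < 50 {
		return 50
	}
	return fee
}

// CheckSpec runs an implementation against the specification and
// returns a description of every case where it deviates.
func CheckSpec(impl func(int64) int64, cases []SpecCase) []string {
	var failures []string
	for _, c := range cases {
		if got := impl(c.AmountCents); got != c.WantFeeCents {
			failures = append(failures, fmt.Sprintf("%s: got %d, want %d", c.Name, got, c.WantFeeCents))
		}
	}
	return failures
}

func main() {
	failures := CheckSpec(Fee, Spec)
	if len(failures) == 0 {
		fmt.Println("implementation matches the specification")
		return
	}
	for _, f := range failures {
		fmt.Println("MISMATCH:", f)
	}
}
```

In an incremental modernization, the same table could be run against both the old component and its replacement, which gives a fairly direct answer to “are we there yet?”.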

Why Modern Languages Like Go?

  • Go offers significant advantages for rewriting legacy systems:
    • Simplicity: A minimalistic design reduces complexity, making it easier for teams to adopt.
    • Concurrency: Built-in support for concurrency enables efficient handling of modern workloads.
    • Performance: Go compiles to native code, which gives high performance that comes close to C/C++ in many scenarios.
    • Deployment: Go’s single-binary model simplifies deployment processes, especially in cloud-native environments.
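
As a small taste of the built-in concurrency support, here is a sketch that processes a batch of items in parallel using goroutines and a `sync.WaitGroup` from the standard library. The function name `ProcessAll` is my own for this example, and a real workload would normally also limit the number of goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessAll applies work to every item concurrently and returns the
// results in the same order as the input.
//
// Intent: one goroutine per item, each writing only to its own slot in
// the result slice, so no locking is needed.
func ProcessAll(items []int, work func(int) int) []int {
	results := make([]int, len(items))
	var wg sync.WaitGroup
	for i, item := range items {
		wg.Add(1)
		go func(i, item int) {
			defer wg.Done()
			results[i] = work(item)
		}(i, item)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	squares := ProcessAll([]int{1, 2, 3, 4}, func(n int) int { return n * n })
	fmt.Println(squares) // prints [1 4 9 16]
}
```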

Conclusion

Rewriting an application is a significant undertaking, but with careful preparation, stakeholder alignment, and the use of modern tools and languages, it can transform outdated systems into robust, efficient platforms. By addressing risks head-on and employing best practices, organizations can successfully modernize their applications while minimizing disruption and maximizing value.

Do you have legacy applications that need to be reworked, modernized or documented?
… all while using modern tools and technologies, and keeping future maintainability and support in mind?

Let’s talk.

Posted on

AI use and security

Thoughts on AI, Security and practical day to day use.

As mentioned before, I am involved in R&D on a similar branch with “AI Hardware”, and this brings me to AI and its more general use.

These days there is almost a competition going on out there about using AI wherever possible, regardless of whether it’s needed, practically usable, or actually serves a purpose. That is quite understandable, because it’s quite hard to sell a product and be competitive without having the word AI crammed in somewhere in the sales pitch these days.

So let’s have a little bit of a pragmatic look at it.

So what is an AI?

Most of the AIs today are LLMs (Large Language Models), based on software that emulates neurons and uses large masses of data (the internet) as training material.
There are basically three models for how you train an AI, but most importantly, no AIs are trained on the fly, because that would effectively destroy the neural network setup in flight as things are now.
We are simply not there with dynamic LLMs just yet.

You would only retrain the models based on existing data plus any additional data gathered in the meantime, as the training is very taxing on computational and financial resources.
This in turn means that live leakage has a very low risk, but the risk of future leakage is still there due to the incorporation of training material that may be gathered from questions and other supplied data.

  • The basic training model is where the AI gets a set of data and trains itself on it without supervision.
    Obviously this is not a very good way to train a model, and the outcome is rarely what you wanted it to be, but it all starts with this one.
  • The second model is supervised training, where you give it hints as to what is correct or not, sometimes rules that must be met, encouraging the model to make the right decisions or tweak its output to match. This model is the one typically used in large language model training, where more expressive solutions and reasoning are required based on the input, as it offers a greater deal of freedom compared to the third model. It also allows for evaluation of incorporated third-party data as part of the input to create a reasonably balanced, but not always correct, output.
    You can quite easily force the model to provide an incorrect answer by forcing it to accept incorrect source data and telling it that it can’t deviate from this.
  • The third model is the most restrictive, where you have “punishment” involved, essentially killing the models that give the wrong answer against a set of fixed rules during the training and letting only the ones that provide the most correct answer live for further refinement. This model is typically used in image recognition and similar tasks, and is the one typically applied in machine learning where you have a fixed output that you need to match against, and the AI is set to recognise with either a correct or an incorrect result as output, or where you need to identify an object.

As always, there are of course variations to the above, but it gives you a rough insight as to what it is and how it works.

A little bit of history from a developer’s perspective.

In the past, when there were only books and manuals, developers had to rely on these, and often know them by heart, to actually use them. The amount of information was quite limited and it was fairly easy. Essentially everything was written from scratch.

As we know, history happened, and the Internet came to be, and with it, things like Google. Open source solutions exploded, and then came the help sites to go with them and anything development.
Sites like Stack Exchange and many others came to be, and code samples were shared between the users. Because of the perceived security risks, many developers were banned by their companies from using the Internet to search for solutions, even for common problems, because of the “risks” involved: you could get “hacked” by a copy/paste, or you could leak information about your IP or other precious items.
This happened even in cases where it was generic information, such as an error code and what caused it in a specific product. Eventually, the “internet” was generally accepted and came to be part of everyday business life.

The primary risk of this was, and is, as always related to anyone who uncritically makes a full copy/paste after providing enough specific information for a possible hacker to write a malicious response that would work in the specific environment and solution, and who subsequently, without any consideration or review, and with no peer review on commits, implements this in the production code.

The exact same can be said for any AI, whether it’s in-app, typically in a developer IDE, or external like ChatGPT’s website. Many infosec teams seem to ignore the fact that you can look up search terms on Google to see what’s being searched, which provides the exact same purported leakage mechanism in near, if not actual, real time, whereas this typically does not apply to AIs. Never mind questions on a website like Stack Exchange, which will be there forever in full text, whereas the AI question will be ephemeral and not be reused as a verbatim ready-made answer, despite it possibly becoming part of the training material at some point down the line, as numerical weights and not actual full text.

In short – an AI will not be able to recall or reproduce a specific question from another user, because that’s simply not how AIs and LLMs work. The sessions work in isolation, but the data may later be used for training, as numerical “weights” for a specific item, never as plaintext data.

Why the AI instead of “Google”? 

Where Google and others are mainly about static content, the AI is highly dynamic and can actually understand what you want and quickly narrow down the answer to what you need, without all the “fluff” and without you having to wade through endless amounts of text and sites to get what you were looking for. It can do this by incorporating third-party sources or doing searches on your behalf to gather the information. This is exactly what makes the AI so useful: speed, and output limited to what you actually asked for. This is why the AI is quickly becoming the “Google” replacement.

An example:
Compared to Google et al., if you’re stuck on a problem, you can describe the type of problem you have and get a reasoned argument, with explanations specific to the problem, on how to solve it, without having to assemble and parse the information yourself, something that can be very tedious.

AI:

(This passes my sanity inspection as a proposal for a solution…)

… versus Google:


What about Information Security then?

A couple of ground rules when it comes to dealing with AI’s:

  • You should be careful with what you share – keep it “anonymous”.
  • Don’t share more than you absolutely need to.
  • NEVER share credentials or personal data.
  • NEVER assume that the answer is correct – only use it as a guideline or example.
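
As a rough illustration of the “keep it anonymous” rule, here is a minimal sketch of scrubbing a piece of text before it gets pasted into an AI prompt. The patterns and the function name `Redact` are my own assumptions for the example, and they are nowhere near complete; treat it as a starting point for a guideline, not as a safety net.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only. They will miss plenty of real-world
// secrets and are not a substitute for reading what you paste.
var (
	// key=value or key: value pairs with a sensitive-looking key
	secretPattern = regexp.MustCompile(`(?i)(password|passwd|secret|token|api[_-]?key)(\s*[:=]\s*)\S+`)
	// simple e-mail addresses
	emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	// dotted IPv4 addresses
	ipv4Pattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
)

// Redact replaces credentials, e-mail addresses and IPv4 addresses in
// text with neutral placeholders.
//
// Intent: remove the obviously identifying parts of a snippet while
// keeping enough of it for the question to still make sense.
func Redact(text string) string {
	text = secretPattern.ReplaceAllString(text, "${1}${2}[REDACTED]")
	text = emailPattern.ReplaceAllString(text, "[EMAIL]")
	text = ipv4Pattern.ReplaceAllString(text, "[IP]")
	return text
}

func main() {
	prompt := "Login fails with password=hunter2 for bob@example.com on host 10.0.0.5"
	fmt.Println(Redact(prompt))
}
```

Even with something like this in place, the rule still stands: look at what you are about to share before you share it.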

Again, always be the second opinion – never just copy/paste. Actually look at what was presented and make your own informed decision: does this make sense? Never assume that the answer is 100 percent correct.
This is what any responsible developer would do. And if there was malicious intent, it would be far easier for the developer to do it themselves, right there, rather than go to the AI to get it done, as the developer would then not need to explain to the AI what the environment looks like and how the specific exploit should be implemented. That is information they already have.

The goal of the infosec team here should be to embrace it rather than just ban users from using it, and to educate the staff about how to use it safely!

Prescribe pragmatic, safer ways of interacting with the AIs, because in the end they are incredibly useful tools that will not go away, just like the Internet didn’t go away and eventually had to be accepted despite the security teams’ kicking and screaming.

Trust me, it will be used no matter what anyone says, because it is just too useful not to use. The likelihood of this happening is even higher in time- and resource-pressured teams, where a lot of tedious work can be simplified and done very quickly compared to the alternatives. It is far better to have a mutual understanding of good practices, do’s and don’ts, than a skunkworks division.

Additionally, keep in mind that it is an indisputable fact that the absolute majority of security tools today use AI, be that code monitoring and validation, security tools like antivirus, API scanners and many others, inspecting code, classified document files etc. regardless of security markings, on pretty much any hardware the company owns or maintains. It’s already there, and if there was a leak you would likely not know about it until way too late (specifically looking at you, MS Copilot). Such an event would be a much bigger possible threat than the occasional use of AI for specific purposes by properly trained staff.

All these modern security tools are entirely based on AI or AI input / processing, and all will suffer the same issue of possible data leakage, one way or the other.

Let’s be very clear about something here:
Any claim by a tool that it will not be using the customer data is simply marketing hype and lies, because if they did not, they would soon find themselves out of business, as they would not be able to follow the evolution of code and security threats compared to their competitors. All the talk about “secure models” etc. is marketing fluff. Where do you think their current training material actually comes from?
Hint: They didn’t invent it…

If you deploy any of these AI security tools for wholesale scanning of the company IP, it makes absolutely no sense to unconditionally ban the use of AIs for developers or other creative staff at the same time. As mentioned before, staff training on proper use is the absolute key here, and a knee-jerk ban because you’re afraid of possible unknowns is absolutely NOT the answer, as all you will achieve is to create an unsafe skunkworks project. Like it or not, it’s reality…

Takehomes for the security team:

  • This is something that is here,
  • You can’t ignore it,
  • You can’t make it go away,
  • It’s here to stay.
  • Accept the fact.

Deal with it!

The only reasonable thing you can do at this point is to accept “defeat”, just as you eventually had to with the emergence of the internet, and train your staff in reasonable use that protects the company IP and personal data. Make sure that security is covered by providing working guidelines of do’s and don’ts, allowing an agreed, controlled use rather than the chaotic underground skunkworks model that will otherwise emerge regardless of what you say, and over which you will have absolutely no control.
Never mind the fact that you will effectively “outlaw” most, if not all, modern developer IDEs, which are commonly based on AI support, in part or in full, using their code models, relegating developers back to Notepad or similar “development” tools.

Trying to ban the use of AI will be as effective as the 1920s Prohibition was… (NOT!).

Then what?
You should consider specific services (and I am not plugging anyone here) like ChatGPT’s enterprise model, where you can get the benefits and control security/privacy, yet prevent any leakage and reuse for training.

If you can save an hour a day per dev, increasing their productivity, this will be an easy expenditure to justify, and you gain control over what is done, how it’s done, who does it, and on what basis they do it. It’s a win-win that will gain acceptance.

If you can’t beat them, be pragmatic and join them, making sure it’s done responsibly…