AI does a lot. But does it do the job?

  • Writer: Claas
  • 2 days ago
  • 9 min read
AI is everywhere now, and most people I talk to are already using it in some form. The success stories are easy to find, and I have to admit that I am often impressed myself, especially when looking at what these tools can produce from very limited input. Give it a few bullet points, a rough idea, or a set of documents and within seconds you get something that is structured, readable, and often surprisingly close to what you had in mind.

The jump from nothing to a first usable version is where the real magic happens. What used to take time, effort and a fair amount of thinking can now be generated almost instantly. In many cases, that first draft is not just fast, it is also good enough to work with, sometimes even better than what would have been produced manually under time pressure. That alone explains why adoption is happening so quickly.

But this is also where expectations are set in a slightly misleading way. The first result creates the impression that the rest of the process will follow the same pattern, that improving, refining, and finalizing the output will be just as smooth and efficient. In practice, that is rarely the case.

The jump from 0 to 1 is real


Once you move beyond the initial draft, things start to change. Iterating on an existing version, refining arguments, improving structure, or aligning content across multiple sections often feels less predictable and, in some cases, even counterproductive. The step from version four to five, or from eighteen to nineteen, is no longer impressive. Sometimes it leads to marginal improvement, sometimes to inconsistency, and occasionally it makes the overall result worse.

I have experienced this more than once, especially in longer co-work sessions or when working with more complex material. After several iterations, it becomes difficult to track what has actually changed, which parts were better before, and where inconsistencies have been introduced. At some point, continuing to iterate feels less efficient than starting over, which is a strange outcome given that the whole idea was to save time.

The co-work mode amplifies this effect. The more complex the topic and the more context you include, the harder it becomes to maintain consistency and clarity. Arguments drift, connections weaken, and small inaccuracies accumulate. At the same time, the system continues to consume resources, whether that is tokens, compute, or simply your own time trying to make sense of the output.

At that point, it starts to feel less like working with a reliable assistant and more like dealing with a colleague who did a decent first pass, but then lost the thread and is no longer able to bring the work to a clean conclusion.

When iteration makes things worse


The core issue is not that the output is bad. In fact, most of the time it is quite good. The problem is that it is not reliably correct, and more importantly, it is not transparent where it might be wrong. A text can read well, arguments can sound plausible, and the structure can feel coherent, while still containing gaps, inconsistencies, or subtle errors that are not immediately visible.

This creates a very specific kind of overhead. You are no longer just reading or editing, you are validating. You are checking whether statements are accurate, whether conclusions are justified, and whether the overall logic holds together. In simple cases, this is manageable. In more complex scenarios, the effort required to validate the result approaches the effort of creating it from scratch.

This is where the initial productivity gain starts to erode. The creation phase becomes extremely fast, but the validation phase becomes the bottleneck. And because you cannot easily identify where the potential errors are, you are often forced to review the entire output rather than focusing on specific parts.

It is a bit like trying to reuse a 100-page pitch deck from a different proposal when you only end up keeping two or three slides: the time spent reading and changing things far exceeds the effort of a greenfield approach.

Or take the public cases of failure. We have already seen generated content make it into external deliverables and have to be corrected afterwards, sometimes publicly. It is easy to frame this as carelessness, but that explanation is too simple. In most of these situations, I would assume the teams involved did exactly what you would expect: they worked professionally, they reviewed the output, and they applied their standard quality checks.

And still, errors slipped through.

That is the complicated part. The issue is not necessarily a lack of diligence, but that once you work with generated content at scale, it becomes significantly harder to maintain a complete overview of what has been produced, changed, and combined across iterations.

At that point, validation becomes a full reconstruction of the logic behind the output.

As a result, AI works very well as a tool for generating intermediate steps, helping to structure thoughts, or accelerating the early phases of a task. It is much less reliable as a final step when accuracy and accountability matter.

Scaling creates a control problem


As long as you use AI on your own, most of the issues are manageable. You generate something, you review it, you correct it, and you move on. The inefficiencies are there, but they stay contained because you know what you asked for and what you got back.

This changes once you move beyond individual use.


The amount of output increases quickly, but the ability to validate that output does not keep up. A single result can be reviewed. A handful as well. But once multiple people are using AI in parallel and producing content across documents, analyses, and presentations, the situation becomes harder to control.

One additional effect starts to show up in teams. You receive inputs, drafts, or intermediate results from colleagues, and it is no longer clear how they were created. Some parts may be written manually, some generated, some edited after the fact. In many cases, you do not know whether the content has been fully reviewed or just lightly adjusted.

That uncertainty changes how you work. You assume a certain level of quality because it comes from a colleague, but at the same time you know that parts of it may be generated and not fully validated. So either you trust it and take a risk, or you re-check it and lose time.

Both options are problematic, and because everyone is in the same situation, a subtle dependency builds up. Each person assumes that someone else has already validated the content at an earlier stage. Over time, that assumption becomes the weakest point in the system.

The result is not a single obvious error, but a gradual loss of control across the whole workflow. At that point, the question is no longer whether individual outputs are correct, but whether the overall system still produces reliable results.

This is not a tool problem


At this point, it becomes difficult to explain the situation as a limitation of the tools.

The tools are doing what they are supposed to do. They generate content quickly, they adapt to input, and they support a wide range of tasks. The issues appear when that output becomes part of a larger workflow that depends on consistency and accountability.

The real shift is not in the tool, but in the way work is produced. Parts of the work are no longer created directly by a person who owns every step, but are generated, transformed, and combined across multiple iterations and contributors. That makes it harder to trace how a result was produced and which parts have actually been validated.

In that sense, AI is already closer to a resource than to a traditional tool. It contributes to the output, but it is not managed in the same structured way as other resources in an organization.

A simple comparison helps to make this visible: if a new team member joins, you define what they are responsible for, what level of quality is expected, and how their work will be reviewed. You do not rely on implicit assumptions.

With AI, this structure is often missing. Tasks are assigned without clear boundaries, expectations are not made explicit, and validation happens inconsistently.

The result is predictable. Output increases, but clarity does not.

Control comes from structure, not trust


If you accept that the problem is not the tool but the way work is organized, the next step is quite straightforward.

You cannot rely on trust alone. Not because the tool is bad, but because the output is not transparent enough to make blind trust a viable option. You need a way to work with it that keeps things understandable and verifiable.

In practice, this usually means reducing complexity rather than increasing it. Large, open-ended tasks tend to create results that are hard to evaluate. They mix structure, content, assumptions, and conclusions into one output, and if something is off, it is difficult to isolate where the problem actually is.

Breaking work down into smaller, clearly defined steps changes that dynamic. Instead of asking for a complete solution, you define individual parts with a clear purpose and an expected outcome. Structure is created first, then content is added, then specific sections are refined. Each step can be reviewed on its own, without having to re-evaluate everything at once.

This is not about making the process more complicated, but about making it controllable. A simple pattern that works in many cases looks like this:

  • one step focuses on structure and scope
  • another step generates or refines content
  • a separate step reviews or challenges the result

These steps can be done by different “agents” or simply as distinct phases in your own workflow. The important part is not who does it, but that the steps are clearly separated and the output of each step can be validated before moving on.
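
As a rough illustration, here is a minimal sketch of that separation in Python. The `generate` callable is a placeholder for whatever model call you actually use; the function names and prompts are purely illustrative assumptions, not any specific tool's API.

```python
from typing import Callable

# Placeholder for whatever model call you actually use (vendor API,
# local model, ...). Takes a prompt, returns the generated text.
Generate = Callable[[str], str]

def structure_step(generate: Generate, brief: str) -> str:
    """Step 1: structure and scope only -- no content yet."""
    return generate(
        "Produce an outline with one sentence of scope per section. "
        "Do not write the content itself.\n\nBrief:\n" + brief
    )

def content_step(generate: Generate, outline: str, section: str) -> str:
    """Step 2: generate or refine content for one section at a time."""
    return generate(
        f"Write the section '{section}', staying within the scope "
        f"defined in this outline:\n\n{outline}"
    )

def review_step(generate: Generate, draft: str) -> str:
    """Step 3: a separate pass that challenges the result
    instead of extending it."""
    return generate(
        "List concrete problems in the draft below: gaps, inconsistencies, "
        "unsupported claims. Do not rewrite the draft.\n\nDraft:\n" + draft
    )
```

Each function returns an artifact you can check before the next step runs, which is exactly what a single open-ended prompt does not give you.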

It is less elegant than the idea of one continuous flow that produces a final result, but it is much easier to manage. And in most real scenarios, manageability is more valuable than elegance.

Where it breaks


Even with structure, there are clear limits. There are situations where AI does not provide a meaningful advantage, simply because the effort required to validate the result is as high as the effort required to produce it. This typically happens in cases where accuracy is critical and errors are not easily detectable.

Contracts, pricing documents, regulatory content, or any output where a single mistake can have consequences fall into this category. If you cannot be sure that the result is correct, you are forced to review everything in detail. At that point, the initial speed advantage disappears.

The same applies to tasks where you need to fully understand and stand behind the result. Strategic decisions, complex argumentation, or critical communication require a level of ownership that cannot be delegated. You can use AI to explore options or structure your thinking, but the final work still needs to be done and validated by the person responsible.

Another limitation appears when context becomes too large or too fluid. Long co-work sessions, multiple iterations, and large document sets tend to introduce inconsistencies over time. Even with careful structuring, it becomes harder to maintain a clear and stable result.

In all of these cases, the pattern is the same. You cannot identify where potential errors are, and you cannot reduce the validation effort to a manageable level, which leaves you with only one option: you have to check everything.

And in that moment, AI stops being a shortcut and becomes a detour.

What actually helps in practice


If you accept that AI is strong at getting you started but less reliable when it comes to finishing, the question becomes practical. How do you actually work with it in a way that holds up under real conditions? A few things have worked consistently for me.

  • First, reduce the scope of what you ask for. Large, open-ended prompts tend to produce results that are hard to evaluate. It works better to split the work into smaller, clearly defined tasks and only provide the information that is actually needed for that step. This keeps the output focused and makes it easier to review.

  • Second, make the process visible. Instead of just asking for a result, I often ask the system to explicitly document assumptions, decisions, and open questions. That usually leads to a short list of points that need to be checked, rather than forcing you to validate everything from scratch.

  • Third, control the flow of information. Rather than carrying the full context through multiple iterations, it helps to consolidate intermediate results into shorter, cleaner documents that can be used as input for the next step. This reduces noise, keeps the context manageable, and avoids reprocessing the same content over and over again.

  • Fourth, separate generation from consolidation. AI is very good at producing options, variants, and partial results. Bringing those together into a consistent, final version is a different task. In many cases, it is more effective to do that part yourself or to run it through a separate, focused step instead of trying to refine everything in one continuous flow.

  • Fifth, use validation deliberately. One approach that works well is to run the same result through an additional validation step, either with the same tool or with a different one. Asking for a critical review, inconsistencies, or potential errors often surfaces issues that are not obvious at first glance. This does not guarantee correctness, but it significantly increases the chances of catching problems early. A minimal sketch of such a step follows below.

At the same time, this only works if you treat validation as a separate step and not as a side effect. Otherwise, it quickly turns into another iteration loop without a clear outcome.
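
To make that concrete, here is a small sketch of a deliberate validation step, again assuming a generic `generate(prompt)` placeholder rather than any specific vendor API; the prompt wording is an illustrative choice, not a requirement.

```python
from typing import Callable

Generate = Callable[[str], str]  # placeholder model call, as above

def validation_step(generate: Generate, result: str) -> str:
    """A deliberate, separate validation pass: ask only for problems,
    never for a rewrite, so the step has a clear outcome instead of
    turning into another iteration loop."""
    return generate(
        "Act as a critical reviewer of the text below. Return a numbered "
        "list of:\n"
        "1. statements that may be factually wrong,\n"
        "2. internal inconsistencies,\n"
        "3. assumptions that are not made explicit.\n"
        "Do not improve or rewrite the text.\n\nText:\n" + result
    )
```

The point of returning a list of issues rather than a rewritten text is that the step produces a checkable outcome, which is what keeps validation a separate step rather than a side effect.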

Finally, be explicit about what should not be delegated. There are parts of the work where the validation effort will always be high or where full ownership is required. In those cases, using AI for exploration can still be useful, but the final output should be handled differently.

None of this is particularly elegant. But it reflects the reality of how these tools behave today.
 
 
 
