From 69% to 97% Accuracy: Engineering an LLM to Diagnose Optical Aberrations
TL;DR
Designing an optical system, like a camera lens or a microscope objective, takes months. Most of that time goes into an iteration loop where you look at the current design, figure out what's wrong, change something, and then figure out what's wrong all over again. In an optical system, "what's wrong" means figuring out the reasons why light is coming out of the system blurrier, more spread out, or more distorted than the idealized model would predict. These effects are called aberrations. Often the hardest part of the iteration loop is figuring out which aberrations are the bottleneck; addressing them is more tractable once the target is clear.
If an optical engineer tasks Claude Opus 4.7 with this diagnosis, they can expect around 69% accuracy at identifying which aberrations are the problem. That accuracy is poor on its own, and high run-to-run variance makes the output practically useless. To address these weaknesses, I built the Optical Mental Model, which performs the diagnosis with 97% accuracy and dramatically smaller variance. It works by giving the LLM the structured understanding of optical design that it doesn't acquire from training alone.
Why is aberration diagnosis hard?
Aberration diagnosis is difficult because the goal isn't simply driving these aberrations to zero. Every real lens system has nonzero aberrations and a "good" design effectively balances them against each other to achieve a performance goal. In practice, this means some aberrations are indeed driven to zero, some are left in place to cancel against others, and then some are the true limiting factors holding the design back. With many kinds of aberrations interacting in non-linear ways, an engineer can't just look for the biggest one and drive it to zero.
Methodology
To get a better sense of how Claude Opus 4.7 performs on aberration identification, I analyzed 50 optical designs hand-curated from popular textbooks. The ground truth is the author's commentary on each design, which breaks down cleanly into three categories: (1) aberrations the author labels as well-corrected, (2) aberrations labeled as residuals, and (3) a ranking of which residual "limits the design." To build the evaluation corpus, I transcribed the full lens specifications from each textbook into standardized optical prescriptions and translated the author's surrounding commentary into structured rubrics of must-mention and must-not-claim assertions.
An important note is that I chose to ignore any aberration the author did not mention by name. In other words, the LLM was not penalized for naming additional non-limiting aberrations or staying silent on them as long as it did not contradict the author's commentary.
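The rubric scoring described above can be sketched roughly as follows. This is a minimal illustration rather than the evaluation code used for these results: the names `Rubric` and `score_response` are hypothetical, and the per-assertion verdicts are assumed to come from the judge.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Hypothetical rubric for one design, derived from the author's commentary."""
    must_mention: list[str] = field(default_factory=list)
    must_not_claim: list[str] = field(default_factory=list)


def score_response(rubric: Rubric, verdicts: dict[str, bool]) -> float:
    """Return the fraction of rubric criteria satisfied by one response.

    `verdicts` maps an assertion to whether the judge found it in the
    response. Assertions absent from the rubric are never looked up, so
    naming (or omitting) an aberration the author did not mention cannot
    change the score.
    """
    total = len(rubric.must_mention) + len(rubric.must_not_claim)
    passed = 0
    for assertion in rubric.must_mention:
        passed += verdicts.get(assertion, False)      # must be present
    for assertion in rubric.must_not_claim:
        passed += not verdicts.get(assertion, False)  # must be absent
    return passed / total if total else 1.0
```

Treating every criterion with equal weight is an assumption of this sketch; a rubric could just as well weight the "limits the design" ranking more heavily.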
For the baseline, Claude Opus 4.7 was given the same information an optical engineer would use to diagnose a design by hand. Each design was evaluated 10 times to account for run-to-run variance. The pipeline for each trial was as follows: I pasted the optical prescription (i.e., the numbers that define an optical system) plus an 11-section analysis of the prescription (Seidel, ray fans, OPD, MTF, Zernikes, chromatic shift, etc.) in a single message. I also included a short system prompt explaining the agent's role as an optical aberration diagnoser in order to standardize the response format across rounds. The model's response was then graded with Claude Sonnet 4.6 as a judge. The judge determined whether the free-form response from the diagnoser model hit each of the criteria specified in the evaluation case. Its role was minimal; essentially, it served as a more powerful regex matcher. To validate the judge, I hand-graded a random sample of its verdicts spanning passing and failing trials across all 50 cases, and the judge's grades agreed with my own reading in every instance.
Baseline Claude results
The baseline model suffers from high variance, with 20 of the 50 cases oscillating between 0 and 100 percent of criteria covered across rounds. When the baseline is wrong, whether consistently on a specific design or randomly across rounds, the wrong-round answers are clustered into two main failure modes.
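One way to flag the oscillating cases is to summarize each design's per-round scores. The sketch below is illustrative only; `summarize_case` is a hypothetical helper, and it assumes each round's score is the percent of rubric criteria covered.

```python
from statistics import mean, pstdev


def summarize_case(round_scores: list[float]) -> dict:
    """Summarize one design's scores (percent of criteria covered per round)."""
    return {
        "mean": mean(round_scores),
        "stdev": pstdev(round_scores),
        # True when at least one round scored 0 and at least one scored 100
        "oscillates": min(round_scores) == 0 and max(round_scores) == 100,
    }
```

A case with a respectable mean can still be flagged here, which is the point: the average hides rounds in which the diagnosis was entirely wrong.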
The first failure mode of the baseline is treating the raw magnitude of Seidel/Zernike values as the only analysis metric that matters. In one example, a naive reading of "biggest value = problematic aberration" leads the LLM to name distortion as the limiter when the correct answer is oblique spherical aberration. In another design, the longitudinal chromatic aberration sits at −0.042 waves, well below the λ/14 RMS diffraction-limit threshold (Strehl ≥ 0.8), so the LLM labels axial color as corrected. However, the author explains that axial color is a real chromatic residual that lives in the higher orders.
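The link between λ/14 RMS and Strehl ≥ 0.8 comes from the Maréchal approximation, and the magnitude-only check that misleads the baseline is a single comparison. The sketch below shows both. The function names are hypothetical, and treating the −0.042 waves figure as directly comparable to an RMS threshold is an assumption made purely for illustration.

```python
import math

DIFFRACTION_LIMIT_RMS_WAVES = 1 / 14  # the λ/14 RMS threshold named in the text


def marechal_strehl(rms_waves: float) -> float:
    """Approximate Strehl ratio from RMS wavefront error in waves
    (Maréchal approximation: S ≈ exp(-(2πσ)^2))."""
    return math.exp(-((2 * math.pi * rms_waves) ** 2))


def passes_magnitude_check(value_waves: float) -> bool:
    """The naive test: is the magnitude under λ/14?"""
    return abs(value_waves) < DIFFRACTION_LIMIT_RMS_WAVES
```

At exactly λ/14 the approximation gives a Strehl ratio slightly above 0.8, and `passes_magnitude_check(-0.042)` returns `True`. That is precisely the reading that labels axial color as corrected even though the author identifies it as a real residual.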
The second failure mode of the baseline is naming the right "family" but the wrong specific aberration. For example, in one design the LLM consistently names secondary spectrum as the dominant residual when the actual limiter is higher-order spherochromatism. Both belong to the chromatic "family," but they are different aberrations. In another case, the correct answer is "secondary spectrum of lateral color," but the baseline names the secondary spectrum of axial color as the limiter in all 10 rounds.
The remaining baseline errors are a long tail of one-off failures: sign omissions, cross-section confusions, miscellaneous lookalikes.
Failure-aware Claude results
I spent months exploring variations on system prompt engineering alone. For example, one would expect that describing the two concrete failure modes above in the system prompt would raise the LLM's scores. However, on average the LLM performs nearly the same.
At first glance, it seems like the system prompt additions did not affect the results. While that's true for the overall statistics, case-by-case it is not: 105 rounds got better and 113 rounds got worse.
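A tally like this can be produced by comparing the two runs round by round. The sketch below is one way to do it, not necessarily how the counts above were computed: `compare_rounds` is a hypothetical helper, and pairing round i of the baseline with round i of the variant is an assumption, since rounds are independent samples.

```python
def compare_rounds(
    baseline: dict[str, list[float]],
    variant: dict[str, list[float]],
    tol: float = 1e-9,
) -> dict[str, int]:
    """Count rounds where the variant scored better, worse, or the same.

    Both arguments map a case id to its list of per-round scores.
    """
    counts = {"better": 0, "worse": 0, "same": 0}
    for case_id, base_scores in baseline.items():
        for base, new in zip(base_scores, variant[case_id]):
            if new > base + tol:
                counts["better"] += 1
            elif new < base - tol:
                counts["worse"] += 1
            else:
                counts["same"] += 1
    return counts
```

When the "better" and "worse" counts are both large and nearly equal, the aggregate score barely moves even though many individual rounds changed.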
After analyzing each case, I concluded that the lack of improvement stems from the LLM applying the advice uniformly rather than selectively. On designs whose true limiter was a primary aberration, it would reach instead for a higher-order explanation. On others, it would initially name the correct residual, only to reason its way to a different aberration within the same family. This weakness stems from the central complexity of optical aberration diagnosis: sometimes one number tells the whole story; other times, numbers interact in subtle ways, so local results cannot be examined in isolation. An LLM cannot be taught to distinguish these cases through system prompting alone.
Building the Optical Mental Model (OMM)
To overcome the fundamental limitations of system prompting alone, I built the Optical Mental Model. The OMM combines two layers of engineering: a deterministic layer that pre-processes the analysis output into structured aberration signatures, and a system prompt that scripts the LLM's reasoning workflow.
The deterministic layer contains 14 aberration detectors. Each extracts a different signature from the raw analysis data. This data includes Zernike mode coefficients sampled at multiple field points and wavelengths, ray-fan polynomial coefficients (3rd, 5th, 7th order), field-growth slopes of individual modes, wavelength-dispersion slopes, sign-reversal patterns, pupil cross-ratios (marginal vs. 0.707 pupil zone), diffraction-limit-anchored magnitudes (λ/14 RMS, Strehl ≥ 0.8), form-class structural matches, and others. The detectors collectively translate the wall of raw numbers into structured evidence for which named aberrations are present. Then, a downstream model reads the 14 signatures together and outputs which residuals are limiting the design, rather than just listing the output of each detector on its own.
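The shape of such a deterministic layer might look roughly like the sketch below. Everything here is hypothetical: the `Signature` fields, the `Detector` type, and the text rendering are assumptions chosen to illustrate the pattern, not the OMM's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Signature:
    aberration: str  # named aberration the detector targets
    score: float     # detector-specific dimensionless score (assumed)
    evidence: str    # short human-readable justification


# A detector maps the raw analysis data to one signature.
Detector = Callable[[dict], Signature]


def run_detectors(analysis: dict, detectors: list[Detector]) -> list[Signature]:
    """Run every detector on the same analysis data, preserving detector order."""
    return [detect(analysis) for detect in detectors]


def render_evidence(signatures: list[Signature]) -> str:
    """Format all signatures as one text block for the downstream model."""
    return "\n".join(
        f"{s.aberration}: score {s.score:.2f} ({s.evidence})" for s in signatures
    )
```

The sketch deliberately leaves the signatures unsorted. Ranking by score alone would reintroduce the "biggest value wins" reading, whereas the downstream model is meant to weigh all signatures together.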
To make this concrete, I'll dive into the oblique spherical aberration detector. Textbooks define oblique spherical aberration as a fifth-order field-aperture cross-term: a spherical-shape aberration that grows with field angle. The challenge is that at off-axis field points, the wavefront error is a sum of many aberration modes stacked together (e.g., coma, astigmatism, field-grown spherical, and others). Therefore, an engineer can't read oblique spherical off any single number.
The detector exploits an orthogonality property of Zernike mode decomposition. At any field point, the wavefront is decomposed into a sum of orthogonal Zernike modes where Z5/Z6 carry astigmatism, Z7/Z8 carry coma, and Z9 carries spherical (this numbering matches the Fringe indexing convention). So Z9 cleanly isolates the primary spherical-shape component of the wavefront regardless of the other modes. The detector samples Z9 at the on-axis field point (the system's primary spherical baseline) and at the outermost field point, takes the difference, and normalizes it by the λ/14 RMS diffraction-limit floor (Strehl ≥ 0.8). With Z9 expressed in waves, the resulting signal |Z9,off − Z9,on| × 14 is dimensionless. A value of 1 means the field-grown Z9 alone is at the diffraction limit; 10 means an order of magnitude past it.
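The signal described above reduces to a few lines. This is a minimal sketch, assuming the Z9 coefficients are already extracted and expressed in waves; the function name is hypothetical.

```python
DIFFRACTION_LIMIT_WAVES = 1 / 14  # λ/14 floor, expressed in waves


def oblique_spherical_signal(z9_on_axis: float, z9_off_axis: float) -> float:
    """Field-grown Z9, in units of the diffraction-limit floor.

    Both inputs are Z9 coefficients in waves (an assumption of this sketch):
    one at the on-axis field point, one at the outermost field point.
    """
    return abs(z9_off_axis - z9_on_axis) / DIFFRACTION_LIMIT_WAVES
```

Dividing by λ/14 is the same as multiplying by 14. Because the on-axis value is subtracted rather than divided, a well-corrected on-axis design (Z9 near zero) produces a small, well-behaved signal instead of a blow-up.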
The naive approach to detecting off-axis spherical aberration is taking the ratio of off-axis to on-axis RMS wavefront error. It seems like a reasonable heuristic at first glance, but for oblique spherical, it has two failures. First, it's non-specific in that off-axis RMS growth could come from astigmatism, field curvature, or oblique SA, and the ratio doesn't distinguish them. Second, it's mathematically unstable because when the on-axis RMS is small, the ratio explodes. The Zernike isolation approach addresses both. This describes just one of the 14 detectors. The same naive-vs-robust battle played out for each of the others as well.
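The instability of the ratio is easy to reproduce with made-up numbers. The sketch below illustrates only the second failure (the ratio exploding), using two hypothetical designs with identical off-axis growth; the function names and values are illustrative assumptions, and the comparison uses plain RMS values rather than the Z9 isolation the detector relies on.

```python
def rms_ratio(off_axis_rms: float, on_axis_rms: float) -> float:
    """Naive heuristic: ratio of off-axis to on-axis RMS wavefront error."""
    return off_axis_rms / on_axis_rms


def floor_normalized_growth(off_axis_rms: float, on_axis_rms: float) -> float:
    """Growth measured as a difference, in units of the λ/14 floor (waves)."""
    return abs(off_axis_rms - on_axis_rms) * 14


# Two hypothetical designs with the same 0.05-wave growth off axis.
design_a = {"on": 0.05, "off": 0.10}      # mediocre on-axis correction
design_b = {"on": 0.0005, "off": 0.0505}  # excellent on-axis correction

for name, d in (("A", design_a), ("B", design_b)):
    print(
        name,
        f"ratio={rms_ratio(d['off'], d['on']):.1f}",
        f"growth={floor_normalized_growth(d['off'], d['on']):.2f}",
    )
```

The ratio differs by a factor of about fifty between the two designs even though the field-dependent growth is identical, while the difference-based measure reports the same value for both.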
Each detector isolates a different signature, each comes with its own design tradeoffs, and each was cross-checked against textbook commentary across the eval corpus. The months I spent building the OMM were a systematic iteration against the ground truth, often starting from what ended up being a naive approach for each detector and engineering it into a robust algorithm that fires correctly on real failures and stays quiet on corrected cases.
Optical Mental Model results
I added the OMM to Claude Opus 4.7 and ran it through the exact same eval pipeline as the baseline: same judge, same rubric, and same data.
The Optical Mental Model scores 100% in all 10 rounds on 39 of the 50 cases, with an aggregate accuracy of 97.02% across all 500 case-rounds. Of the 11 cases that aren't perfect, 6 average above 90% and miss on just 1-3 rounds, typically due to a wording slip or a close-call ranking judgment. The other 5 average between 70% and 90%; in these, multiple strong detector signals make the agent's ranking decision noisier, and individual rounds occasionally dip below 70%. Those dips come from residual LLM stochasticity still slipping through the OMM's scaffolding, which I'll address in future work.
How does the Optical Mental Model accelerate optical design?
The results presented here are a first measurement, not a final one. The OMM was built by iterating detectors against a hand-crafted textbook corpus. The next steps are to extend its coverage to commercial-grade lenses and broader system classes (reflective, catadioptric, folded, non-rotationally symmetric), and to add blinded grading by working optical engineers.
Still, until now, the best use cases for LLMs in optical design have been writing Zemax macros and API scripts and brushing up on textbook optics concepts. Asking an LLM for an aberration diagnosis, let alone design actions, has not been reliable enough to be worth the effort. The Optical Mental Model now offers a verdict reliable enough to use as a real first-pass analysis.
Practically, the OMM fits in at the moment when an optical engineer would normally sit down with the ray fans, OPD plots, MTF curves, or Seidel coefficients and figure out what's going wrong with their design. The OMM produces a diagnosis programmatically, grounded in the same reasoning a senior engineer would apply. It doesn't replace an optical engineer; rather, it adds a new tool to their toolbelt.
Conclusion and Future Outlook
The Optical Mental Model works because the deterministic analysis happens in code, leaving the LLM only the synthesis of that analysis. While I applied this pattern to aberration diagnosis, it should generalize: AI tools for other engineering domains with strong analytical backbones could benefit from a similar approach.
For optics, aberration identification is only the first piece of the puzzle. Next is the design step itself, where I envision the LLM translating its diagnosis into a targeted merit-function modification.