Defect Classification February 10, 2025 by Kenji Nakamura

Inline Defect Classification Below 14nm: CNN Architecture Choices and Accuracy Trade-offs

Choosing the right CNN architecture for inline defect classification at 14nm and below is not a theoretical question. It directly affects how quickly an engineer gets a reliable alert versus a noisy one, and whether the model survives a process node retrain without months of relabeling work. We've worked through this decision across several process nodes, and the trade-offs are real enough to be worth documenting carefully.

Why Node Matters for Architecture Selection

At 28nm and above, defect morphology is large enough that a moderately deep CNN with a simple backbone (even a ResNet-18 variant) can achieve acceptable classification accuracy across the main defect families: particle, scratch, bridging, pit, and pattern anomaly. The defect image patches arriving from a KLA Surfscan or CD-SEM tool are typically 128×128 pixels, and the distinguishing features are coarse enough to survive aggressive downsampling.

Below 14nm, the situation changes in two ways. First, the absolute feature size shrinks relative to the point spread function (PSF) of the review tool, meaning defect patches contain less high-frequency discriminating information. A bridging defect at 7nm and a pattern anomaly at the same node can look superficially similar in a 64×64 patch. Second, the defect taxonomy expands: EUV-specific stochastic defects, blank defect residues, and multi-patterning overlay artifacts don't have clean analogs in earlier node defect libraries, so a model trained at 28nm won't generalize without per-node retraining.

In our experience, the architecture needs to change along two axes as nodes shrink: patch resolution and feature pyramid depth. Shallow architectures miss multi-scale features that distinguish stochastic defects from pattern noise. We tested both EfficientNet-B2 and ResNet-50 variants on a 7nm defect dataset containing 8 defect families, and the accuracy gap between the two was around 3.4 percentage points overall. That gap concentrated almost entirely in the two low-occurrence defect families (blank residue and stochastic bridging), which require multi-scale context to identify correctly.

EfficientNet vs ResNet: What the Numbers Actually Show

EfficientNet's compound scaling approach — balancing depth, width, and resolution simultaneously — gives it an advantage at advanced nodes precisely because it avoids the common failure mode of depth-only scaling: deep networks that overfit to high-frequency texture at the expense of coarser spatial context. For defect classification tasks where the discriminating signal is sometimes a shape rather than a texture, this matters.

The practical trade-off looks like this across the two architectures on a 14nm KLARF-sourced dataset we benchmarked internally:

Metric	ResNet-50	EfficientNet-B2
Overall accuracy	94.1%	97.3%
Particle recall	97.8%	98.1%
Scratch recall	95.2%	96.4%
Stochastic bridging recall	81.3%	91.7%
Inference time (A30 GPU, per wafer map)	38ms	43ms

ResNet-50 is faster by about 5ms per wafer map on the same hardware. At 43ms per wafer, EfficientNet-B2 is still well inside the latency budget for inline classification — inspection tools at leading-edge nodes run a wafer cadence of roughly 45 to 90 seconds, so 43ms of inference time adds no measurable delay to the alert pipeline. The accuracy improvement on hard-to-classify defect families is worth the 5ms difference.

Patch Size, Input Resolution, and the KLARF Coordinate Problem

KLARF files don't embed defect images directly. They provide defect centroid coordinates in wafer-level XY space (in microns, relative to the wafer center), along with a size estimate and classification metadata. The defect image patches themselves live in a separate review image folder referenced by the KLARF record. Getting the right patch crop requires correctly resolving the KLARF coordinate frame to the review image pixel frame — and this is where a surprising number of pipelines get it wrong.

KLARF versions 1.8 and 2.0 differ in how they encode the WaferOrientation field and the SampleCenterLocation offset. A pipeline that reads v1.8 files correctly can silently misalign patch crops on v2.0 files if it doesn't check the version header. We've seen patch crops offset by 4 to 8 microns on real production files, which means the model is classifying the wrong region of the review image entirely.
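
One way to guard against the silent-misalignment case is to refuse any file whose version has no explicitly validated coordinate handling. The sketch below is illustrative only: it assumes the version string has already been parsed from the file header by upstream code, and the names `SUPPORTED_KLARF_VERSIONS`, `UnsupportedKlarfVersion`, and `select_coordinate_convention` are hypothetical, not part of any KLARF tooling.

```python
# Hypothetical set of KLARF versions whose coordinate conventions the
# pipeline has been validated against (an assumption for illustration).
SUPPORTED_KLARF_VERSIONS = {"1.8", "2.0"}


class UnsupportedKlarfVersion(ValueError):
    """Raised when a file's version has no validated coordinate handling."""


def select_coordinate_convention(file_version: str) -> str:
    """Return a key naming the coordinate convention to apply.

    `file_version` is assumed to be a string such as "1.8" that upstream
    code has already extracted from the file's version header.
    """
    version = file_version.strip()
    if version not in SUPPORTED_KLARF_VERSIONS:
        # Fail loudly: a silent fallback is what produces misaligned crops.
        raise UnsupportedKlarfVersion(
            f"KLARF version {version!r} has no validated coordinate handling"
        )
    return "klarf_v" + version.replace(".", "_")
```

The returned key would select a per-version coordinate resolver; the important design choice is that an unrecognized version stops the pipeline instead of falling through to a default.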

Once coordinates are resolved correctly, patch size selection is an empirical question. In our tests on 14nm and 7nm nodes, 128×128 pixel patches at the native review magnification consistently outperformed 64×64 crops: the wider context around the defect centroid captures peripheral texture and nearest-neighbor features that help the classifier distinguish edge-case morphologies. The increase in patch area also means larger input tensors and slightly longer inference time, but as noted above, the latency budget at production cadence is not the constraint.
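
As a rough illustration of the cropping step, the sketch below maps a defect offset in microns to a pixel position and computes a fixed-size crop window. It assumes a simple linear mapping with axes already aligned to the review image; the function names, the `um_per_px` scale parameter, and the choice to shift (not shrink) the window near image edges are all assumptions made for this example.

```python
from typing import Tuple


def um_to_px(offset_um: float, um_per_px: float, center_px: float) -> float:
    """Convert an offset from the image center (microns) to a pixel index.

    Assumes a linear scale and axes already aligned with the image.
    """
    return center_px + offset_um / um_per_px


def patch_bounds(cx_px: float, cy_px: float, patch: int,
                 width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a patch centered on the defect.

    Near an image edge the window is shifted inward so the patch always
    has the requested size.
    """
    if patch > width or patch > height:
        raise ValueError("patch is larger than the review image")
    half = patch // 2
    left = int(round(cx_px)) - half
    top = int(round(cy_px)) - half
    # Clamp the top-left corner so the full window fits inside the image.
    left = max(0, min(left, width - patch))
    top = max(0, min(top, height - patch))
    return left, top, left + patch, top + patch
```

For a 512×512 review image with the defect at the center, `patch_bounds(256, 256, 128, 512, 512)` gives the window `(192, 192, 320, 320)`.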

Per-Node Retraining: Frequency, Trigger, and Data Requirements

A model trained at 14nm will degrade when applied to 10nm or 7nm process runs, even on the same inspection tool family. This is not just because feature sizes change; it's because the defect taxonomy evolves. New process steps introduce new defect failure modes that don't exist in the 14nm training set. EUV introduction at 7nm added stochastic defect types that weren't represented in any prior labeled data.

In practice, we retrain per-node classifiers on a quarterly baseline cadence, with an out-of-cycle retrain triggered by any of three events: (1) a new process node is introduced on the line, (2) a major change in inspection tool configuration alters image characteristics (beam current, landing energy, aperture size), or (3) confidence scores on the currently deployed classifier drop below a configurable threshold, as seen in the yield engineers' review queue. The third trigger is the most important one in production, because drift is rarely announced. It shows up as a gradual accumulation of low-confidence calls that engineers are reviewing manually instead of relying on the model.

Training data requirements are smaller than many engineers expect. In our experience, 2,000 to 4,000 labeled patches per defect family per node are sufficient to reach production-ready accuracy on the main families. The long tail of rare defect types — defect families that account for less than 0.5% of occurrences — requires careful handling: either collect aggressively over a longer period or use a reject bin with manual engineer review rather than attempting low-data classification that will produce unreliable confidence scores.

Confidence Calibration and the Low-Confidence Bin

Raw CNN softmax outputs are not well-calibrated confidence scores. A model that outputs a 0.91 softmax probability for a class is not necessarily 91% likely to be correct on that sample. Temperature scaling is the standard fix: divide logits by a learned scalar before softmax to match empirical accuracy to reported confidence. This is a 30-minute post-training step that makes a material difference in how engineers use the model output.
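
To make the calibration step concrete, here is a minimal pure-Python sketch. It assumes you have logits and true labels from a held-out validation set, and it picks the temperature by a log-spaced grid search on negative log-likelihood; that search is a simplification chosen for readability, where a gradient-based optimizer is the more common choice. All function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
        total -= scaled[label] - log_sum
    return total / len(labels)


def fit_temperature(logit_rows: Sequence[Sequence[float]],
                    labels: Sequence[int],
                    lo: float = 0.05, hi: float = 20.0,
                    steps: int = 400) -> float:
    """Grid-search the temperature that minimizes validation NLL."""
    best_t, best_loss = 1.0, mean_nll(logit_rows, labels, 1.0)
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)  # log-spaced candidates
        loss = mean_nll(logit_rows, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t
```

At inference time the fitted value is simply passed to `softmax(logits, temperature)`. A fitted temperature above 1 indicates the raw model was overconfident; the predicted class never changes, only the reported confidence.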

Every production defect classifier needs an explicit low-confidence bin. Samples where the top-1 probability falls below a calibrated threshold (typically 0.70 to 0.75 in our deployments) should be routed to an engineer review queue rather than assigned a class label. Misclassifications below that threshold are not uniformly distributed — they concentrate on defect morphologies that are genuinely ambiguous, often at process node boundaries where new defect types have not yet accumulated enough labeled examples.
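
The routing rule itself is small. The sketch below assumes calibrated per-class probabilities as input and uses 0.70 as the default threshold, taken from the low end of the range quoted above; the `route_defect` name and the `"review_queue"` / `"auto"` destination strings are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def route_defect(probs: Sequence[float], class_names: Sequence[str],
                 threshold: float = 0.70) -> Tuple[str, Optional[str], float]:
    """Return (destination, label, top-1 probability) for one defect.

    Defects whose top-1 calibrated probability is below the threshold get
    no class label and are sent to the engineer review queue.
    """
    top = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[top]
    if confidence < threshold:
        return ("review_queue", None, confidence)
    return ("auto", class_names[top], confidence)
```

Returning `None` as the label for low-confidence calls, and not the top-1 class, keeps an unreliable guess from leaking into downstream statistics.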

A low-confidence bin is not a failure mode in the classification pipeline. It's the model correctly reporting its own uncertainty — which is exactly what you want before deciding whether to hold a lot.

The ratio of low-confidence calls to total classified defects is a useful monitoring signal. A sudden increase in that ratio often indicates either an equipment change that has shifted image characteristics, or the appearance of a genuinely new defect morphology that the current training set doesn't represent. We surface that ratio as a daily metric alongside the normal classification volume statistics, so engineers can notice drift before it propagates into incorrect batch-disposition decisions.
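
A simple version of that monitoring signal can be sketched as follows. The trailing seven-day baseline and the 1.5× jump factor are placeholder values chosen for illustration, not tuned recommendations, and the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def low_confidence_ratio(low_conf_count: int, total_count: int) -> float:
    """Fraction of classified defects that landed in the low-confidence bin."""
    if total_count == 0:
        return 0.0
    return low_conf_count / total_count


def ratio_jumped(history: Sequence[float], today: float,
                 factor: float = 1.5, min_days: int = 7) -> bool:
    """Flag a day whose ratio exceeds the trailing baseline by `factor`.

    `history` holds previous daily ratios, oldest first. With fewer than
    `min_days` of history there is no baseline, so nothing is flagged.
    """
    if len(history) < min_days:
        return False
    baseline = mean(history[-min_days:])
    return today > factor * baseline
```

A flag here does not say which of the two causes is at work; it only tells the engineer that the equipment log and the recent low-confidence patches are worth a look.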

Practical Notes for Production Deployment

Running the classifier on-premise on NVIDIA A30 hardware, or equivalent cards with at least 24GB VRAM, handles the inference load comfortably for a single fab line at 20,000 wafer starts per month. At 43ms per wafer map and a typical inline inspection sample rate of 5 to 15%, the actual GPU utilization at steady state is well under 20%. The same GPU handles spatial pattern engine workloads concurrently without queue buildup.

Model versioning is non-trivial in a production environment. The classifier container needs to carry a version tag tied to the training dataset hash, the process node, and the tool model it was validated on. When a new model version is deployed, the previous version should remain active for a 48-hour shadow period where both model versions run in parallel and their outputs are compared. Divergence above 5% on high-confidence calls flags a review before cutover.
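
The shadow-period comparison can be sketched as below. It assumes both model versions scored the same defects in the same order, and it treats a call as "high-confidence" only when both versions are at or above the confidence threshold; that definition, the 0.70 default, and the function names are assumptions for this example. The 5% cutover limit comes from the text above.

```python
from typing import Sequence, Tuple

Call = Tuple[str, float]  # (predicted label, calibrated confidence)


def shadow_divergence(current: Sequence[Call], candidate: Sequence[Call],
                      threshold: float = 0.70) -> float:
    """Fraction of high-confidence calls where the two versions disagree."""
    compared = 0
    disagreed = 0
    for (cur_label, cur_conf), (new_label, new_conf) in zip(current, candidate):
        # Only compare defects that both versions call with high confidence.
        if cur_conf >= threshold and new_conf >= threshold:
            compared += 1
            if cur_label != new_label:
                disagreed += 1
    return disagreed / compared if compared else 0.0


def needs_review(divergence: float, limit: float = 0.05) -> bool:
    """True when divergence exceeds the cutover limit."""
    return divergence > limit
```

Logging the disagreeing defects alongside the ratio makes the review faster, since the reviewer can see which defect families account for the divergence.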

The bottom line: for inline defect classification at 14nm and below, EfficientNet-B2 is the architecture we currently recommend for new deployments. It outperforms ResNet variants where it matters most — rare and morphologically ambiguous defect families — and the 5ms inference latency premium is not a practical constraint at production tool cadence. Per-node retraining is required, quarterly is the right cadence, and confidence calibration is not optional if you want engineers to trust the model output well enough to act on it.
