Technical Deep-Dive

Pareto Excursion Analysis: How Commonality Analysis Cuts Root-Cause Time by 2x

[Figure: Pareto chart with cumulative line showing root-cause reduction from commonality analysis]

Pareto analysis is the standard opening move in any yield excursion investigation. Rank the failure modes by frequency, draw the cumulative line, and identify the two or three categories that account for 80% of the yield loss. The approach works — as far as it goes. The problem is that it stops at "what failed" and leaves the harder question untouched: which process event, equipment, chamber, or operator action caused the failure, and how do you confirm that attribution before the next lot is affected?

Commonality analysis is the method that answers the second question. It works by correlating the population of failing units against the full set of process history dimensions simultaneously — equipment ID, chamber number, recipe version, lot position, wafer slot, operator shift, and process timestamp — then looking for dimensions where the failing population is statistically non-random. The non-random dimensions are your candidate root causes. The root-cause chain length — how many correlated dimensions you need to traverse before arriving at a single actionable explanation — is what determines how fast you get to a corrective action.

Building the Pareto: Frequency Is Not the Same as Impact

The first thing a Pareto chart does is sort failure modes by count. This is useful for prioritizing investigation resources, but it conflates two quantities that have different yield implications: excursion frequency and excursion impact per event.

Consider a fab running two concurrent excursion modes. Mode A is an edge-ring pattern on the metal-2 layer, occurring on approximately 12% of wafers processed through chamber 3 of a CVD tool. Mode B is a center-spot pattern on gate oxide, occurring on 3% of wafers but with a D0 kill ratio nearly four times higher due to the critical nature of gate dielectric defects. A frequency-only Pareto ranks Mode A far ahead, at four times the count. An impact-weighted Pareto, where each bar height is frequency multiplied by average yield loss per occurrence, puts the two almost level (3% × ~4 ≈ 12%), so Mode B moves from a minor bar to a co-leading one.

For excursion triage, the impact-weighted Pareto is the right starting point. For root-cause prioritization, however, you need both axes: the frequency-only view tells you which excursion type is most amenable to rapid commonality analysis (more samples means more statistical power), while the impact view tells you which one costs you the most per day you leave it uninvestigated.
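As a rough illustration of the two rankings, here is a minimal Python sketch. The mode names, frequencies, and per-event loss figures are hypothetical and are not taken from the example above; `ExcursionMode` and `pareto_rank` are illustrative names, not part of any particular yield system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExcursionMode:
    name: str
    frequency: float       # fraction of wafers showing the mode
    loss_per_event: float  # average yield loss per affected wafer


def pareto_rank(modes, weighted):
    """Return (name, score, cumulative_share) rows sorted by descending score.

    weighted=False ranks by frequency only; weighted=True ranks by
    frequency * loss_per_event (the impact-weighted view).
    """
    def score(mode):
        if weighted:
            return mode.frequency * mode.loss_per_event
        return mode.frequency

    ordered = sorted(modes, key=score, reverse=True)
    total = sum(score(m) for m in ordered)
    rows, running = [], 0.0
    for m in ordered:
        running += score(m)
        rows.append((m.name, score(m), running / total))
    return rows


# Hypothetical inputs, chosen only to show the two rankings diverging.
modes = [
    ExcursionMode("particle_edge", frequency=0.10, loss_per_event=0.02),
    ExcursionMode("oxide_pinhole", frequency=0.04, loss_per_event=0.15),
    ExcursionMode("litho_scum", frequency=0.06, loss_per_event=0.03),
]

by_count = pareto_rank(modes, weighted=False)
by_impact = pareto_rank(modes, weighted=True)
print([row[0] for row in by_count])
print([row[0] for row in by_impact])
```

Keeping both views side by side matches the point above: the count view shows where the sample size is, the weighted view shows where the yield loss is.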

Commonality Dimensions: The Full Set

Effective commonality analysis requires that every wafer in the failing population carry a complete provenance record. At minimum, this means:

  • Equipment ID and chamber number — for multi-chamber tools, chamber-level granularity is required; tool-level attribution will miss chamber-specific effects
  • Recipe name and version CRC — recipe name alone is insufficient; a CRC or parameter hash distinguishes between nominally identical recipes with parameter drift
  • Lot ID and wafer slot position — slot position (1-25 in a standard 300mm FOUP) enables FOUP-level and slot-level attribution when carrier contamination is a suspect
  • Process timestamp and operator shift — shift attribution detects operator-dependent process variation, including loading technique, maintenance completion state, and end-of-shift equipment behavior
  • Upstream process context — the chamber and recipe used at the immediately preceding critical layer, to enable cross-layer correlation

When all five dimensions are populated, the commonality engine can execute a multi-dimensional overlap analysis: for each dimension, compute the conditional probability of appearing in the failing population versus the background population, then rank dimensions by their Kullback-Leibler divergence from the background distribution. The highest-KL-divergence dimensions are the strongest commonality signals.
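The ranking step described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each wafer is a plain dict of provenance fields, the background population includes the failing wafers (so every failing value has nonzero background probability), and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def kl_divergence(fail_values, background_values):
    """KL(P_fail || P_background) in nats for one provenance dimension."""
    fail = Counter(fail_values)
    background = Counter(background_values)
    n_fail = sum(fail.values())
    n_bg = sum(background.values())
    kl = 0.0
    for value, count in fail.items():
        p = count / n_fail
        q = background[value] / n_bg
        if q == 0.0:
            raise ValueError(f"{value!r} is absent from the background population")
        kl += p * math.log(p / q)
    return kl


def rank_dimensions(failing, background):
    """Return [(dimension, kl_nats)] sorted from strongest to weakest signal."""
    scores = {
        dim: kl_divergence([w[dim] for w in failing], [w[dim] for w in background])
        for dim in failing[0].keys()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: 40 wafers spread evenly over four chambers and two shifts.
background = [
    {"chamber": f"C{i % 4 + 1}", "shift": "day" if (i // 4) % 2 == 0 else "night"}
    for i in range(40)
]
# Eight failing wafers, all from chamber C2, evenly split across shifts.
failing = [w for w in background if w["chamber"] == "C2"][:8]

ranking = rank_dimensions(failing, background)
print(ranking)
```

In this toy data the chamber dimension carries all of the signal and the shift dimension carries none, which is the separation the ranking is meant to expose.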

The Drill-Down Hierarchy and Root-Cause Chain Length

Commonality analysis rarely produces a single-step answer. More often, the first pass identifies a dimension — say, equipment ID — with elevated KL divergence. The second pass filters the data to only the wafers processed on that equipment and reruns the analysis on the remaining dimensions. This inner analysis might identify chamber 2 as the sub-root within that tool. The third pass, filtered to chamber 2, might identify a specific recipe version as the distinguishing factor.

This drill-down hierarchy (tool → chamber → recipe version) is the root-cause chain. Its length, three steps in this example, sets the minimum number of iteration cycles needed to reach a corrective action.
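One way to picture the drill-down is as a greedy loop: score the remaining dimensions, keep the strongest one, filter both populations to its dominant value, and repeat. The sketch below assumes wafers are dicts of provenance fields and that the background includes the failing wafers; the `min_kl` stopping threshold, the function names, and the data are all hypothetical.

```python
import math
from collections import Counter


def kl_nats(fail_values, background_values):
    """KL(P_fail || P_background) in nats; background must cover every failing value."""
    fail, bg = Counter(fail_values), Counter(background_values)
    n_fail, n_bg = len(fail_values), len(background_values)
    return sum(
        (c / n_fail) * math.log((c / n_fail) / (bg[v] / n_bg))
        for v, c in fail.items()
    )


def drill_down(failing, background, dimensions, min_kl=0.1):
    """Return the root-cause chain as [(dimension, value), ...]."""
    chain = []
    remaining = list(dimensions)
    while remaining and failing:
        scores = {
            d: kl_nats([w[d] for w in failing], [w[d] for w in background])
            for d in remaining
        }
        best = max(remaining, key=scores.get)
        if scores[best] < min_kl:  # nothing left that stands out from background
            break
        # Follow the value that dominates the failing population.
        value = Counter(w[best] for w in failing).most_common(1)[0][0]
        chain.append((best, value))
        failing = [w for w in failing if w[best] == value]
        background = [w for w in background if w[best] == value]
        remaining.remove(best)
    return chain


# Hypothetical fab: 3 tools x 2 chambers x 2 recipe versions, 10 wafers each.
background = [
    {"tool": t, "chamber": c, "recipe": r}
    for t in ("T1", "T2", "T3")
    for c in (1, 2)
    for r in ("v1", "v2")
    for _ in range(10)
]
# Ten failing wafers, all on T1 chamber 2, eight of them on recipe v2.
failing = (
    [{"tool": "T1", "chamber": 2, "recipe": "v2"}] * 8
    + [{"tool": "T1", "chamber": 2, "recipe": "v1"}] * 2
)

chain = drill_down(failing, background, ["tool", "chamber", "recipe"])
print(chain)
```

The length of the returned list is the chain length. A greedy pass like this only proposes candidates; each step still needs a physical explanation before it counts as a root cause.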

The practical question is: what shortens the chain? Three factors matter:

  1. Data provenance completeness — missing chamber ID collapses a three-step chain into a dead end after step one. Every dropped provenance dimension potentially adds false chain length by forcing manual investigation to recover the missing field.
  2. Background population size — the statistical power of the commonality test increases with the size of the non-failing control population. A 12-hour excursion window across three lots gives you perhaps 75 failing wafers and 600 control wafers; a 36-hour window gives you more power but more noise from confounded process changes.
  3. Correlated vs causal attribution — commonality analysis identifies correlation, not causation. A shift-level attribution might reflect an operator effect or it might reflect a tool maintenance event that happened to occur between shifts. The chain must terminate in a physically mechanistic explanation, not a statistical one.

A Scenario: Etch Yield Drop at a Logic Foundry

Consider a 28nm logic foundry running a polysilicon gate etch module with four process chambers. In early 2025, an inline inspection run after gate etch shows a D0 increase of 18 defects/cm² spread across 23 wafers in a two-day window — enough to push yield below the SPC lower control limit on two consecutive lots.

The Pareto on defect classification type shows the distribution is dominated by bridge defects (61%) and edge-shorting patterns (29%). Both defect types are spatially concentrated in the inner die field — not an edge ring. That spatial pattern rules out chamber O-ring contamination as the primary suspect.

Commonality analysis on equipment and chamber ID shows that all 23 failing wafers were processed on Chamber 2 of the quad tool, with zero failing wafers from Chambers 1, 3, or 4 during the same window. Against a background distribution that allocates roughly 25% of wafers to each chamber, the KL divergence for chamber ID is about 1.39 nats (ln 4), the largest value a four-way even background allows.

The second-pass analysis, filtered to Chamber 2, compares recipe version CRC. The 23 failing wafers split into two recipe version subgroups: 19 wafers carry CRC 0xA4F2 and 4 carry CRC 0xB301. The background Chamber 2 population shows a roughly 50/50 split between these two versions. The KL divergence for the recipe CRC dimension is about 0.23 nats, well below the chamber signal. A 19-to-4 split against an even background has a one-sided binomial tail probability of about 0.0013: statistically strong, not overwhelming.

Investigation reveals that recipe version 0xA4F2 on Chamber 2 had an etch time parameter 1.8 seconds longer than the nominal spec, introduced as part of a scheduled etch selectivity tuning run three days earlier. The tuning change was logged in the recipe history but the version CRC update was not propagated to the MES routing table — meaning the change was invisible to the lot disposition system. The commonality signal on recipe CRC identified the discrepancy; manual recipe comparison confirmed the parameter difference.

Root-cause chain length: three steps (tool → chamber → recipe version). Elapsed analysis time from excursion detection to chain terminus: approximately 45 minutes using automated commonality analysis on the collected SECS/GEM provenance data.

The 80/20 Rule and Its Limits

The classic Pareto principle — 80% of effects come from 20% of causes — is a reasonable heuristic for mature process nodes where the dominant loss modes are well-characterized. At process node transitions, early-stage fabs, or during introduction of new process modules, the distribution often looks more like 50/50 or even flatter, with many roughly equal-weight causes contributing to yield loss.

In flat distributions, the standard Pareto chart is less useful as a prioritization tool because there is no dominant bar to chase. The Pareto front — the set of non-dominated solutions when you trade off frequency against impact — becomes more meaningful: it identifies the cluster of excursion types that collectively offer the best yield recovery per unit of investigation effort, rather than singling out one dominant bar.
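To make the non-dominated set concrete, here is a minimal Python sketch. It treats each excursion type as a (frequency, impact) point and keeps every point that no other point matches or beats on both axes. The function name and the five sample modes are hypothetical.

```python
def pareto_front(modes):
    """Return the names of modes that are not dominated on (frequency, impact).

    modes maps a name to a (frequency, impact_per_event) tuple. A point is
    dominated when another point is at least as high on both axes and is
    not identical to it.
    """
    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    return sorted(
        name
        for name, point in modes.items()
        if not any(dominates(other, point) for other in modes.values())
    )


# Hypothetical flat distribution: no single mode leads on both axes.
modes = {
    "A": (0.09, 0.02),
    "B": (0.07, 0.05),
    "C": (0.04, 0.08),
    "D": (0.05, 0.04),  # beaten on both axes by B
    "E": (0.03, 0.03),  # beaten on both axes by B
}

front = pareto_front(modes)
print(front)
```

The modes left on the front are the cluster worth investigating first. Which of them to start with is still a judgment call about investigation effort, which this sketch does not model.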

We're not saying Pareto analysis is the wrong tool — it's the right first tool. We're saying it's incomplete on its own, and that the yield teams who close root-cause loops fastest are the ones who move immediately from the Pareto rank order into dimension-by-dimension commonality analysis, rather than stopping at the ranking and beginning a manual search.

Automating the Commonality Pass

The computational cost of a full multi-dimensional commonality analysis on a 24-hour excursion window (typically 100-300 failing wafers, 1,000-3,000 control wafers, 5-8 provenance dimensions) is low — well under one second on modern server hardware when the provenance table is properly indexed. The analysis can run automatically on every excursion event, delivering a ranked dimension list to the yield engineer's dashboard within seconds of the excursion being flagged.

The yield engineer's job then shifts from "which spreadsheet do I open first?" to "does this commonality attribution make physical sense?" That's a better use of a domain expert's time. The manual step remains essential — statistical attribution must always be confirmed by engineering judgment — but it operates on a pre-filtered candidate list rather than a blank slate.

In a SQL-based yield analytics schema, the commonality query structure looks something like this:

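-- Commonality: chamber attribution for excursion lot window
-- (PostgreSQL syntax; :window_start, :window_end, :target_layer are bind parameters)
SELECT
    prov.chamber_id,
    -- Count wafers, not defect rows: a wafer may log several defect events
    COUNT(DISTINCT de.wafer_id)                 AS n_fail,
    SUM(COUNT(DISTINCT de.wafer_id)) OVER ()    AS total_fail,
    background.n_background,
    -- A column alias cannot be referenced within the same SELECT list,
    -- so the window total is repeated here rather than using total_fail
    ROUND(COUNT(DISTINCT de.wafer_id)::numeric
          / SUM(COUNT(DISTINCT de.wafer_id)) OVER (), 4) AS p_fail,
    ROUND(background.n_background::numeric
          / background.total_background, 4)     AS p_background
FROM
    defect_events de
    JOIN wafer_provenance prov USING (wafer_id)
    JOIN (
        -- Background: every wafer processed in the window, per chamber
        SELECT chamber_id,
               COUNT(*) AS n_background,
               SUM(COUNT(*)) OVER () AS total_background
        FROM wafer_provenance
        WHERE process_timestamp BETWEEN :window_start AND :window_end
        GROUP BY chamber_id
    ) background USING (chamber_id)
WHERE
    de.excursion_flag = TRUE
    AND de.layer_id = :target_layer
    -- Keep the failing population inside the same window as the background
    AND prov.process_timestamp BETWEEN :window_start AND :window_end
GROUP BY prov.chamber_id, background.n_background,
         background.total_background
ORDER BY p_fail DESC;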

The query above gives the raw conditional distributions. The KL divergence across the distribution is then computed in the application layer. For tight excursion windows with fewer than 50 failing samples, Fisher's exact test is a more appropriate significance metric than KL divergence — KL requires sufficient sample size to be meaningful. The analytics system should switch between them automatically based on the failing population size.
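A minimal sketch of that switch in Python, using only the standard library. It scores a single value of a single dimension (for example, one chamber) and assumes the background counts include the failing wafers. The 50-wafer threshold comes from the text above; the function names, the one-sided form of Fisher's test, and the binary "this value versus everything else" form of KL are simplifying assumptions made for the sketch.

```python
import math

SMALL_SAMPLE_THRESHOLD = 50  # failing-wafer count below which Fisher's test is used


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: in the candidate group / not in it. Columns: failing / passing.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(
        math.comb(row1, k) * math.comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / math.comb(n, col1)


def binary_kl(p, q):
    """KL divergence in nats between Bernoulli(p) and Bernoulli(q)."""
    total = 0.0
    for pi, qi in ((p, q), (1.0 - p, 1.0 - q)):
        if pi > 0.0:
            total += pi * math.log(pi / qi)
    return total


def score_value(n_fail_in, n_fail_total, n_bg_in, n_bg_total):
    """Return (metric_name, value) for one candidate value of one dimension."""
    if n_fail_total < SMALL_SAMPLE_THRESHOLD:
        a = n_fail_in                      # failing, in group
        b = n_bg_in - n_fail_in            # passing, in group
        c = n_fail_total - n_fail_in       # failing, outside group
        d = (n_bg_total - n_bg_in) - c     # passing, outside group
        return ("fisher_p", fisher_exact_greater(a, b, c, d))
    return ("kl_nats", binary_kl(n_fail_in / n_fail_total, n_bg_in / n_bg_total))


# Hypothetical counts: a small excursion and a large one.
small = score_value(n_fail_in=23, n_fail_total=23, n_bg_in=250, n_bg_total=1000)
large = score_value(n_fail_in=120, n_fail_total=200, n_bg_in=500, n_bg_total=2000)
print(small, large)
```

The two metrics are not on the same scale (a small p-value is a strong signal, a large KL is a strong signal), so a dashboard that mixes them needs to label which one it is showing.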

The goal throughout is the same: arrive at the shortest defensible root-cause chain, fast enough to act before the next affected lot leaves the excursion module.
