TL;DR
- Bulk HTS classification turns a column of free-text product descriptions into a column of 10-digit HTS codes plus a confidence score per row. Done well, it lets a single broker run a 50,000-SKU catalog through tariff impact analysis in an afternoon.
- Three tooling tiers exist: (1) a cold LLM call (GPT, Claude), cheapest and worst, prone to hallucinated codes; (2) a retrieval-grounded search against the live HTSUS, which eliminates hallucinations but is weak on paraphrase recall; (3) a multi-layer classifier that combines CROSS rulings, lexical retrieval, and semantic embeddings, the strongest option but not always free. Workflow choice depends on catalog size, accuracy bar, and budget.
- Confidence scoring is the load-bearing part regardless of tool. Auto-accept top-1 candidates above 0.85, send rows scoring 0.60 to 0.85 to a broker for review, and escalate rows below 0.60 to the importer compliance team or to a CBP eRulings request at rulings.cbp.gov.
- Legal responsibility under 19 USC 1484 sits with the importer of record. Any classifier output, from any vendor, is a research aid. The audit trail (which CROSS ruling was cited, which heading text was applied, which GRI was determinative) is what makes the result defensible at liquidation.
Why bulk classification is hard
The HTSUS has roughly 19,000 active 10-digit subheadings. The legal text that defines each one is written in 1970s tariff prose. The General Rules of Interpretation (GRI 1 through 6) force a strict hierarchy: classify by heading text first, then by chapter and section notes, then by essential character (for composite goods), then by the heading appearing last in numerical order (the GRI 3(c) tiebreaker). A SKU description that says "stainless steel kitchen knife with plastic handle" could land at 8211.92.20.00 (Kitchen and butcher knives), at 8211.91.50.60 (Knives with rubber or plastic handles, other), or back at 8211.10.00.00 (Sets of assorted articles), depending on GRI sequencing.
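The GRI 3(c) tiebreaker at the end of that hierarchy is mechanical enough to sketch in code: among candidate headings that neither specificity (3(a)) nor essential character (3(b)) resolved, the one occurring last in numerical order prevails. The candidate list below reuses the knife codes above purely for illustration; in a real workflow 3(a) or 3(b) would usually resolve them first.

```javascript
// GRI 3(c): when goods remain classifiable under two or more headings after
// GRI 3(a) and 3(b) fail to resolve them, the heading occurring last in
// numerical order prevails.
function gri3cTiebreak(candidates) {
  // Compare on the digits of the dotted code, e.g. "8211.92.20.00".
  return [...candidates]
    .sort((a, b) => a.replaceAll(".", "").localeCompare(b.replaceAll(".", "")))
    .at(-1);
}

gri3cTiebreak(["8211.10.00.00", "8211.92.20.00", "8211.91.50.60"]);
// → "8211.92.20.00" (last in numerical order)
```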
A single broker classifying by hand can do maybe 200 to 400 SKUs a day, which is fine for a single shipment but breaks down for ecommerce catalogs at 10,000+ SKUs and for importers running a tariff impact analysis after a Section 232 expansion. Bulk classification is the workflow that clears that backlog.
Three failure modes the workflow has to solve
Hallucinations. A naive GPT or Claude call will return plausible-looking 10-digit codes that do not exist in the HTSUS. The standard fix is to ground the classifier in a live HTSUS index so every candidate has to exist by construction; failure mode shifts to "no confident match" rather than "wrong-looking match."
Composite goods. A polyester knit running short with a cotton waistband, or a wireless earbud with charging case (two separate articles), forces a GRI 3(b) essential-character call. The classifier needs to surface the call explicitly, not bury it in a single confidence number.
Drift. The HTSUS revises every January (and out-of-cycle when proclamations land, like the April 6, 2026 Section 232 consolidation). A classifier trained on last year's HTSUS will silently misroute the SKUs whose code changed. Pick a tool whose index pulls from hts.usitc.gov on a regular cadence and confirm staleness in writing.
Tooling options
Three tiers, ordered by accuracy. Pick the one that matches your catalog size and accuracy bar.
Tier 1: Cold LLM call (GPT, Claude)
Cheapest. Worst accuracy. Top-1 lands in the 60 to 75% range across published benchmarks, with hallucinated codes (codes that do not exist in the HTSUS at all) as the dominant failure mode. Use only if the catalog is small (under 200 SKUs) and a broker will hand-review every row anyway. Always cross-check every returned code against the live HTSUS at hts.usitc.gov before accepting.
Tier 2: Retrieval-grounded HTS search
A lexical (BM25-style) search over the HTSUS subheading definitions and chapter/section notes. Eliminates hallucinations because every candidate must exist in the subheading list. Strong on exact noun matches, weak on paraphrase recall. The Tandom catalog at tariffs.tandom.ai/hts-catalog surfaces this style of retrieval; the same engine sits behind most paid Classifier services as well. Adequate for ecommerce ops who can tolerate manual triage on borderline rows.
Tier 3: Multi-layer classifier (CROSS + lexical + semantic)
Composes three retrieval strategies: CBP's CROSS binding-ruling database (~250,000+ rulings), lexical retrieval over HTSUS text, and semantic embeddings. Cross-validated top-1 (a code that appears across multiple layers) is high-confidence; conflicts trigger triage. Highest accuracy (top-1 lifts above 90% on published apparel/electronics benchmarks). Several vendors operate in this tier; pick on integration fit, pricing, and how often the underlying HTSUS index refreshes.
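The cross-validation step can be sketched as a simple vote over the top-1 candidate from each layer. The layer names and two-of-three scoring scheme below are illustrative assumptions about how such an engine might combine its layers, not any vendor's actual API.

```javascript
// Cross-validate top-1 candidates from three retrieval layers: a code that
// wins two or more layers is treated as confident; full disagreement
// routes the row to triage.
function crossValidate({ cross, lexical, semantic }) {
  const votes = {};
  for (const code of [cross, lexical, semantic]) {
    if (code) votes[code] = (votes[code] ?? 0) + 1;
  }
  const [top, n] = Object.entries(votes).sort((a, b) => b[1] - a[1])[0] ?? [null, 0];
  if (n === 3) return { code: top, band: "high" };
  if (n === 2) return { code: top, band: "mid" };
  return { code: null, band: "triage" };
}

crossValidate({
  cross: "8508.11.00.00",
  lexical: "8508.11.00.00",
  semantic: "8508.11.00.00",
});
// → { code: "8508.11.00.00", band: "high" }
```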
Tandom's next product launch
Tandom HTS Classification is the next Tandom product to ship, currently in closed beta. It packages the three-layer engine described in Tier 3 (CROSS rulings + lexical + semantic embeddings) with a defensible audit trail per SKU.
Worked example, 10 SKUs
Ten product descriptions a typical ecommerce ops team would push through bulk classification. The script structure below works against any retrieval-grounded HTS search endpoint that returns ranked candidates per query. Every code in the result table was verified against the live HTSUS at hts.usitc.gov.
The Node.js script
```javascript
// classify.mjs
// Loops a list of SKU descriptions through an HTS search endpoint
// and reports the top candidate plus a confidence band. Adapt the
// ENDPOINT + response shape to whichever classifier you're using.
const ENDPOINT = "https://tariffs.tandom.ai/api/hts/search";

const skus = [
  { id: "SKU-001", desc: "polyester knit shorts mens athletic" },
  { id: "SKU-002", desc: "kitchen knife stainless steel chef" },
  { id: "SKU-003", desc: "bluetooth wireless earphones charging case" },
  { id: "SKU-004", desc: "nylon backpack laptop compartment" },
  { id: "SKU-005", desc: "plastic toothbrush manual adult" },
  { id: "SKU-006", desc: "passenger car tire 17 inch radial" },
  { id: "SKU-007", desc: "cordless vacuum cleaner handheld" },
  { id: "SKU-008", desc: "led flashlight aluminum body" },
  { id: "SKU-009", desc: "yoga mat synthetic foam" },
  { id: "SKU-010", desc: "polyester webbing strap industrial" },
];

async function classify(desc) {
  const url = `${ENDPOINT}?q=${encodeURIComponent(desc)}&limit=5`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.results || [];
}

function band(results) {
  if (results.length === 0) return { code: null, score: 0, band: "no-match" };
  // Cheap proxy: top-1 with a 10-digit (full statistical) code is high confidence.
  const top = results[0];
  const isFullCode = /^\d{4}\.\d{2}\.\d{2}\.\d{2}$/.test(top.htsno);
  if (!isFullCode) return { code: top.htsno, score: 0.55, band: "low" };
  if (results.length >= 3) return { code: top.htsno, score: 0.80, band: "mid" };
  return { code: top.htsno, score: 0.92, band: "high" };
}

(async () => {
  for (const sku of skus) {
    const results = await classify(sku.desc);
    const verdict = band(results);
    console.log(JSON.stringify({ ...sku, ...verdict }));
  }
})();
```

Output, after a Tier 2 search and broker review
The script above is the bulk-loop scaffold. The HTS codes in the table below are the codes a broker would settle on after seeing the search candidates. Lexical-only retrieval against multi-word phrases like "polyester knit shorts mens athletic" often returns no top-1 match because the legal HTSUS text uses formal phrasings; in those cases the broker reads the top-N candidates and picks. A multi-layer Tier 3 classifier closes the gap automatically by adding CROSS retrieval and semantic embeddings, but the broker review step never goes away entirely on borderline rows.
| SKU | Description | Top HTS | Band | Verdict |
|---|---|---|---|---|
| SKU-001 | polyester knit shorts mens athletic | 6103.43.15.40 (Men's shorts of synthetic fibers, knit) | high | Auto-accept |
| SKU-002 | kitchen knife stainless steel chef | 8211.92.20.00 (Kitchen and butcher knives) | high | Auto-accept |
| SKU-003 | bluetooth wireless earphones charging case | 8518.30.20.00 (Headphones, earphones, other) | low | Escalate, GRI 3(b) call on the case |
| SKU-004 | nylon backpack laptop compartment | 4202.92.31.20 (Backpacks of textile materials) | high | Auto-accept |
| SKU-005 | plastic toothbrush manual adult | 9603.21.00.00 (Toothbrushes, including dental-plate brushes) | high | Auto-accept |
| SKU-006 | passenger car tire 17 inch radial | 4011.10.10.50 (New pneumatic tires, motor cars, radial) | mid | Broker spot-check, AD/CVD adjacent |
| SKU-007 | cordless vacuum cleaner handheld | 8508.11.00.00 (Vacuum cleaners with self-contained electric motor) | high | Auto-accept |
| SKU-008 | led flashlight aluminum body | 8513.10.40.00 (Portable electric lamps, other) | high | Auto-accept |
| SKU-009 | yoga mat synthetic foam | 9506.91.00.30 (Articles for general physical exercise, other) | mid | Broker spot-check, possible heading 3926 alternative |
| SKU-010 | polyester webbing strap industrial | 6307.90.98.91 (Other made-up textile articles, other) | low | Escalate, end-use ambiguous |
The auto-accept rows flow straight into the catalog enrichment pipeline. The mid-confidence rows go to a broker for one-line confirmation against the heading text. The low-confidence rows (the wireless earbud charging-case combo here, where two products complicate GRI 3(b)) get a CBP CROSS search or a binding-ruling request through rulings.cbp.gov.
Triage low-confidence rows
Confidence-banding is what makes bulk classification trustable. Without it, a 5,000-SKU file produces a 5,000-row spreadsheet that nobody knows whether to trust. With it, the broker reviews maybe 200 rows.
The 0.85 / 0.60 / below-0.60 split
A working policy that holds up across most ecommerce catalogs:
- Above 0.85: auto-accept. Code flows into the catalog. Spot-audit 1 in 50 against the broker's hand-classified gold set. Investigate any chapter-level disagreement.
- 0.60 to 0.85: broker reviews. One-line spot-check against the HTSUS heading text. Most go through; the ones that don't get bumped to the next bucket.
- Below 0.60: escalate to importer compliance. For high-stakes lines (AD/CVD adjacent, novel product, dispute history), the importer requests a CBP binding ruling through rulings.cbp.gov. Binding rulings take 30 to 90 days but eliminate classification risk for the lifetime of that product.
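The three-bucket policy above reduces to a few lines of routing code. This is a direct transcription of the stated thresholds, with the boundary cases resolved as the bullet text reads them (exactly 0.85 goes to broker review, exactly 0.60 stays in broker review):

```javascript
// Route a scored classification to a review bucket per the
// 0.85 / 0.60 policy described above.
function route(score) {
  if (score > 0.85) return "auto-accept";
  if (score >= 0.60) return "broker-review";
  return "escalate";
}

route(0.92); // → "auto-accept"
route(0.72); // → "broker-review"
route(0.41); // → "escalate"
```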
Calibrate against your own gold set
The 0.85 / 0.60 thresholds are starting points, not universals. The right thresholds depend on the catalog. A single-product-family catalog (apparel only, electronics only) calibrates much tighter; a generalist marketplace stays wider. Hand-classify 100 to 200 SKUs from the catalog, run the classifier on those same SKUs, and pick thresholds where auto-accepted rows are 98%+ correct against the gold set.
What to do with conflicts across layers
When CROSS retrieval points to chapter 84 and lexical retrieval points to chapter 39 for the same SKU, the CROSS layer ordinarily wins because it's grounded in a CBP determination. The exception is when the CROSS ruling is older than 5 years and the HTSUS subheading text has changed since (an annual revision or out-of-cycle proclamation). In that case, treat the CROSS ruling as advisory and request a fresh binding ruling.
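That precedence rule is easy to encode. The sketch below applies only the age test; in practice you would also confirm the subheading text actually changed since the ruling before demoting it. The ruling object shape is an assumption, not the CROSS API.

```javascript
// Prefer the CROSS-backed code unless the ruling is older than five years,
// in which case treat it as advisory and fall back to the lexical hit.
function resolveConflict(crossHit, lexicalHit, now = new Date()) {
  if (!crossHit) return { code: lexicalHit.code, basis: "lexical" };
  const ageYears = (now - new Date(crossHit.ruledAt)) / (365.25 * 24 * 3600 * 1000);
  if (ageYears > 5) {
    return {
      code: lexicalHit.code,
      basis: "lexical",
      note: "CROSS ruling stale; request a fresh binding ruling",
    };
  }
  return { code: crossHit.code, basis: "cross" };
}

resolveConflict(
  { code: "8421.39.01.15", ruledAt: "2015-06-01" }, // hypothetical stale ruling
  { code: "3926.90.99.89" },
  new Date("2026-05-01")
);
// → basis "lexical", with a note to request a fresh ruling
```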
Common pitfalls
The mistakes that cost the most time when running bulk classification at scale.
Trusting a top-1 GPT call
A cold GPT-4 or Claude call against a SKU description returns a plausible-looking code in well-formed dotted notation. The code might not exist in the HTSUS at all. The fix: every candidate code from any model must be cross-checked against the live HTSUS index at hts.usitc.gov (or any retrieval-grounded classifier) before it's accepted.
Ignoring chapter and section notes
Chapter and section notes carve specific products into or out of headings that the heading text alone would not predict. Chapter 90 Note 1, for instance, excludes certain goods of base metal unfit for any use other than mounting in instruments. A SKU that lexically reads as "optical instrument" can be redirected out of chapter 90 by one of these notes. Pick a classifier whose index includes chapter and section notes; a thin subheading-text-only match misses these calls.
Composite goods and GRI 3(b)
A leather wallet with steel zipper, a polyester running short with cotton waistband, a wireless earbud with charging case: GRI 3(b) requires the importer to identify the component that gives the article its essential character. Lexical retrieval tends to land on whichever component the description mentions first. Always confirm composite-goods classifications by reading the General Explanatory Notes (a public GRI 3(b) reference) before auto-accepting.
Set classification (GRI 3(b) sets)
When the SKU is a true set (knife block with 12 different knives, kitchen utensil set with 8 pieces), the classification might pivot to a single heading covering the set rather than summing the components. Subheading 8211.10.00.00 ("Sets of assorted articles") with the rate of duty applicable to the set's highest-rated component is the classic example. Bulk-classification scripts that assume one SKU equals one single-product code miss these.
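A rough keyword heuristic can pre-flag likely sets so they route to review instead of single-article classification. The keyword list below is an illustrative heuristic, not a legal test for GRI 3(b) sets:

```javascript
// Heuristic: descriptions mentioning sets, kits, assortments, or piece
// counts probably need a GRI 3(b) set determination, not a single code.
const SET_HINTS = /\b(set|kit|assort\w*|\d+[- ]?(pieces?|pcs?|pack))\b/i;

function looksLikeSet(desc) {
  return SET_HINTS.test(desc);
}

looksLikeSet("kitchen knife block set 12 pieces");  // → true
looksLikeSet("kitchen knife stainless steel chef"); // → false
```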
Stale classifications after an HTSUS revision
The HTSUS revises every January. Out-of-cycle revisions (like the April 6, 2026 Section 232 consolidation that re-routed derivative codes from 9903.80 / 9903.81 / 9903.85 to the new 9903.82 family) silently invalidate prior classifications. A monthly bulk re-classification pass against an HTSUS index that pulls from hts.usitc.gov on a daily cadence catches drift before it shows up as a Notice of Action from CBP.
Translation errors on non-English descriptions
Catalogs sourced from a global SKU master often carry mixed languages. The classifier is English-only as of May 2026. Detect non-ASCII characters in the description column before sending; translate first, then classify, and flag any row where the translation introduced ambiguity (often the case with technical product specs).
Confidence inflation on short descriptions
A two-word description ("running shorts") will score confidently against 6103.43.15.40 even when the actual product is woven (heading 6203, not 6103). Lexical retrieval matches word-for-word and does not know the difference between knit and woven. Enforce a minimum description length (5 to 8 tokens) before auto-accepting.
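Both this pitfall and the translation pitfall above reduce to pre-flight checks on the description column before anything hits the classifier. The non-ASCII test is a crude language detector; a real pipeline would use proper language detection.

```javascript
// Pre-flight gate: hold back rows that are likely non-English (crude
// non-ASCII check) or too short to classify confidently.
function preflight(desc, minTokens = 5) {
  const issues = [];
  if (/[^\x00-\x7F]/.test(desc)) issues.push("non-ascii: translate first");
  if (desc.trim().split(/\s+/).length < minTokens) issues.push("too short: enrich description");
  return issues;
}

preflight("running shorts");
// → ["too short: enrich description"]
preflight("pantalón corto de poliéster para correr");
// → ["non-ascii: translate first"]
```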
AD/CVD-adjacent codes
Some HTS codes are advisory-flagged in active AD/CVD orders (Steel Threaded Rod 7318.15.x, Steel Nails 7317.00.55, certain aluminum extrusions 7610.x, hardwood plywood 4412.x). A confident classification at one of these codes does not mean the product is in scope (HTS in AD/CVD orders is advisory; the product-scope language controls), but it does mean the broker must run a scope check before assessing duty. ITA's ACCESS at access.trade.gov and CBP's AD/CVD search tool both surface the active order list per HTS heading.
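A prefix watchlist is enough to route these rows to a mandatory scope check. The prefixes below mirror the examples in the text and are illustrative; the active order list in ACCESS is the real source of truth.

```javascript
// Flag codes whose prefix sits near an active AD/CVD order so the broker
// runs a scope check before duty is assessed. Illustrative watchlist only.
const ADCVD_PREFIXES = ["7318.15", "7317.00.55", "7610", "4412"];

function needsScopeCheck(code) {
  return ADCVD_PREFIXES.some(p => code.startsWith(p));
}

needsScopeCheck("7317.00.55.03"); // → true  (steel nails order adjacent)
needsScopeCheck("8508.11.00.00"); // → false
```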
Treating the API output as legally binding
Under 19 USC 1484, the importer of record bears legal responsibility for classification. API output is a research aid. Document the classification rationale per SKU (which CROSS ruling was referenced, which heading text was applied, which GRI was determinative) so a CBP audit trail exists.
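The rationale can be captured as one structured record per SKU at classification time rather than reconstructed later. The field names below are a suggestion, not a CBP-mandated format:

```javascript
// One audit-trail record per SKU: which ruling was cited, which heading
// text was applied, which GRI was determinative, and which tool produced it.
function auditRecord({ sku, code, crossRuling, headingText, gri, classifier }) {
  return {
    sku,
    code,
    crossRuling: crossRuling ?? null, // e.g. a CROSS ruling number, if one was cited
    headingText,                      // the HTSUS text actually applied
    gri,                              // e.g. "GRI 1" or "GRI 3(b)"
    classifier,                       // tool name + index version used
    classifiedAt: new Date().toISOString(),
  };
}
```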
Glossary
- HTS / HTSUS
- Harmonized Tariff Schedule of the United States. 10-digit codes; first 6 digits are international, last 4 are US-specific statistical breakouts. Authoritative source: hts.usitc.gov.
- GRI
- General Rules of Interpretation, six legal rules that govern HTSUS classification. GRI 1 (heading text), GRI 2 (parts and unfinished articles), GRI 3 (mixtures, composite goods, sets), GRI 4 (most-akin), GRI 5 (cases and packing), GRI 6 (subheadings).
- CROSS
- Customs Rulings Online Search System. CBP's database of roughly 250,000+ binding classification rulings. Each ruling assigns a specific product to a specific HTS code and is authoritative for substantially similar products.
- Binding ruling
- A written CBP determination that a specific product is classified at a specific HTS code (or qualifies for a specific origin treatment, valuation method, or marking). Requested at rulings.cbp.gov; takes 30 to 90 days.
- Essential character (GRI 3(b))
- For composite goods, the test for which component imparts the essential character to the article. Drives the classification heading.
- Multi-layer classifier
- A classifier that composes multiple retrieval strategies (typically CROSS binding-ruling retrieval, lexical retrieval over HTSUS text, and semantic embeddings) and re-ranks against a confidence score. Most paid HTS classifiers fall into this tier.
- Top-1 accuracy
- Fraction of test SKUs where the classifier's highest-scored candidate matches the broker's hand-classified gold-standard code. Standard benchmark for classification quality.
- Hallucinated code
- A model-generated HTS code that does not exist in the active HTSUS. The standard failure mode for cold LLM calls; absent by construction in retrieval-grounded systems.
- Confidence band
- A discretized confidence score (high, mid, low) used to route classifications to auto-accept, broker review, or importer escalation buckets.
- Lexical retrieval
- BM25-style full-text search. Strong on exact-noun matches, weak on paraphrase. The simplest retrieval-grounded approach and the default behind most public HTS search endpoints.
- Semantic embedding
- A neural representation of a description's meaning as a vector. Two paraphrased descriptions of the same product produce nearby vectors. Robust to paraphrase, weaker on generic short descriptions.
- Notice of Action (CBP Form 29)
- CBP-issued correction notice when an entered HTS code is disputed during liquidation. Triggers protest rights under 19 USC 1514 (180-day filing window).
FAQ
High-intent questions ecommerce ops, brokers, and compliance teams ask most often.