TL;DR
- Bulk HTS classification turns a column of free-text product descriptions into a column of 10-digit HTS codes plus a confidence score per row. Done well, it lets a single broker run a 50,000-SKU catalog through tariff impact analysis in an afternoon.
- Three tooling tiers exist: (1) a cold LLM call (GPT, Claude), cheapest and worst, prone to hallucinated codes; (2) a retrieval-grounded search against the live HTSUS, which eliminates hallucinations but is weak on paraphrase recall; (3) a multi-layer classifier that combines CROSS rulings, lexical retrieval, and semantic embeddings, the strongest option but not always free. Workflow choice depends on catalog size, accuracy bar, and budget.
- Confidence scoring is the load-bearing part regardless of tool. Auto-accept top-1 candidates above 0.85, send rows scoring 0.60 to 0.85 to a broker for review, and escalate rows below 0.60 to the importer compliance team or to a CBP eRulings request at rulings.cbp.gov.
- Legal responsibility under 19 USC 1484 sits with the importer of record. Any classifier output, from any vendor, is a research aid. The audit trail (which CROSS ruling was cited, which heading text was applied, which GRI was determinative) is what makes the result defensible at liquidation.
Why bulk classification is hard
The HTSUS has roughly 19,000 active 10-digit subheadings. The legal text that defines each one is written in 1970s tariff prose. The General Rules of Interpretation (GRI 1 through 6) force a strict hierarchy: classify by heading text first, then by chapter and section notes, then by essential character (for composite goods), then by the heading appearing last in numerical order (the GRI 3(c) tiebreaker). A SKU description that says "stainless steel kitchen knife with plastic handle" could land at 8211.92.20.00 (Kitchen and butcher knives), at 8211.91.50.60 (Knives with rubber or plastic handles, other), or back at 8211.10.00.00 (Sets of assorted articles), depending on GRI sequencing.
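The GRI 3(c) tiebreaker at the end of that hierarchy is mechanical enough to sketch in code: among candidate headings that neither specificity (3(a)) nor essential character (3(b)) resolved, the one occurring last in numerical order prevails. The candidate list below reuses the knife codes above purely for illustration; in a real workflow 3(a) or 3(b) would usually resolve them first.

```javascript
// GRI 3(c): when goods remain classifiable under two or more headings after
// GRI 3(a) and 3(b) fail to resolve them, the heading occurring last in
// numerical order prevails.
function gri3cTiebreak(candidates) {
  // Compare on the digits of the dotted code, e.g. "8211.92.20.00".
  return [...candidates]
    .sort((a, b) => a.replaceAll(".", "").localeCompare(b.replaceAll(".", "")))
    .at(-1);
}

gri3cTiebreak(["8211.10.00.00", "8211.92.20.00", "8211.91.50.60"]);
// → "8211.92.20.00" (last in numerical order)
```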
A single broker classifying by hand can do maybe 200 to 400 SKUs a day, which is fine for a single shipment but breaks down for ecommerce catalogs at 10,000+ SKUs and for importers running a tariff impact analysis after a Section 232 expansion. Bulk classification is the workflow that clears that backlog.
Three failure modes the workflow has to solve
Hallucinations. A naive GPT or Claude call will return plausible-looking 10-digit codes that do not exist in the HTSUS. The standard fix is to ground the classifier in a live HTSUS index so every candidate has to exist by construction; failure mode shifts to "no confident match" rather than "wrong-looking match."
Composite goods. A polyester knit running short with a cotton waistband, or a wireless earbud with charging case (two separate articles), forces a GRI 3(b) essential-character call. The classifier needs to surface the call explicitly, not bury it in a single confidence number.
Drift. The HTSUS revises every January (and out-of-cycle when proclamations land, like the April 6, 2026 Section 232 consolidation). A classifier trained on last year's HTSUS will silently misroute the SKUs whose code changed. Pick a tool whose index pulls from hts.usitc.gov on a regular cadence and confirm staleness in writing.
Tooling options
Three tiers, ordered by accuracy. Pick the one that matches your catalog size and accuracy bar.
Tier 1: Cold LLM call (GPT, Claude)
Cheapest. Worst accuracy. Top-1 lands in the 60 to 75% range across published benchmarks, with hallucinated codes (codes that do not exist in the HTSUS at all) as the dominant failure mode. Use only if the catalog is small (under 200 SKUs) and a broker will hand-review every row anyway. Always cross-check every returned code against the live HTSUS at hts.usitc.gov before accepting.
Tier 2: Retrieval-grounded HTS search
A lexical (BM25-style) search over the HTSUS subheading definitions and chapter/section notes. Eliminates hallucinations because every candidate must exist in the subheading list. Strong on exact noun matches, weak on paraphrase recall. The Tandom catalog at tariffs.tandom.ai/hts-catalog surfaces this style of retrieval; the same engine sits behind most paid Classifier services as well. Adequate for ecommerce ops who can tolerate manual triage on borderline rows.
Tier 3: Multi-layer classifier (CROSS + lexical + semantic)
Composes three retrieval strategies: CBP's CROSS binding-ruling database (~250,000+ rulings), lexical retrieval over HTSUS text, and semantic embeddings. Cross-validated top-1 (a code that appears across multiple layers) is high-confidence; conflicts trigger triage. Highest accuracy (top-1 lifts above 90% on published apparel/electronics benchmarks). Several vendors operate in this tier; pick on integration fit, pricing, and how often the underlying HTSUS index refreshes.
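The cross-validation step can be sketched as a simple vote over the top-1 candidate from each layer. The layer names and two-of-three scoring scheme below are illustrative assumptions about how such an engine might combine its layers, not any vendor's actual API.

```javascript
// Cross-validate top-1 candidates from three retrieval layers: a code that
// wins two or more layers is treated as confident; full disagreement
// routes the row to triage.
function crossValidate({ cross, lexical, semantic }) {
  const votes = {};
  for (const code of [cross, lexical, semantic]) {
    if (code) votes[code] = (votes[code] ?? 0) + 1;
  }
  const [top, n] = Object.entries(votes).sort((a, b) => b[1] - a[1])[0] ?? [null, 0];
  if (n === 3) return { code: top, band: "high" };
  if (n === 2) return { code: top, band: "mid" };
  return { code: null, band: "triage" };
}

crossValidate({
  cross: "8508.11.00.00",
  lexical: "8508.11.00.00",
  semantic: "8508.11.00.00",
});
// → { code: "8508.11.00.00", band: "high" }
```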
Tandom's next product launch
Tandom HTS Classification is the next Tandom product to ship, currently in closed beta. It packages the three-layer engine described in Tier 3 (CROSS rulings + lexical + semantic embeddings) with a defensible audit trail per SKU.
Worked example, 10 SKUs
Ten product descriptions a typical ecommerce ops team would push through bulk classification. The script structure below works against any retrieval-grounded HTS search endpoint that returns ranked candidates per query. Every code in the result table was verified against the live HTSUS at hts.usitc.gov.
The Node.js script
```javascript
// classify.mjs
// Loops a list of SKU descriptions through an HTS search endpoint
// and reports the top candidate plus a confidence band. Adapt the
// ENDPOINT + response shape to whichever classifier you're using.
const ENDPOINT = "https://tariffs.tandom.ai/api/hts/search";

const skus = [
  { id: "SKU-001", desc: "polyester knit shorts mens athletic" },
  { id: "SKU-002", desc: "kitchen knife stainless steel chef" },
  { id: "SKU-003", desc: "bluetooth wireless earphones charging case" },
  { id: "SKU-004", desc: "nylon backpack laptop compartment" },
  { id: "SKU-005", desc: "plastic toothbrush manual adult" },
  { id: "SKU-006", desc: "passenger car tire 17 inch radial" },
  { id: "SKU-007", desc: "cordless vacuum cleaner handheld" },
  { id: "SKU-008", desc: "led flashlight aluminum body" },
  { id: "SKU-009", desc: "yoga mat synthetic foam" },
  { id: "SKU-010", desc: "polyester webbing strap industrial" },
];

async function classify(desc) {
  const url = `${ENDPOINT}?q=${encodeURIComponent(desc)}&limit=5`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.results || [];
}

function band(results) {
  if (results.length === 0) return { code: null, score: 0, band: "no-match" };
  // Cheap proxy: top-1 with a 10-digit (full statistical) code is high confidence.
  const top = results[0];
  const isFullCode = /^\d{4}\.\d{2}\.\d{2}\.\d{2}$/.test(top.htsno);
  if (!isFullCode) return { code: top.htsno, score: 0.55, band: "low" };
  if (results.length >= 3) return { code: top.htsno, score: 0.80, band: "mid" };
  return { code: top.htsno, score: 0.92, band: "high" };
}

(async () => {
  for (const sku of skus) {
    const results = await classify(sku.desc);
    const verdict = band(results);
    console.log(JSON.stringify({ ...sku, ...verdict }));
  }
})();
```

Output, after a Tier 2 search and broker review
The script above is the bulk-loop scaffold. The HTS codes in the table below are the codes a broker would settle on after seeing the search candidates. Lexical-only retrieval against multi-word phrases like "polyester knit shorts mens athletic" often returns no top-1 match because the legal HTSUS text uses formal phrasings; in those cases the broker reads the top-N candidates and picks. A multi-layer Tier 3 classifier closes the gap automatically by adding CROSS retrieval and semantic embeddings, but the broker review step never goes away entirely on borderline rows.
| SKU | Description | Top HTS | Band | Verdict |
|---|---|---|---|---|
| SKU-001 | polyester knit shorts mens athletic | 6103.43.15.40 (Men's shorts of synthetic fibers, knit) | high | Auto-accept |
| SKU-002 | kitchen knife stainless steel chef | 8211.92.20.00 (Kitchen and butcher knives) | high | Auto-accept |
| SKU-003 | bluetooth wireless earphones charging case | 8518.30.20.00 (Headphones, earphones, other) | low | Escalate, GRI 3(b) call on the case |
| SKU-004 | nylon backpack laptop compartment | 4202.92.31.20 (Backpacks of textile materials) | high | Auto-accept |
| SKU-005 | plastic toothbrush manual adult | 9603.21.00.00 (Toothbrushes, including dental-plate brushes) | high | Auto-accept |
| SKU-006 | passenger car tire 17 inch radial | 4011.10.10.50 (New pneumatic tires, motor cars, radial) | mid | Broker spot-check, AD/CVD adjacent |
| SKU-007 | cordless vacuum cleaner handheld | 8508.11.00.00 (Vacuum cleaners with self-contained electric motor) | high | Auto-accept |
| SKU-008 | led flashlight aluminum body | 8513.10.40.00 (Portable electric lamps, other) | high | Auto-accept |
| SKU-009 | yoga mat synthetic foam | 9506.91.00.30 (Articles for general physical exercise, other) | mid | Broker spot-check, possible heading 3926 alternative |
| SKU-010 | polyester webbing strap industrial | 6307.90.98.91 (Other made-up textile articles, other) | low | Escalate, end-use ambiguous |
The auto-accept rows flow straight into the catalog enrichment pipeline. The mid-confidence rows go to a broker for one-line confirmation against the heading text. The low-confidence rows (the wireless earbud charging-case combo here, where two products complicate GRI 3(b)) get a CBP CROSS search or a binding-ruling request through rulings.cbp.gov.
Triage low-confidence rows
Confidence-banding is what makes bulk classification trustable. Without it, a 5,000-SKU file produces a 5,000-row spreadsheet that nobody knows whether to trust. With it, the broker reviews maybe 200 rows.
The 0.85 / 0.60 / below-0.60 split
A working policy that holds up across most ecommerce catalogs:
- Above 0.85: auto-accept. Code flows into the catalog. Spot-audit 1 in 50 against the broker's hand-classified gold set. Investigate any chapter-level disagreement.
- 0.60 to 0.85: broker reviews. One-line spot-check against the HTSUS heading text. Most go through; the ones that don't get bumped to the next bucket.
- Below 0.60: escalate to importer compliance. For high-stakes lines (AD/CVD adjacent, novel product, dispute history), the importer requests a CBP binding ruling through rulings.cbp.gov. Binding rulings take 30 to 90 days but eliminate classification risk for the lifetime of that product.
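The three-bucket policy above reduces to a few lines of routing code. This is a direct transcription of the stated thresholds, with the boundary cases resolved as the bullet text reads them (exactly 0.85 goes to broker review, exactly 0.60 stays in broker review):

```javascript
// Route a scored classification to a review bucket per the
// 0.85 / 0.60 policy described above.
function route(score) {
  if (score > 0.85) return "auto-accept";
  if (score >= 0.60) return "broker-review";
  return "escalate";
}

route(0.92); // → "auto-accept"
route(0.72); // → "broker-review"
route(0.41); // → "escalate"
```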
Calibrate against your own gold set
The 0.85 / 0.60 thresholds are starting points, not universals. The right thresholds depend on the catalog. A single-product-family catalog (apparel only, electronics only) calibrates much tighter; a generalist marketplace stays wider. Hand-classify 100 to 200 SKUs from the catalog, run the classifier on those same SKUs, and pick thresholds where auto-accepted rows are 98%+ correct against the gold set.
What to do with conflicts across layers
When CROSS retrieval points to chapter 84 and lexical retrieval points to chapter 39 for the same SKU, the CROSS layer ordinarily wins because it's grounded in a CBP determination. The exception is when the CROSS ruling is older than 5 years and the HTSUS subheading text has changed since (an annual revision or out-of-cycle proclamation). In that case, treat the CROSS ruling as advisory and request a fresh binding ruling.
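That precedence rule is easy to encode. The sketch below applies only the age test; in practice you would also confirm the subheading text actually changed since the ruling before demoting it. The ruling object shape is an assumption, not the CROSS API.

```javascript
// Prefer the CROSS-backed code unless the ruling is older than five years,
// in which case treat it as advisory and fall back to the lexical hit.
function resolveConflict(crossHit, lexicalHit, now = new Date()) {
  if (!crossHit) return { code: lexicalHit.code, basis: "lexical" };
  const ageYears = (now - new Date(crossHit.ruledAt)) / (365.25 * 24 * 3600 * 1000);
  if (ageYears > 5) {
    return {
      code: lexicalHit.code,
      basis: "lexical",
      note: "CROSS ruling stale; request a fresh binding ruling",
    };
  }
  return { code: crossHit.code, basis: "cross" };
}

resolveConflict(
  { code: "8421.39.01.15", ruledAt: "2015-06-01" }, // hypothetical stale ruling
  { code: "3926.90.99.89" },
  new Date("2026-05-01")
);
// → basis "lexical", with a note to request a fresh ruling
```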
Common pitfalls
The mistakes that cost the most time when running bulk classification at scale.
Trusting a top-1 GPT call
A cold GPT-4 or Claude call against a SKU description returns a plausible-looking code in well-formed dotted notation. The code might not exist in the HTSUS at all. The fix: every candidate code from any model must be cross-checked against the live HTSUS index at hts.usitc.gov (or any retrieval-grounded classifier) before it's accepted.
Ignoring chapter and section notes
Chapter and section notes carve specific products into or out of headings that the heading text alone would not predict. Chapter 90 Note 1, for instance, excludes certain goods of base metal unfit for any use other than mounting in instruments. A SKU that lexically reads as "optical instrument" can be redirected out of chapter 90 by one of these notes. Pick a classifier whose index includes chapter and section notes; a thin subheading-text-only match misses these calls.
Composite goods and GRI 3(b)
A leather wallet with steel zipper, a polyester running short with cotton waistband, a wireless earbud with charging case: GRI 3(b) requires the importer to identify the component that gives the article its essential character. Lexical retrieval tends to land on whichever component the description mentions first. Always confirm composite-goods classifications by reading the General Explanatory Notes (a public GRI 3(b) reference) before auto-accepting.
Set classification (GRI 3(b) sets)
When the SKU is a true set (knife block with 12 different knives, kitchen utensil set with 8 pieces), the classification might pivot to a single heading covering the set rather than summing the components. Subheading 8211.10.00.00 ("Sets of assorted articles") with the rate of duty applicable to the set's highest-rated component is the classic example. Bulk-classification scripts that assume one SKU equals one single-product code miss these.
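A rough keyword heuristic can pre-flag likely sets so they route to review instead of single-article classification. The keyword list below is an illustrative heuristic, not a legal test for GRI 3(b) sets:

```javascript
// Heuristic: descriptions mentioning sets, kits, assortments, or piece
// counts probably need a GRI 3(b) set determination, not a single code.
const SET_HINTS = /\b(set|kit|assort\w*|\d+[- ]?(pieces?|pcs?|pack))\b/i;

function looksLikeSet(desc) {
  return SET_HINTS.test(desc);
}

looksLikeSet("kitchen knife block set 12 pieces");  // → true
looksLikeSet("kitchen knife stainless steel chef"); // → false
```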
Stale classifications after an HTSUS revision
The HTSUS revises every January. Out-of-cycle revisions (like the April 6, 2026 Section 232 consolidation that re-routed derivative codes from 9903.80 / 9903.81 / 9903.85 to the new 9903.82 family) silently invalidate prior classifications. A monthly bulk re-classification pass against an HTSUS index that pulls from hts.usitc.gov on a daily cadence catches drift before it shows up as a Notice of Action from CBP.
Translation errors on non-English descriptions
Catalogs sourced from a global SKU master often carry mixed languages. The classifier is English-only as of May 2026. Detect non-ASCII characters in the description column before sending; translate first, then classify, and flag any row where the translation introduced ambiguity (often the case with technical product specs).
Confidence inflation on short descriptions
A two-word description ("running shorts") will score confidently against 6103.43.15.40 even when the actual product is woven (heading 6203, not 6103). Lexical retrieval matches word-for-word and does not know the difference between knit and woven. Enforce a minimum description length (5 to 8 tokens) before auto-accepting.
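Both this pitfall and the translation pitfall above reduce to pre-flight checks on the description column before anything hits the classifier. The non-ASCII test is a crude language detector; a real pipeline would use proper language detection.

```javascript
// Pre-flight gate: hold back rows that are likely non-English (crude
// non-ASCII check) or too short to classify confidently.
function preflight(desc, minTokens = 5) {
  const issues = [];
  if (/[^\x00-\x7F]/.test(desc)) issues.push("non-ascii: translate first");
  if (desc.trim().split(/\s+/).length < minTokens) issues.push("too short: enrich description");
  return issues;
}

preflight("running shorts");
// → ["too short: enrich description"]
preflight("pantalón corto de poliéster para correr");
// → ["non-ascii: translate first"]
```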
AD/CVD-adjacent codes
Some HTS codes are advisory-flagged in active AD/CVD orders (Steel Threaded Rod 7318.15.x, Steel Nails 7317.00.55, certain aluminum extrusions 7610.x, hardwood plywood 4412.x). A confident classification at one of these codes does not mean the product is in scope (HTS in AD/CVD orders is advisory; the product-scope language controls), but it does mean the broker must run a scope check before assessing duty. ITA's ACCESS at access.trade.gov and CBP's AD/CVD search tool both surface the active order list per HTS heading.
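A prefix watchlist is enough to route these rows to a mandatory scope check. The prefixes below mirror the examples in the text and are illustrative; the active order list in ACCESS is the real source of truth.

```javascript
// Flag codes whose prefix sits near an active AD/CVD order so the broker
// runs a scope check before duty is assessed. Illustrative watchlist only.
const ADCVD_PREFIXES = ["7318.15", "7317.00.55", "7610", "4412"];

function needsScopeCheck(code) {
  return ADCVD_PREFIXES.some(p => code.startsWith(p));
}

needsScopeCheck("7317.00.55.03"); // → true  (steel nails order adjacent)
needsScopeCheck("8508.11.00.00"); // → false
```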
Treating the API output as legally binding
Under 19 USC 1484, the importer of record bears legal responsibility for classification. API output is a research aid. Document the classification rationale per SKU (which CROSS ruling was referenced, which heading text was applied, which GRI was determinative) so a CBP audit trail exists.
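The rationale can be captured as one structured record per SKU at classification time rather than reconstructed later. The field names below are a suggestion, not a CBP-mandated format:

```javascript
// One audit-trail record per SKU: which ruling was cited, which heading
// text was applied, which GRI was determinative, and which tool produced it.
function auditRecord({ sku, code, crossRuling, headingText, gri, classifier }) {
  return {
    sku,
    code,
    crossRuling: crossRuling ?? null, // e.g. a CROSS ruling number, if one was cited
    headingText,                      // the HTSUS text actually applied
    gri,                              // e.g. "GRI 1" or "GRI 3(b)"
    classifier,                       // tool name + index version used
    classifiedAt: new Date().toISOString(),
  };
}
```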
Glossary
- HTS / HTSUS
- Harmonized Tariff Schedule of the United States. 10-digit codes; first 6 digits are international, last 4 are US-specific statistical breakouts. Authoritative source: hts.usitc.gov.
- GRI
- General Rules of Interpretation, six legal rules that govern HTSUS classification. GRI 1 (heading text), GRI 2 (parts and unfinished articles), GRI 3 (mixtures, composite goods, sets), GRI 4 (most-akin), GRI 5 (cases and packing), GRI 6 (subheadings).
- CROSS
- Customs Rulings Online Search System. CBP's database of roughly 250,000+ binding classification rulings. Each ruling assigns a specific product to a specific HTS code and is authoritative for substantially similar products.
- Binding ruling
- A written CBP determination that a specific product is classified at a specific HTS code (or qualifies for a specific origin treatment, valuation method, or marking). Requested at rulings.cbp.gov; takes 30 to 90 days.
- Essential character (GRI 3(b))
- For composite goods, the test for which component imparts the essential character to the article. Drives the classification heading.
- Multi-layer classifier
- A classifier that composes multiple retrieval strategies (typically CROSS binding-ruling retrieval, lexical retrieval over HTSUS text, and semantic embeddings) and re-ranks against a confidence score. Most paid HTS classifiers fall into this tier.
- Top-1 accuracy
- Fraction of test SKUs where the classifier's highest-scored candidate matches the broker's hand-classified gold-standard code. Standard benchmark for classification quality.
- Hallucinated code
- A model-generated HTS code that does not exist in the active HTSUS. The standard failure mode for cold LLM calls; absent by construction in retrieval-grounded systems.
- Confidence band
- A discretized confidence score (high, mid, low) used to route classifications to auto-accept, broker review, or importer escalation buckets.
- Lexical retrieval
- BM25-style full-text search. Strong on exact-noun matches, weak on paraphrase. The simplest retrieval-grounded approach and the default behind most public HTS search endpoints.
- Semantic embedding
- A neural representation of a description's meaning as a vector. Two paraphrased descriptions of the same product produce nearby vectors. Robust to paraphrase, weaker on generic short descriptions.
- Notice of Action (CBP Form 29)
- CBP-issued correction notice when an entered HTS code is disputed during liquidation. Triggers protest rights under 19 USC 1514 (180-day filing window).
FAQ
High-intent questions ecommerce ops, brokers, and compliance teams ask most often.