Extracting Data from Files with AI

Use Trustana's AI to extract product data directly from your own PDFs — spec sheets, industrial catalogues, price lists. This guide covers when PDF extraction is the right tool and how to get the best results from it.

For the step-by-step on starting a task, see How to Enrich Products in Trustana. For preparing and uploading your files, see Uploading and Managing Files.


When PDF extraction is the right tool

PDF extraction is built for suppliers and brands whose product data lives in documents rather than databases. It turns an 8-hour manual copy-paste job into a 30-minute automated task — with 95%+ accuracy on well-structured files.

It works best for:

  • Industrial B2B catalogues — large PDF files that are your only source of product specifications

  • Single-product spec sheets — one product per page, with a clear layout of attributes and values. This is the sweet spot for highest accuracy.

  • Multi-product tables — pages listing several products in rows or columns, single or multi-brand

  • Retail catalogues — multi-brand documents where each product has an identifiable SKU or model number

If your product data is already available on the web, Data Enrichment with external sourcing is usually faster. PDF extraction shines when your documents are the authoritative or only source.


What a good PDF looks like

File structure has a significant impact on extraction accuracy.

High accuracy:

  • Clean tables with clear column headers

  • Attribute-value pairs clearly associated — either in a table column or as vertical label/value pairs on a spec sheet

  • One product per page, or a clean multi-product table with one row per product

  • The product's SKU or Product Model visible on every page that contains data you want extracted

Lower accuracy:

  • Dense marketing prose with no clear attribute structure

  • Heavily merged or nested table cells

  • Product data spread across multiple pages with the identifier only on the first page

  • Many PDF pages merged into one very tall page

  • Poor quality scanned documents with low resolution or distorted text

Tip: If your file covers one product at a time but each product's data spans multiple pages, turn on Include Additional Context when starting the extraction task. This sends the surrounding pages to the AI alongside the matched page.
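To picture what "surrounding pages" means, here is a minimal sketch in Python. It is not Trustana's implementation: the function name pages_with_context and the window of one page on each side are assumptions made purely for illustration, and the product may select context pages differently.

```python
def pages_with_context(matched: list[int], total_pages: int, window: int = 1) -> list[int]:
    """Expand matched page numbers (1-based) with their neighbouring pages.

    `window` is a hypothetical setting: how many pages before and after
    each matched page are included as extra context.
    """
    selected: set[int] = set()
    for page in matched:
        first = max(1, page - window)            # do not go before page 1
        last = min(total_pages, page + window)   # do not go past the last page
        selected.update(range(first, last + 1))
    return sorted(selected)


# Identifier found on pages 1 and 5 of a 6-page file.
print(pages_with_context([1, 5], total_pages=6))  # [1, 2, 4, 5, 6]
```

In this illustration, page 3 is still left out because it is not adjacent to any matched page.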

Note: PDFs in multiple languages are accepted.


How matching works

Trustana matches your products to pages in the file using the SKU or Product Model as an identifier. Having both is ideal — at minimum, one must be present and filled correctly.

Matching is page-level: the AI looks at pages where the identifier appears and extracts data only from those pages. Pages that do not contain the identifier are skipped, even if they contain relevant product data.

This means:

  • The identifier must appear on every page that contains attributes you want extracted — not just the cover page

  • The identifier in Trustana must match exactly how it appears in the PDF — including format and spacing

  • Copying a product name into the Product Model field does not work — the field must contain a real model number that appears in the file

Identifier matching runs fresh for every task — it does not rely on any stored associations from previous runs.
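The page-level rule above can be checked before you start a task. The sketch below is a pre-flight check you could run yourself, not Trustana's matching logic. It assumes you have already extracted the text of each page into a list of strings (page_texts), and it uses a plain exact substring test, which is an assumption about what "match exactly" means.

```python
def pages_matching(identifier: str, page_texts: list[str]) -> list[int]:
    """Return the 1-based page numbers whose text contains the identifier.

    The comparison is an exact, case-sensitive substring test, so a
    difference in spacing or punctuation means no match.
    """
    if not identifier:
        return []
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if identifier in text
    ]


# Hypothetical three-page file; page 2 carries data but no identifier.
page_texts = [
    "Catalogue 2024 - Model AB-100 overview",
    "Technical data: voltage 230 V, weight 4 kg",
    "Model AB-100 dimensions: 40 x 30 x 20 cm",
]

print(pages_matching("AB-100", page_texts))  # [1, 3]
print(pages_matching("AB 100", page_texts))  # []
```

Page 2 is missing from the first result even though it holds product data, which mirrors the skipped-page behaviour described above. The second call returns nothing because the spacing differs from the PDF.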


Attribute-name alignment

The AI extracts data most accurately when your attribute names in Trustana closely match the column headers or labels in the PDF. Exact matches are best; close matches (e.g. "Size" vs "Dimensions") also work.

Two things to check before running extraction:

  • Each attribute you want to populate must have AI-Enabled turned on in your attribute settings

  • Attribute names should reflect the terminology your PDFs use — not generic internal labels

Note: The AI may paraphrase rather than copy values verbatim, and it maps PDF labels to your attribute names. A value listed under "Colour" in the PDF might be written as "Blue" to your "Color" attribute. That is expected behaviour, not an error.
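A quick way to review name alignment is to compare your attribute names with the headers in the PDF before running extraction. The sketch below is only a string-level check under stated assumptions: attribute_names and pdf_headers are lists you compile yourself, and the function name is hypothetical. It does not reproduce how the AI judges close matches.

```python
def unmatched_attributes(attribute_names: list[str], pdf_headers: list[str]) -> list[str]:
    """Return attribute names with no identical PDF header.

    Names are compared after trimming whitespace and ignoring case.
    """
    normalised_headers = {header.strip().casefold() for header in pdf_headers}
    return [
        name
        for name in attribute_names
        if name.strip().casefold() not in normalised_headers
    ]


# Hypothetical attribute names and PDF column headers.
attribute_names = ["Weight", "Size", "Voltage"]
pdf_headers = ["Model", "weight", "Dimensions", "Voltage "]

print(unmatched_attributes(attribute_names, pdf_headers))  # ['Size']
```

A flagged name is a prompt to take a second look, not an error. As noted above, a close match such as "Size" against "Dimensions" can still work.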


Common pitfalls

Symptom | Likely cause
Product excluded from task | Missing SKU or Product Model, or no matching record found in any ingested file
No data extracted for a product | Identifier does not appear on the pages that contain the data
Wrong or incomplete data extracted | Attribute names diverge significantly from PDF column headers, or data spans pages without a repeated identifier
File not available for extraction | File ingestion not yet complete; wait for the email notification before starting a task
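The first pitfall, a product with neither identifier filled in, is easy to catch before starting a task. The sketch below assumes a hypothetical export of your products as dictionaries with "name", "sku" and "product_model" keys. These key names are illustrative and are not Trustana's field names or API.

```python
def products_missing_identifier(products: list[dict]) -> list[str]:
    """Return the names of products with neither a SKU nor a Product Model.

    Empty strings, whitespace-only values and missing keys all count as blank.
    """
    def is_blank(value) -> bool:
        return not (value or "").strip()

    return [
        product["name"]
        for product in products
        if is_blank(product.get("sku")) and is_blank(product.get("product_model"))
    ]


# Hypothetical product records.
products = [
    {"name": "Pump A", "sku": "P-001", "product_model": ""},
    {"name": "Pump B", "sku": "", "product_model": None},
    {"name": "Valve C", "sku": " ", "product_model": "VC-20"},
]

print(products_missing_identifier(products))  # ['Pump B']
```

Only "Pump B" is flagged, because the other two products each have one identifier filled in.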

