Use Trustana's AI to extract product data directly from your own PDFs — spec sheets, industrial catalogues, price lists. This guide covers when PDF extraction is the right tool and how to get the best results from it.
For the step-by-step on starting a task, see How to Enrich Products in Trustana. For preparing and uploading your files, see Uploading and Managing Files.
When PDF extraction is the right tool
PDF extraction is built for suppliers and brands whose product data lives in documents rather than databases. It turns an 8-hour manual copy-paste job into a 30-minute automated task — with 95%+ accuracy on well-structured files.
It works best for:
Industrial B2B catalogues — large PDF files that are your only source of product specifications
Single-product spec sheets — one product per page, with a clear layout of attributes and values. This is the sweet spot for highest accuracy.
Multi-product tables — pages listing several products in rows or columns, single or multi-brand
Retail catalogues — multi-brand documents where each product has an identifiable SKU or model number
If your product data is already available on the web, Data Enrichment with external sourcing is usually faster. PDF extraction shines when your documents are the authoritative or only source.
What a good PDF looks like
File structure has a significant impact on extraction accuracy.
High accuracy:
Clean tables with clear column headers
Attribute-value pairs clearly associated — either in a table column or as vertical label/value pairs on a spec sheet
One product per page, or a clean multi-product table with one row per product
The product's SKU or Product Model visible on every page that contains data you want extracted
Lower accuracy:
Dense marketing prose with no clear attribute structure
Heavily merged or nested table cells
Product data spread across multiple pages with the identifier only on the first page
Many PDF pages merged into one very tall page
Poor-quality scanned documents with low resolution or distorted text
Tip: If each page in your file covers a single product and that product's data continues across several pages, turn on Include Additional Context when starting the extraction task. This sends surrounding pages to the AI alongside the matched page.
Note: PDFs in multiple languages are accepted.
How matching works
Trustana matches your products to pages in the file using the SKU or Product Model as an identifier. Having both is ideal; at minimum, one must be present and filled in correctly.
Matching is page-level: the AI looks at pages where the identifier appears and extracts data only from those pages. Pages that do not contain the identifier are skipped, even if they contain relevant product data.
This means:
The identifier must appear on every page that contains attributes you want extracted — not just the cover page
The identifier in Trustana must match exactly how it appears in the PDF — including format and spacing
Copying a product name into the Product Model field does not work — the field must contain a real model number that appears in the file
Identifier matching runs fresh for every task — it does not rely on any stored associations from previous runs.
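The page-level rule can be pictured with a short Python sketch. This is an illustration only, not Trustana's implementation: it assumes each page has already been reduced to plain text and that matching is a verbatim substring check. The function name find_matching_pages and the sample identifier AB-1200 are hypothetical.

```python
def find_matching_pages(pages, identifiers):
    """Return the indices of pages whose text contains any identifier verbatim.

    Assumption: `pages` is a list of plain-text strings, one per PDF page.
    """
    # Ignore empty SKU / Product Model fields so they never match anything.
    wanted = [ident for ident in identifiers if ident]
    return [
        index
        for index, text in enumerate(pages)
        if any(ident in text for ident in wanted)
    ]


pages = [
    "Cover page. Model AB-1200 product family overview",
    "Technical data: voltage 230 V, weight 4.2 kg",  # no identifier on this page
    "AB-1200 dimensions: 300 x 200 x 150 mm",
]

print(find_matching_pages(pages, ["AB-1200"]))  # [0, 2]
print(find_matching_pages(pages, ["AB 1200"]))  # [] because the spacing differs
```

The second page in the sample holds real product data but is skipped because the identifier is absent from it. The second call returns nothing because "AB 1200" and "AB-1200" differ in format.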
Attribute-name alignment
The AI extracts data most accurately when your attribute names in Trustana closely match the column headers or labels in the PDF. Exact matches are best; close matches (e.g. "Size" vs "Dimensions") also work.
Two things to check before running extraction:
Each attribute you want to populate must have AI-Enabled turned on in your attribute settings
Attribute names should reflect the terminology your PDFs use — not generic internal labels
Note: The AI may paraphrase rather than copy values verbatim. A value listed under "Colour" in the PDF might be written to your "Color" attribute as "Blue" even if the wording in the PDF differs slightly. That is expected behaviour, not an error.
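Before running extraction, it can help to compare your attribute names with the headers you see in the PDF. The sketch below is a hypothetical pre-run check, not a Trustana feature: it assumes you have typed both lists by hand, and the function name unmatched_attributes is made up for illustration. It flags only names with no exact match (ignoring case and extra spaces), so a flagged name is a candidate for review rather than a guaranteed failure, since close matches can still work.

```python
def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def unmatched_attributes(attribute_names, pdf_headers):
    """Return attribute names that have no exact counterpart among the PDF headers."""
    headers = {normalise(header) for header in pdf_headers}
    return [name for name in attribute_names if normalise(name) not in headers]


# Hypothetical sample data, typed by hand from a spec sheet.
attributes = ["Weight", "Colour", "Item Size"]
headers_in_pdf = ["weight", " Colour", "Dimensions (mm)"]

print(unmatched_attributes(attributes, headers_in_pdf))  # ['Item Size']
```

Here "Item Size" is flagged because the PDF uses "Dimensions (mm)"; renaming the attribute to reflect the PDF's terminology would bring the two closer.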
Common pitfalls
Symptom | Likely cause |
Product excluded from task | Missing SKU or Product Model, or no matching record found in any ingested file |
No data extracted for a product | Identifier does not appear on the pages that contain the data |
Wrong or incomplete data extracted | Attribute names diverge significantly from PDF column headers; or data spans pages without a repeated identifier |
File not available for extraction | File ingestion not yet complete — wait for the email notification before starting a task |
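Two of these causes can be caught before a task is started: a product with neither identifier, and a Product Model field that merely repeats the product name. The following is a minimal sketch under stated assumptions. It treats product records as plain dictionaries, and the keys name, sku and product_model as well as the function preflight are hypothetical, not Trustana field names or API calls.

```python
def preflight(products):
    """Map each problematic product name to a short description of the issue."""
    problems = {}
    for product in products:
        # Treat missing keys, None and whitespace-only values as empty.
        name = (product.get("name") or "").strip()
        sku = (product.get("sku") or "").strip()
        model = (product.get("product_model") or "").strip()

        if not sku and not model:
            problems[name] = "missing SKU and Product Model"
        elif model and model.lower() == name.lower():
            problems[name] = "Product Model is a copy of the product name"
    return problems


# Hypothetical sample records.
products = [
    {"name": "Ball valve DN50", "sku": "BV-050", "product_model": ""},
    {"name": "Gate valve DN80", "sku": "", "product_model": ""},
    {"name": "Check valve DN25", "sku": "", "product_model": "Check valve DN25"},
]

for name, issue in preflight(products).items():
    print(f"{name}: {issue}")
```

The first record passes because it has a SKU; the other two are reported with the reason they would likely be excluded or fail to match.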
What's next
