
Review-Trust Pipeline: How We Make Reviews Reliable
Reliable review analysis requires transparency. At Collected.reviews, we use our own method: the Review-Trust Pipeline. It filters out noise, detects manipulation, and assesses reviews for reliability so that every theme score truly means something. Below, you can read how it works – with concrete figures.
Dataset
For this measurement, we used the dataset EU Retail Reviews v1.3, containing a total of 182,450 reviews (of which 169,732 were unique after deduplication). The period covers 1 January to 30 September 2025, with data from the Netherlands, Germany, Belgium and Austria, in the languages NL, DE and EN. The analysis was carried out using pipeline version 2.4.0.
Why This Is Necessary
Not all reviews are equally valuable. We identify three structural issues:
- Manipulation – spikes in short periods, copied texts, or reward campaigns.
- Noise – incomplete sentences, duplicate submissions, non-experiential opinions.
- Bias – mostly extreme experiences are shared, or platforms moderate selectively.
To correct such distortion, we assess each review across six signals.
The Five Steps of Our Pipeline
1. Intake and Normalisation – All reviews are converted into a uniform schema (text, date, star rating, metadata), and exact duplicates are removed (see the sketch below).
2. Identity and Behaviour – We analyse account age, posting frequency, device patterns and timing clusters (where the source allows).
3. Text Signals – We look for semantic repetition, template phrases, and extreme sentiment without supporting detail.
4. Incentive Detection – Language indicating a benefit (discount, cashback, gift card) results in the label “incentivised”.
5. Weighting and Normalisation – Each review receives a trust score (0–1). Theme scores are weighted and time-corrected (recent reviews count more than old ones).
Important: We never delete anything arbitrarily; we evaluate it. Transparency over censorship.
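To make the intake step concrete, here is a minimal Python sketch of the uniform schema and the exact-duplicate filter. The schema fields (text, date, stars, metadata) follow the step above; the class and function names and the deduplication key are our own illustrative assumptions, not the production code.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Review:
    # Uniform schema from the intake step (field names are illustrative).
    text: str
    date: datetime
    stars: int
    metadata: dict = field(default_factory=dict)
    trust: float = 1.0          # filled in later by the weighting step (0-1)
    incentivised: bool = False  # filled in later by incentive detection

def normalise(raw: dict) -> Review:
    """Map a raw source record onto the uniform schema."""
    return Review(
        text=raw["text"].strip(),
        date=datetime.fromisoformat(raw["date"]),
        stars=int(raw["stars"]),
        metadata=raw.get("metadata", {}),
    )

def drop_exact_duplicates(reviews: list[Review]) -> list[Review]:
    """Remove byte-identical texts; near-duplicates are handled later by the text signals."""
    seen: set[str] = set()
    unique: list[Review] = []
    for r in reviews:
        key = r.text.lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique
```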
Key Signals and Thresholds
| Signal | Threshold | Effect |
| --- | --- | --- |
| Duplicate / near-duplicate | ≥ 0.88 semantic overlap | lower trust |
| Timing spike | peak within 12 hours vs baseline | lower weighting |
| Incentive language | word list + context | label “incentivised” |
| Template phrases | repetition score > 0.75 | lower trust |
| Lack of detail | extreme sentiment without facts | lower trust |
| Account signals | young account + high output | lower trust |
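For illustration, a minimal sketch of how the near-duplicate check could be applied, using the 0.88 threshold from the table. The similarity measure itself (cosine similarity over word bigrams) is a simplified stand-in of our own; the pipeline’s actual semantic-overlap model is not specified here.

```python
import math
from collections import Counter

DUP_THRESHOLD = 0.88  # ">= 0.88 semantic overlap" from the table above

def bigrams(text: str) -> Counter:
    # Word bigrams as a crude stand-in for a semantic representation (assumption).
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def overlap(a: str, b: str) -> float:
    """Cosine similarity between bigram count vectors."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def is_near_duplicate(a: str, b: str) -> bool:
    return overlap(a, b) >= DUP_THRESHOLD
```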
Weighting Model
Each component receives a weight; the formula in short:
trust = 1 − (0.35·D + 0.20·S + 0.20·I + 0.10·T + 0.10·P + 0.05·A)

| Component | Symbol | Weight |
| --- | --- | --- |
| Duplicate / near-duplicate | D | 0.35 |
| Timing spike | S | 0.20 |
| Incentive language | I | 0.20 |
| Template phrases | T | 0.10 |
| Lack of detail | P | 0.10 |
| Account signals | A | 0.05 |
| Time decay | λ | 0.015 |
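In code, the weighting reads as follows. The weights and λ come from the table above; two assumptions are ours: each component signal is scaled to 0–1, and the time correction is applied as an exponential decay exp(−λ · age in days), since the article only states λ = 0.015 and that recent reviews count more than old ones.

```python
import math

WEIGHTS = {"D": 0.35, "S": 0.20, "I": 0.20, "T": 0.10, "P": 0.10, "A": 0.05}
LAMBDA = 0.015  # time-decay constant from the table

def trust_score(signals: dict[str, float]) -> float:
    """trust = 1 - (0.35*D + 0.20*S + 0.20*I + 0.10*T + 0.10*P + 0.05*A).

    Each signal is assumed to be scaled to [0, 1]; missing signals count as 0.
    """
    penalty = sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
    return max(0.0, 1.0 - penalty)

def time_weight(age_days: float) -> float:
    """Exponential decay so recent reviews weigh more than old ones.

    The exponential form exp(-lambda * age) is our assumption; the article
    only gives lambda = 0.015.
    """
    return math.exp(-LAMBDA * age_days)

# Example: a full near-duplicate (D = 1.0) with some incentive language (I = 0.5)
print(round(trust_score({"D": 1.0, "I": 0.5}), 2))  # 1 - (0.35 + 0.10) = 0.55
print(round(time_weight(90), 3))                     # ~0.259 after 90 days
```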
Mini Results (Q1–Q3 2025)
| Metric | Value |
| --- | --- |
| Share of near-duplicates | 6.8% |
| Share of incentivised reviews | 12.4% |
| Median trust score | 0.73 |
| Average theme score correction | +4.6 points |
| Detected spike events | 89 |
This correction yields more representative theme scores: a sector with many promotional campaigns no longer appears artificially positive.
Example Cases
| Case | Signal | Effect on trust |
| --- | --- | --- |
| C-1274 | 35 identical sentence parts within 2 hours | −0.22 |
| C-2091 | Coupon mention + referral link | −0.18 |
| C-3310 | 40 reviews from a new account within 24 hours | −0.26 |
Normalisation and Reporting
After weighting, we first normalise per platform (to compensate for moderation differences) and then cross-platform via z-score, so that all results appear on a single scale (0–100); a sketch of this step follows the list below. On the company page, we display:
- weighted theme scores,
- sentiment distribution,
- reliability band (CI),
- share of incentivised reviews.
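A minimal sketch of the normalisation chain described above: z-scores within each platform, a second z-score across the pooled results, and a mapping onto 0–100. The clipping at ±3σ for the final mapping is our own illustrative choice; the article only says the results land on a single 0–100 scale.

```python
from statistics import mean, stdev

def zscores(values: list[float]) -> list[float]:
    """Standardise a list of scores; degenerate cases map to 0."""
    if len(values) < 2:
        return [0.0] * len(values)
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def cross_platform_scores(scores_by_platform: dict[str, list[float]]) -> dict[str, list[float]]:
    """Step 1: z-score within each platform (absorbs moderation differences).
    Step 2: z-score again across the pooled values so all platforms share one scale."""
    within = {p: zscores(v) for p, v in scores_by_platform.items()}
    pooled = [z for zs in within.values() for z in zs]
    pooled_z = zscores(pooled)
    # Re-split the pooled z-scores back per platform, preserving order.
    out: dict[str, list[float]] = {}
    i = 0
    for p, zs in within.items():
        out[p] = pooled_z[i:i + len(zs)]
        i += len(zs)
    return out

def to_0_100(z: float, clip: float = 3.0) -> float:
    """Map a z-score to the 0-100 display scale.
    Clipping at +/-3 sigma and mapping linearly is our own assumption."""
    z = max(-clip, min(clip, z))
    return (z + clip) / (2 * clip) * 100
```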
Limitations
- Not every platform provides device or account data.
- Short reviews remain difficult to assess.
- Source bias: audience per source may differ from the actual customer base.
- Irony or sarcasm is not always accurately recognised.
That’s why we report with margins and definitions rather than absolute truths.
What This Means for You
For Consumers
Trust patterns, not outliers, and check for the labels “incentivised” and “low repetition”.
For Companies
Address themes with high impact and low trust (e.g. billing or delivery time) for quick gains.