
Merge duplicates

Duplicates are the #1 threat to data quality. Every duplicate erodes trust the moment you reach out — sending the same message twice to the same account is the fastest way to look like a spammer. The whole product is built around zero duplicates: prevent at the moment of creation, detect on demand, resolve by merge.

LeadHunter never auto-deletes a duplicate. Merge is always the operation, because deleting loses information you might still want.

Every account creation — manual entry, CSV import, Google Maps lookup, API POST — runs through the same five-level stack. The first level that matches wins, and the action depends on the confidence:

| Level | Match | Confidence | Action |
| --- | --- | --- | --- |
| 1 | Google Place ID — same place | 100% | Auto-merge |
| 2 | Name + city — exact, post-normalisation | 95% | Auto-merge |
| 3 | Phone — normalised, exact (strips formatting, country codes) | 90% | Auto-merge |
| 4 | Website domain — normalised, exact (www., scheme stripped) | 85% | Auto-merge |
| 5 | Fuzzy name within the same city — ≥85% similarity | 85%+ | Suggest — surfaces in Find duplicates |

Name normalisation strips legal suffixes (Inc, LLC, GmbH, SA, BV, …), filler words (the, a, and, …), punctuation, and common substitutions (& → and). So “Joe’s Pizza & Co.” and “Joe’s Pizza and Co LLC” are exact-matched at level 2, not fuzzy.
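As an illustration, a minimal normaliser in this spirit. The suffix and filler lists here are short assumed examples; the real lists are longer:

```python
import re

# Illustrative subsets — LeadHunter's actual lists are longer.
LEGAL_SUFFIXES = {"inc", "llc", "gmbh", "sa", "bv", "ltd", "co"}
FILLER_WORDS = {"the", "a", "and"}
SUBSTITUTIONS = {"&": "and"}

def normalise_name(name: str) -> str:
    """Lower-case, apply substitutions, strip punctuation, drop suffixes and fillers."""
    s = name.lower()
    for src, dst in SUBSTITUTIONS.items():
        s = s.replace(src, f" {dst} ")
    s = re.sub(r"[^\w\s]", "", s)  # strip punctuation
    tokens = [t for t in s.split() if t not in LEGAL_SUFFIXES | FILLER_WORDS]
    return " ".join(tokens)

# Both of the doc's examples collapse to the same key:
# normalise_name("Joe's Pizza & Co.")       → "joes pizza"
# normalise_name("Joe's Pizza and Co LLC")  → "joes pizza"
```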

Levels 1–4 are deterministic, exact matches — LeadHunter is confident enough to merge without asking. Level 5 is the one that surfaces for human review.

From Accounts → Find duplicates, LeadHunter scans the current project and returns groups of likely duplicates, sorted by confidence descending. Each group shows:

  • Every account in the group with its name, location, phone, website, and key identifiers side-by-side.
  • The proposed survivor (the “golden record” — the row that will absorb the others).
  • A field-by-field diff so you can see exactly what gets merged where.

When Find duplicates returns dozens of groups, an AI review button runs each group through an LLM pass that classifies them as same / different / uncertain with a one-line rationale. Use it to filter out obvious false positives before merging — the AI doesn’t decide, it just helps you skim faster.

For groups the AI marked confidently as same (and that you’ve spot-checked), the Bulk merge action runs every selected group through the same merge logic in one request. Useful after a noisy import where dozens of obvious duplicates landed in one go.

When a merge runs (auto on levels 1–4, on-confirm for level 5):

  1. Every unique field from every duplicate is preserved on the survivor. If one duplicate has a phone and the other has a website, the survivor ends up with both.
  2. Smart picks on conflicts when both rows have a value:
    • Email — prefers a non-generic address (maria@acme.com wins over info@acme.com).
    • Website — prefers the company’s own domain over an aggregator URL (e.g. acme.com over a Yelp listing).
    • Notes — concatenated, de-duplicated, with attribution.
    • Latitude + longitude — picked as a pair from the same source row (never half from one, half from the other).
    • Website content — the longer, fresher scrape wins.
    • Numeric / rating fields — the higher value wins.
    • Custom fields — deep-merged dict; multi-select arrays union.
  3. The survivor’s merge_history JSONB grows an entry recording: which accounts were absorbed, their full snapshot (so you can read what was on each row before the merge), who ran the merge, when, and which fields were overridden.
  4. Everything pointing at the absorbed rows is re-pointed at the survivor: campaign-account rows, conversation messages, research records, scores. No history is lost.
  5. Lifecycle status escalates, never demotes. The internal order is prospect < contacted < in_negotiation < lost < customer < do_not_contact. So:
    • do_not_contact always wins (GDPR / CAN-SPAM stickiness — an opt-out survives every merge).
    • customer beats lost — a recorded customer relationship outranks a never-sold lost deal even if the customer is older.
    • The audit trail records the previous status of every absorbed row.
  6. Acquisition channel sticks to the survivor’s value by default. Use the field-override controls on the merge dialog if the absorbed row’s channel was actually correct (e.g. you imported the outbound-discovered row first, but the inbound Adwords row was the real origin).
  7. AI picks the canonical name when the duplicates have different names — “Joe’s Pizza” over “Joe’s Pizza Restaurant LLC Holdings 2014”. You can override it manually in the merge dialog before confirming.
  8. Absorbed rows are deleted at the end. Their data lives on in the survivor’s fields and in merge_history, but the Account rows themselves are gone.

Every merge in the UI runs through a preview step first — same logic as the real merge, but no writes. The preview shows you the field-winner picks (with the source row that won), the LLM-selected name, the merge_history entry that would be written, and any overrides you’ve set. Confirm to commit, or back out to adjust.

For programmatic merges, hit the preview endpoint before the merge endpoint — both accept the same request shape.

Deleting loses information. Even a “clearly bad” duplicate often has one field the survivor doesn’t — a phone number, a typo correction in the address, a different decision-maker contact, a different acquisition channel. Merging keeps every signal and the audit trail.

The cost of merging conservatively (keep everything) is one bigger row. The cost of deleting aggressively is permanent data loss. We optimise for the first.

Merges aren’t reversible — but the history is preserved


When you confirm a merge, the absorbed accounts are deleted. There’s no “unmerge” button — the survivor row is now the truth and re-creating a clean split is hard to do automatically (which campaign-account row belonged to which side? which conversations?). The merge_history JSONB records the absorbed rows’ full snapshots so you can always read what was on each side, but restoring them is a manual rebuild.

In practice this rarely bites — most merges are obvious by the time they happen — but it’s worth knowing before you bulk-merge 50 fuzzy candidates without spot-checking.

The trade-off:

  • Merging two accounts that turn out to be different organisations: rare, but messy when it happens. The survivor row carries fields from both, conversations mix, the merge_history is the record. Manual rebuild required.
  • Not merging two accounts that are actually the same: common, low cost initially. Eventually, you reach out to the same place twice and look like a spammer.

Bias toward merging when the AI’s confidence is high and you’ve spot-checked the fields. Bias toward leaving them when names look similar but the addresses are clearly different (two unrelated Acme Bikes in two countries).

Common pitfalls

  • Bulk-merging without spot-checking the AI review. AI review is a triage helper, not a decision. Skim the uncertain group at minimum before clicking bulk-merge.
  • Forgetting that levels 1–4 already auto-merged on import. Find duplicates shows you what slipped through levels 1–4 — usually accounts that share a fuzzy name but no Place ID, phone, or website overlap. If you’re seeing what looks like an obvious phone duplicate, it usually means the phone wasn’t on one of the rows at import time.
  • Manually merging accounts the system flagged as different. The AI is wrong sometimes, but when it confidently says two accounts are different, it’s worth re-reading the field diff before overriding.
  • Treating merge_history as a backup. It’s an audit trail. Restoring an absorbed row from the snapshot is a manual exercise — possible, not push-button.
Related

  • Account — the row model the merge produces.
  • Import accounts — where most duplicates originate, and how to prevent them.
  • Track inbound leads — inbound channels create duplicate-prone overlap with outbound rows; tagging acquisition_channel correctly helps disambiguate.