📊

Advanced Deduplicator

Advanced data cleaning tool to find and merge duplicate customer records based on fuzzy matching (Name, Email, Phone).

Excel ⭐⭐⭐ Advanced ⏱️ 10 minutes

😫 The Pain Point

Your customer list has 5,000 rows.

  • Row 1: “Nguyen Van A - 090xxx”
  • Row 500: “Nguyen V. A - 090xxx”

Excel’s “Remove Duplicates” only catches exact matches. It misses records that differ by a typo, a slight variation, or a missing field. Emailing the same client twice looks unprofessional.

🚀 Agentic Solution

A “Smart Deduplication” tool using fuzzy matching (comparing records by similarity rather than exact equality).

Key Features:

  • Fuzzy Match: Detects “Jonh Doe” and “John Doe” as the same person (about 88% similar by a character-level ratio, so the threshold must sit below that to catch it).
  • Merge Strategy: Intelligently merges data (e.g., keeps the longest email, the newest phone number).
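
As a rough illustration of the two features above, here is a minimal sketch using only the Python standard library. It uses `difflib.SequenceMatcher` as a stand-in for a fuzzy-matching library so it runs without extra installs. The `Customer` fields, the `updated` date, and the rule "take the name from the newest record" are assumptions made for this example, not part of the original tool.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


@dataclass
class Customer:
    name: str
    email: str
    phone: str
    updated: date  # hypothetical "last updated" field, used to pick the newest phone


def merge(records: list[Customer]) -> Customer:
    """Merge one duplicate group: longest email, phone from the newest record."""
    newest = max(records, key=lambda r: r.updated)
    return Customer(
        name=newest.name,  # assumption: the newest record has the best name
        email=max((r.email for r in records), key=len),
        phone=newest.phone,
        updated=newest.updated,
    )


# Made-up sample records.
a = Customer("Jonh Doe", "jd@x.io", "0901234567", date(2023, 1, 5))
b = Customer("John Doe", "john.doe@example.com", "0907654321", date(2024, 3, 1))

score = similarity(a.name, b.name)
merged = merge([a, b]) if score >= 85 else None
print(round(score, 1), merged)
```

In a real tool, the merge should happen only after the user has confirmed the group, since a high name score alone does not prove two rows are the same person.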

⚔️ Phase 1: Commander (Quick Fix)

For a quick cleanup of a specific file.

Prompt:

“I have a file customers.csv. Find potential duplicates based on the ‘Phone’ and ‘Email’ columns. Normalize the phone numbers first (remove spaces/dots). If two rows have the same phone, mark them as duplicates. Save the list of duplicates to duplicates.csv for me to review.”

Result: A list of duplicates to manually check.
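
For reference, the kind of script this prompt tends to produce might look like the sketch below. It covers only the phone rule from the prompt (normalize, then flag rows sharing a phone). The column names `Name`, `Email`, `Phone` and the inline sample data are assumptions. The sample is read from a string so the block runs on its own.

```python
import csv
import io
import re
from collections import defaultdict


def normalize_phone(raw: str) -> str:
    """Keep digits only, so '090.123 4567' and '0901234567' compare equal."""
    return re.sub(r"\D", "", raw)


def find_phone_duplicates(rows: list[dict]) -> list[dict]:
    """Return every row whose normalized phone appears more than once."""
    groups = defaultdict(list)
    for row in rows:
        key = normalize_phone(row["Phone"])
        if key:  # rows without a phone cannot be matched this way
            groups[key].append(row)
    return [row for group in groups.values() if len(group) > 1 for row in group]


# Inline sample standing in for customers.csv (made-up data).
sample = """Name,Email,Phone
Nguyen Van A,a@example.com,090 123 4567
Tran Thi B,b@example.com,091.222.3333
Nguyen V. A,,090.123.4567
"""
rows = list(csv.DictReader(io.StringIO(sample)))
duplicates = find_phone_duplicates(rows)

# Write the review list as CSV. A string buffer is used here; for a real
# file, replace it with open("duplicates.csv", "w", newline="").
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Name", "Email", "Phone"])
writer.writeheader()
writer.writerows(duplicates)
print(out.getvalue())
```

Stripping every non-digit also removes a leading "+", so numbers written with and without a country code would still need extra handling.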

🏗️ Phase 2: Architect (Permanent Tool)

For Data Analysts/CRM Admins.

Engineering Prompt:

**Role:** Python Data Engineer
**Task:** Create an "Advanced Fuzzy Deduplicator".
**Requirements:**
1.  **GUI:**
    *   Load Excel/CSV.
    *   Select Columns to match (e.g., Name, Email, Phone).
    *   Slider: "Similarity Threshold" (e.g., 90% match).
    *   "Find Duplicates" button.
2.  **Logic:**
    *   Use `rapidfuzz` library for high-speed string matching.
    *   Group records that score above the threshold.
    *   Display groups side-by-side for user verification before merging.
3.  **Deliverables:** `deduplicator.py`, `run.bat` (Windows), `run.sh` (Mac).
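
The "group records that score above the threshold" requirement can be approached in several ways. One possible sketch is below: score every pair, then link matching pairs with a small union-find so that chains of matches end up in one group. It uses the standard library's `difflib` as a stand-in scorer so the example runs without installing `rapidfuzz`. The function names and sample names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def score(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale (case-insensitive)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_duplicates(values: list[str], threshold: float) -> list[list[int]]:
    """Group row indices whose values are linked by a pairwise score >= threshold."""
    parent = list(range(len(values)))

    def find(i: int) -> int:
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if score(values[i], values[j]) >= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    groups: dict[int, list[int]] = {}
    for i in range(len(values)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with at least two rows are duplicate candidates.
    return [g for g in groups.values() if len(g) > 1]


names = ["John Doe", "Jonh Doe", "Jane Roe", "John  Doe"]
print(group_duplicates(names, threshold=85))
```

Two caveats about this design: comparing every pair grows quadratically with the row count, and chaining can pull in a row that only resembles one member of the group. Both are reasons to keep the side-by-side verification step before merging.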

🧠 Prompt Decoding

  • Fuzzy Logic: Standard code checks whether A == B. Fuzzy matching checks whether Distance(A, B) < Small_Amount, where Distance measures how many edits separate the two strings. This allows human-like flexibility in catching typos and variations.
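
To make the Distance(A, B) idea concrete, one common choice is the Levenshtein edit distance: the number of single-character insertions, deletions, and substitutions needed to turn A into B. The sketch below is a plain-Python version for illustration. The `small_amount` default of 3 is an arbitrary assumption, and a production tool would more likely call an optimized library.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(a: str, b: str, small_amount: int = 3) -> bool:
    """Exact matching asks a == b; fuzzy matching asks Distance(a, b) < small_amount."""
    return levenshtein(a, b) < small_amount


print(levenshtein("Jonh Doe", "John Doe"))     # two substitutions
print(is_fuzzy_match("Jonh Doe", "John Doe"))
print(is_fuzzy_match("John Doe", "Jane Roe"))
```

A fixed edit budget is harsher on short strings than long ones, which is why tools often convert the distance to a percentage similarity instead.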

🛠️ Instructions

  1. Copy Prompt -> Paste -> Run.
  2. Load Data -> Set Threshold 85% -> Scan.
