📊

Advanced Deduplicator

Advanced data cleaning tool to find and merge duplicate customer records based on fuzzy matching (Name, Email, Phone).

Excel ⭐⭐⭐ Advanced ⏱️ 10 minutes

😫 The Pain Point

Your customer list has 5,000 rows.

  • Row 1: “Nguyen Van A - 090xxx”
  • Row 500: “Nguyen V. A - 090xxx”

Excel’s “Remove Duplicates” only catches exact matches. It fails when there is a typo, a slight variation, or missing data. Emailing the same client twice looks unprofessional.

🚀 Agentic Solution

A “Smart Deduplication” tool built on fuzzy matching (comparing records by similarity rather than exact equality).

Key Features:

  • Fuzzy Match: Detects “Jonh Doe” and “John Doe” as the same person (87.5% similar by rapidfuzz’s basic ratio, which clears an 85% threshold).
  • Merge Strategy: Intelligently merges data (e.g., keeps the longest email, the newest phone number).
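To make the merge strategy concrete, here is a minimal sketch of one way to combine a group of duplicates. It is an illustration, not the tool's actual logic: the `updated_at` field is a hypothetical timestamp used to decide which phone is "newest", keeping the longest name is an added assumption, and the records are made-up sample data.

```python
from datetime import date

def merge_records(records):
    """Merge a group of duplicate records into one.

    Assumed rules: longest email wins, phone comes from the most
    recently updated record. `updated_at` is a hypothetical field.
    """
    # Longest email wins; a missing email counts as an empty string.
    best_email = max((r.get("email", "") for r in records), key=len)
    # Newest phone: taken from the latest record that actually has a phone.
    with_phone = [r for r in records if r.get("phone")]
    newest = max(with_phone, key=lambda r: r["updated_at"]) if with_phone else None
    return {
        # Assumption: the longest name is the most complete spelling.
        "name": max((r.get("name", "") for r in records), key=len),
        "email": best_email,
        "phone": newest["phone"] if newest else "",
    }

# Made-up sample data for illustration only.
group = [
    {"name": "Nguyen V. A", "email": "a@x.vn",
     "phone": "0901112222", "updated_at": date(2023, 1, 5)},
    {"name": "Nguyen Van A", "email": "nguyen.a@x.vn",
     "phone": "0903334444", "updated_at": date(2024, 6, 1)},
]
merged = merge_records(group)
print(merged)
```

"Longest" and "newest" are heuristics, so a review step before overwriting the original rows is still worthwhile.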

⚔️ Phase 1: Commander (Quick Fix)

For a quick cleanup of a specific file.

Prompt:

“I have a file customers.csv. Find potential duplicates based on the ‘Phone’ and ‘Email’ columns. Normalize the phone numbers first (remove spaces/dots). If two rows have the same phone, mark them as duplicates. Save the list of duplicates to duplicates.csv for me to review.”

Result: A list of duplicates to manually check.
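
For reference, the kind of script such a prompt might produce can be sketched with the standard library alone. This is an assumption about the output, not what any agent is guaranteed to generate. It follows the rule as written (same phone means duplicate), strips every non-digit character rather than only spaces and dots, and uses in-memory text in place of the real customers.csv and duplicates.csv files.

```python
import csv
import io
import re
from collections import defaultdict

# Made-up sample standing in for customers.csv.
SAMPLE = """Name,Email,Phone
Nguyen Van A,a@example.com,090 111 2222
Tran Thi B,b@example.com,091.333.4444
Nguyen V. A,,090.111.2222
"""

def normalize_phone(raw):
    """Keep digits only, so '090 111 2222' and '090.111.2222' compare equal."""
    return re.sub(r"\D", "", raw or "")

def find_phone_duplicates(rows):
    """Return rows whose normalized phone is shared by at least one other row."""
    by_phone = defaultdict(list)
    for row in rows:
        phone = normalize_phone(row["Phone"])
        if phone:  # rows without a phone cannot match on phone
            by_phone[phone].append(row)
    return [row for group in by_phone.values() if len(group) > 1 for row in group]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
duplicates = find_phone_duplicates(rows)

# An in-memory buffer stands in for duplicates.csv here.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Name", "Email", "Phone"])
writer.writeheader()
writer.writerows(duplicates)
print(out.getvalue())
```

To run it against real files, replace the two `io.StringIO` objects with `open("customers.csv", newline="")` and `open("duplicates.csv", "w", newline="")`.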

🏗️ Phase 2: Architect (Permanent Tool)

For Data Analysts/CRM Admins.

Engineering Prompt:

**Role:** Python GUI Developer (PyQt6 Specialist)
**Task:** Create "Advanced Fuzzy Deduplicator" Desktop App

**Objective:** A desktop application to clean dirty customer data using fuzzy logic matching.

**Tech Stack:**
* Language: Python 3.10+
* GUI Library: PyQt6 (Cross-platform)
* Algorithms: rapidfuzz, pandas
* Packaging: PyInstaller

**Functional Requirements:**
1.  **UI Layout (PyQt6):**
    *   **Import:** Excel/CSV File Loader.
    *   **Config:** Checkboxes for "Match Columns" (Name, Email, Phone).
    *   **Tuning:** "Similarity Threshold" Slider (e.g., 85%).
    *   **Review:** Side-by-side comparison of potential merge groups.

2.  **Core Logic:**
    *   **Fuzzy Match:** Compute similarity scores using `rapidfuzz`.
    *   **Grouping:** Cluster records that exceed threshold.
    *   **Threading:** Data processing in background thread.

3.  **Deliverables:**
    *   `main.py`: Complete source code.
    *   `requirements.txt`: Dependencies.
    *   **Build Instructions:**
        *   Windows: `pyinstaller --onefile --noconsole main.py`
        *   macOS: `pyinstaller --windowed main.py`
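
The “Grouping” requirement in the prompt above can be sketched as follows. This is one possible approach, not necessarily what the generated app will do: it links every pair of names that scores at or above the threshold and collects the linked records with a small union-find. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz`; the function names and sample names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity in percent (0-100), case-insensitive. Stand-in for rapidfuzz."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_duplicates(names, threshold=85):
    """Cluster names: any pair scoring >= threshold lands in the same group."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root of i's cluster (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(j)] = find(i)  # union the two clusters

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Only clusters with two or more members are potential merges.
    return [g for g in groups.values() if len(g) > 1]

names = ["John Doe", "Jonh Doe", "Jane Smith", "Jon Doe"]
print(group_duplicates(names))
```

Two caveats with this design: clusters are transitive, so A and C can share a group through B even if A and C score below the threshold, and comparing every pair grows quadratically (about 12.5 million comparisons for 5,000 rows), which is one reason the prompt asks for a background thread.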

🧠 Prompt Decoding

  • Fuzzy Logic: Exact matching checks whether A == B. Fuzzy matching checks whether Distance(A, B) < Small_Amount, which is the same as asking whether the similarity exceeds a threshold such as 85%. This gives the tool human-like tolerance for typos and small variations.
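
The difference between the two checks can be shown in a few lines. This is a minimal sketch that uses the standard library's `difflib.SequenceMatcher` in place of `rapidfuzz`; `distance` and `SMALL_AMOUNT` are illustrative names, and 0.15 is chosen to correspond roughly to an 85% similarity threshold.

```python
from difflib import SequenceMatcher

def distance(a, b):
    """Normalized distance in [0, 1]: 0 means identical strings."""
    return 1 - SequenceMatcher(None, a, b).ratio()

SMALL_AMOUNT = 0.15  # roughly an 85% similarity threshold

a, b = "Jonh Doe", "John Doe"
exact_match = a == b                          # exact comparison
fuzzy_match = distance(a, b) < SMALL_AMOUNT   # similarity-based comparison
print(exact_match, fuzzy_match, distance(a, b))
# prints: False True 0.125
```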

🛠️ Instructions

  1. Copy the prompt -> paste it into your AI agent -> run the generated app.
  2. Load your data -> set the threshold to 85% -> scan.
