📄

PDF to Excel

Extract tables from PDFs, bank statements, and invoices into Excel files you can run calculations on.

PDF ⭐⭐ Intermediate ⏱️ 5 minutes

🎯 The problem to solve

Received a bank statement as a PDF that has to go into your books? Have a PDF report with tables of figures you need to analyze? Manual copy-paste breaks the formatting.

Pain points:

  • Copying a table from a PDF loses columns
  • Numbers are misread (1,000 vs 1.000)
  • Every cell has to be fixed by hand
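
The "1,000 vs 1.000" problem comes down to which character is the thousands separator. A minimal sketch of separator-aware parsing, assuming Vietnamese-style input by default (dot for thousands, comma for decimals); the function name `parse_number` and its parameters are illustrative, not part of any library:

```python
from decimal import Decimal, InvalidOperation


def parse_number(text: str, thousands: str = ".", decimal: str = ",") -> Decimal:
    """Parse a number string using explicit separators (Vietnamese style by default)."""
    cleaned = text.strip().replace("\u00a0", "").replace(" ", "")
    # Accounting-style negatives: "(1.000)" means -1000.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    # Drop thousands separators, then normalize the decimal mark to ".".
    cleaned = cleaned.replace(thousands, "").replace(decimal, ".")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}") from None
    return -value if negative else value
```

With the defaults, `parse_number("1.000")` yields one thousand, while `parse_number("1.000", thousands=",", decimal=".")` yields one. Making the convention an explicit argument avoids guessing it from the text.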

⚖️ Comparison: Before and After

| Criterion | Copy-paste | Agentic Workflow |
|---|---|---|
| Accuracy | 50% | 95%+ |
| Time/table | 10-20 minutes | 30 seconds |
| Number format | Wrong | Correct |

💡 Sample prompt

Extract tables from a PDF:

INPUT: [PDF file containing tables]

EXTRACTION:
- Detect all tables: Yes
- Pages: all / specific (1, 3, 5-10)
- Table selection: Auto detect

DATA CLEANING:
- Number format: Vietnamese (dot as thousands separator)
- Date format: DD/MM/YYYY
- Currency: VND
- Remove duplicate headers: Yes

OUTPUT:
- Format: Excel (.xlsx)
- One sheet per table / per page
- Include source page reference
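
Several options in the prompt above map to small, deterministic steps. A rough sketch of three of them (page selection, duplicate header removal, DD/MM/YYYY dates) using only the standard library; all function names here are hypothetical, and the row format (a list of lists of strings) is an assumption about what the extraction step returns:

```python
from datetime import date, datetime


def parse_page_spec(spec: str, page_count: int) -> list[int]:
    """Expand "all" or "1, 3, 5-10" into a sorted list of 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Pages outside the document are dropped silently.
    return sorted(p for p in pages if 1 <= p <= page_count)


def drop_repeated_headers(rows: list[list[str]]) -> list[list[str]]:
    """Keep the first row as the header and drop later rows identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


def parse_date(text: str) -> date:
    """Parse a DD/MM/YYYY date string."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()
```

For example, `parse_page_spec("1, 3, 5-10", page_count=8)` keeps pages 1, 3, and 5 through 8.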

🏗️ Phase 2: Architect (Permanent Tool)

For Analysts.

Engineering Prompt:

**Role:** Python GUI Developer (PyQt6 Specialist)
**Task:** Create "PDF Table Extractor Pro" Desktop App

**Objective:** A dedicated tool to extract tabular data from PDFs into structured Excel files.

**Tech Stack:**
* Language: Python 3.10+
* GUI Library: PyQt6 (Cross-platform)
* Engine: pdfplumber / camelot
* Packaging: PyInstaller

**Functional Requirements:**
1.  **UI Layout (PyQt6):**
    *   **Input:** PDF File.
    *   **Preview:** Visual detector showing red boxes around identified tables.
    *   **Settings:** "Flavor" (Lattice/Stream), "Merge Headers".
    *   **Export:** "Export to Excel".

2.  **Core Logic:**
    *   Analyze page for lines/tables.
    *   Show confidence preview.
    *   Extract to DataFrame -> Header cleaning -> XLSX.
    *   **Threading:** Extraction is slow; show progress bar.

3.  **Deliverables:**
    *   `main.py`: Complete source code.
    *   `requirements.txt`: Dependencies.
    *   **Build Instructions:**
        *   Windows: `pyinstaller --onefile --noconsole main.py`
        *   macOS: `pyinstaller --windowed main.py`
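
For orientation, the "Extract to DataFrame -> Header cleaning -> XLSX" step from the Core Logic list could look roughly like the sketch below. It assumes pdfplumber as the engine plus pandas and openpyxl for the Excel output; the GUI, preview, and threading parts are left out. The names `clean_header` and `extract_tables_to_xlsx` and the `p<page>_t<index>` sheet naming are illustrative choices, not something the prompt prescribes:

```python
from pathlib import Path


def clean_header(cells: list) -> list[str]:
    """Turn raw header cells into unique, non-empty column names."""
    names: list[str] = []
    seen: dict[str, int] = {}
    for i, cell in enumerate(cells):
        # Extracted cells can be None or contain line breaks.
        name = " ".join(str(cell or "").split()) or f"column_{i + 1}"
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"
        names.append(name)
    return names


def extract_tables_to_xlsx(pdf_path: str, xlsx_path: str) -> int:
    """Write every detected table to its own sheet; return the table count."""
    # Third-party imports are local so clean_header stays usable without them.
    import pandas as pd
    import pdfplumber

    frames = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for index, rows in enumerate(page.extract_tables(), start=1):
                if len(rows) < 2:
                    continue  # header only, nothing to export
                frame = pd.DataFrame(rows[1:], columns=clean_header(rows[0]))
                # The sheet name doubles as the source page reference.
                frames[f"p{page.page_number}_t{index}"] = frame

    if not frames:
        raise ValueError(f"no tables found in {Path(pdf_path).name}")

    with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
    return len(frames)
```

Cells come out as strings, so number and date conversion would still need to happen before the export if the sheet is meant for calculations.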

🔧 Tips & Best Practices

Table types

| Type | Challenge | Solution |
|---|---|---|
| Simple grid | Easy | Auto detect |
| Merged cells | Medium | AI extraction |
| No borders | Hard | Column detection |
| Multi-page | Complex | Merge logic |
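
As one possible reading of "merge logic" for multi-page tables, the sketch below joins consecutive tables when a table repeats the header row of the one before it. This is an assumption about how continuation pages look; a table that continues without repeating its header would need a different rule (for example, matching column counts). The function name `merge_page_tables` is hypothetical:

```python
def merge_page_tables(tables: list[list[list[str]]]) -> list[list[list[str]]]:
    """Join consecutive tables whose header row matches the previous table's."""
    merged: list[list[list[str]]] = []
    for table in tables:
        if not table:
            continue
        if merged and table[0] == merged[-1][0]:
            # Same header: treat as a continuation and skip the repeated header.
            merged[-1].extend(table[1:])
        else:
            # New table: copy rows so the input is not modified later.
            merged.append([list(row) for row in table])
    return merged
```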

Data validation

  • Check totals match
  • Verify row counts
  • Spot-check random cells
  • Compare with PDF visually
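
The first three checks can be partly automated. A minimal sketch, assuming the first row is the header and amounts are already normalized to plain decimal strings such as "100.50"; `validate_table` and `pick_spot_checks` are illustrative names, and the expected row count and stated total are values you read off the PDF yourself:

```python
import random
from decimal import Decimal


def validate_table(
    rows: list[list[str]], amount_col: int, expected_rows: int, stated_total: Decimal
) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    body = rows[1:]  # skip the header row
    if len(body) != expected_rows:
        problems.append(f"row count: expected {expected_rows}, got {len(body)}")
    total = sum((Decimal(str(row[amount_col])) for row in body), Decimal(0))
    if total != stated_total:
        problems.append(f"total: expected {stated_total}, got {total}")
    return problems


def pick_spot_checks(rows: list[list[str]], k: int = 5, seed=None) -> list[tuple]:
    """Pick up to k random body cells as (row, column, value) to compare with the PDF."""
    cells = [
        (r, c, value)
        for r, row in enumerate(rows[1:], start=1)
        for c, value in enumerate(row)
    ]
    return random.Random(seed).sample(cells, min(k, len(cells)))
```

The visual comparison against the PDF stays manual; `pick_spot_checks` only decides which cells to look at.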

Tools

  • Tabula: Free, good for simple tables
  • Camelot: Python, advanced options
  • Amazon Textract: AI-powered, paid

Difficulty: ⭐⭐ Intermediate | Time: 5 minutes
