🎯 The problem to solve
Received a bank statement as a PDF that has to be imported into your books? Have a PDF report whose tables you need to analyze? Manual copy-paste breaks the formatting.
Pain points:
- Copying a table out of a PDF loses columns
- Numbers get misread (1,000 vs 1.000)
- Every cell has to be fixed by hand
⚖️ Comparison: Before and After
| Criterion | Copy-paste | Agentic Workflow |
|---|---|---|
| Accuracy | 50% | 95%+ |
| Time per table | 10-20 minutes | 30 seconds |
| Number format | Wrong | Correct |
💡 Sample prompt
Extract tables from a PDF:
INPUT: [PDF file containing tables]
EXTRACTION:
- Detect all tables: Yes
- Pages: all / specific (1, 3, 5-10)
- Table selection: Auto detect
DATA CLEANING:
- Number format: Vietnamese (period as thousands separator)
- Date format: DD/MM/YYYY
- Currency: VND
- Remove duplicate headers: Yes
OUTPUT:
- Format: Excel (.xlsx)
- One sheet per table / per page
- Include source page reference
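The DATA CLEANING settings above can be sketched in a few lines of Python. This is a minimal illustration, not part of the prompt itself: the function names `parse_vn_number` and `parse_vn_date` are hypothetical, and the sketch assumes every cell uses `.` for thousands and `,` for decimals, which is the convention the prompt asks for.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def parse_vn_number(text: str) -> Decimal:
    """Parse a Vietnamese-formatted amount: '.' groups thousands, ',' marks decimals."""
    # Drop currency markers and spaces (assumption: amounts are labelled VND or ₫).
    cleaned = text.strip().replace("VND", "").replace("₫", "").replace(" ", "")
    # Remove thousands separators first, then turn the decimal comma into a point.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {text!r}") from exc


def parse_vn_date(text: str) -> date:
    """Parse a DD/MM/YYYY date string."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


if __name__ == "__main__":
    print(parse_vn_number("1.234.567,89 VND"))
    print(parse_vn_date("05/03/2024"))
```

Using `Decimal` rather than `float` keeps currency amounts exact, which matters when the extracted totals are later compared with the figures printed in the PDF.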
🏗️ Phase 2: Architect (Permanent Tool)
For Analysts.
Engineering Prompt:
**Role:** Python GUI Developer (PyQt6 Specialist)
**Task:** Create "PDF Table Extractor Pro" Desktop App
**Objective:** A dedicated tool to extract tabular data from PDFs into structured Excel files.
**Tech Stack:**
* Language: Python 3.10+
* GUI Library: PyQt6 (Cross-platform)
* Engine: pdfplumber / camelot
* Packaging: PyInstaller
**Functional Requirements:**
1. **UI Layout (PyQt6):**
* **Input:** PDF File.
* **Preview:** Visual detector showing red boxes around identified tables.
* **Settings:** "Flavor" (Lattice/Stream), "Merge Headers".
* **Export:** "Export to Excel".
2. **Core Logic:**
* Analyze page for lines/tables.
* Show confidence preview.
* Extract to DataFrame -> Header cleaning -> XLSX.
* **Threading:** Extraction is slow; show progress bar.
3. **Deliverables:**
* `main.py`: Complete source code.
* `requirements.txt`: Dependencies.
* **Build Instructions:**
* Windows: `pyinstaller --onefile --noconsole main.py`
* macOS: `pyinstaller --windowed main.py`
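The "Header cleaning" step in the Core Logic can be sketched without any PDF library. The example assumes each table arrives as a list of rows whose cells are strings or `None`, which is the shape pdfplumber's `extract_tables()` returns; the helper names `clean_header` and `table_to_records` and the `column_N` fallback naming are hypothetical choices for illustration.

```python
import re


def clean_header(cells):
    """Normalise header cells: collapse whitespace, name blanks, de-duplicate repeats."""
    seen = {}
    result = []
    for index, cell in enumerate(cells):
        # Line breaks inside a header cell become single spaces.
        name = re.sub(r"\s+", " ", cell or "").strip() or f"column_{index + 1}"
        count = seen.get(name, 0) + 1
        seen[name] = count
        # A repeated name gets a numeric suffix so every column stays addressable.
        result.append(name if count == 1 else f"{name}_{count}")
    return result


def table_to_records(rows):
    """Turn raw rows (first row is the header) into dicts, skipping empty rows."""
    if not rows:
        return []
    header = clean_header(rows[0])
    records = []
    for row in rows[1:]:
        if not any((cell or "").strip() for cell in row):
            continue  # an all-blank row carries no data
        records.append(dict(zip(header, row)))
    return records


if __name__ == "__main__":
    raw = [["Date", "Amount\n(VND)", None], ["01/02/2024", "1.000", "ok"], [None, "", None]]
    print(table_to_records(raw))
```

The resulting records can be handed to `pandas.DataFrame(records)` and written out with `DataFrame.to_excel`, which needs an Excel engine such as openpyxl to be installed.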
🔧 Tips & Best Practices
Table types
| Type | Challenge | Solution |
|---|---|---|
| Simple grid | Easy | Auto detect |
| Merged cells | Medium | AI extraction |
| No borders | Hard | Column detection |
| Multi-page | Complex | Merge logic |
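For the multi-page case, the "merge logic" can be as simple as concatenating the per-page fragments and dropping the header row that many PDFs repeat on every page. The sketch below is one possible approach under that assumption; `merge_page_tables` is a hypothetical name, and the input is assumed to be one list of rows per page, in page order.

```python
def _normalise(row):
    """Compare header rows loosely: ignore case, surrounding spaces and None cells."""
    return [(cell or "").strip().lower() for cell in row]


def merge_page_tables(page_tables):
    """Concatenate per-page fragments of one table, keeping only the first header row."""
    merged = []
    header = None
    for rows in page_tables:
        if not rows:
            continue  # a page with no rows contributes nothing
        if header is None:
            header = _normalise(rows[0])
            merged.append(rows[0])
            body = rows[1:]
        elif _normalise(rows[0]) == header:
            body = rows[1:]  # repeated header on a continuation page
        else:
            body = rows  # continuation page that starts directly with data
        for row in body:
            if len(row) != len(header):
                raise ValueError(f"column count changed: {row!r}")
            merged.append(row)
    return merged


if __name__ == "__main__":
    pages = [
        [["Date", "Amount"], ["01/01/2024", "100"]],
        [["Date ", "amount"], ["02/01/2024", "200"]],
        [["03/01/2024", "300"]],
    ]
    for row in merge_page_tables(pages):
        print(row)
```

Raising on a changed column count is a deliberate choice here: a silent merge of misaligned fragments is harder to catch later than an early error.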
Data validation
- Check totals match
- Verify row counts
- Spot-check random cells
- Compare with PDF visually
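The first three checks above lend themselves to automation; the visual comparison stays manual. A minimal sketch, assuming the amounts have already been parsed into `Decimal` values and that the total and row count printed in the PDF are known (the names `validate_extraction` and `pick_spot_checks` are hypothetical):

```python
import random
from decimal import Decimal


def validate_extraction(amounts, stated_total, expected_rows=None):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    computed = sum(amounts, Decimal("0"))
    if computed != stated_total:
        problems.append(f"total mismatch: computed {computed}, PDF states {stated_total}")
    if expected_rows is not None and len(amounts) != expected_rows:
        problems.append(f"row count mismatch: extracted {len(amounts)}, expected {expected_rows}")
    return problems


def pick_spot_checks(row_count, column_count, k=5, seed=None):
    """Pick up to k random (row, column) positions to compare against the PDF by eye."""
    cells = [(r, c) for r in range(row_count) for c in range(column_count)]
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(cells, min(k, len(cells)))


if __name__ == "__main__":
    amounts = [Decimal("100"), Decimal("250")]
    print(validate_extraction(amounts, Decimal("350"), expected_rows=2))
    print(pick_spot_checks(row_count=2, column_count=3, k=3, seed=1))
```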
Tools
- Tabula: Free, good for simple tables
- Camelot: Python, advanced options
- Amazon Textract: AI-powered, paid
Difficulty: ⭐⭐ Intermediate | Time: 5 minutes