🎯 The problem to solve
Have a PDF brochure with great images you want to reuse? Need to extract a logo from a PDF file? Want to pull charts or graphs out of a report?
Pain points:
- Screenshots are low quality
- Copy-paste loses resolution
- Neither approach gives you the original image
⚖️ Comparison: Before and After
| Tiêu chí | Screenshot | Extract |
|---|---|---|
| Quality | Degraded | Original |
| Resolution | Screen resolution (e.g. 72 dpi) | Full |
| Format | PNG only | Original format |
💡 Sample prompt
Extract images from a PDF:
INPUT: [file PDF]
EXTRACTION:
- All images: Yes
- Minimum size: 100x100px (skip small icons)
- Pages: all / specific
OUTPUT:
- Format: Original (or convert to PNG/JPG)
- Naming: page{n}_img{m}.{ext}
- Folder: extracted_images/
- Create index: list of files with page references
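
For reference, the extraction this prompt asks for can be sketched in a few lines of Python. This is a minimal sketch that assumes PyMuPDF is installed and importable as `fitz`; the helper names (`image_filename`, `is_large_enough`, `extract_images`) and the `index.txt` file are illustrative choices, not part of the prompt.

```python
from pathlib import Path


def image_filename(page_number: int, image_number: int, ext: str) -> str:
    """Build a name following the page{n}_img{m}.{ext} pattern (1-based)."""
    return f"page{page_number}_img{image_number}.{ext}"


def is_large_enough(width: int, height: int, min_size: int = 100) -> bool:
    """Keep only images at least min_size pixels on both sides."""
    return width >= min_size and height >= min_size


def extract_images(pdf_path: str, out_dir: str = "extracted_images",
                   min_size: int = 100) -> list[str]:
    """Save embedded images and return index lines with page references."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            images = page.get_images(full=True)
            for image_number, info in enumerate(images, start=1):
                xref = info[0]  # first tuple item is the image's xref
                img = doc.extract_image(xref)  # dict with raw bytes and metadata
                if not is_large_enough(img["width"], img["height"], min_size):
                    continue  # skip small icons and decorative lines
                name = image_filename(page_number, image_number, img["ext"])
                (out / name).write_bytes(img["image"])
                index.append(
                    f"{name}\tpage {page_number}\t{img['width']}x{img['height']}"
                )
    # Hypothetical index file: one line per saved image.
    (out / "index.txt").write_text("\n".join(index), encoding="utf-8")
    return index
```

Calling `extract_images("brochure.pdf")` (a placeholder file name) would write the images and the index into `extracted_images/`. An image reused on several pages is saved once per page in this sketch; deduplicating by `xref` is one possible refinement.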
🏗️ Phase 2: Architect (Permanent Tool)
For Designers/Content Creators.
Engineering Prompt:
**Role:** Python GUI Developer (PyQt6 Specialist)
**Task:** Create "PDF Asset Miner" Desktop App
**Objective:** A tool to extract raw image resources from PDF documents at original quality.
**Tech Stack:**
* Language: Python 3.10+
* GUI Library: PyQt6 (Cross-platform)
* Extraction: PyMuPDF (fitz)
* Packaging: PyInstaller
**Functional Requirements:**
1. **UI Layout (PyQt6):**
* **Input:** Source PDF.
* **Filter:** Min Width/Height sliders (to filter out icons/lines).
* **Format:** "Original" vs "Convert to PNG/JPG".
* **Gallery:** Thumbnail grid of found images.
2. **Core Logic:**
* Iterate PDF objects to find image streams (XObject).
* Extract stats (width, height, color space).
* Save raw bytes or convert.
* **Threading:** Extraction loop runs in background.
3. **Deliverables:**
* `main.py`: Complete source code.
* `requirements.txt`: Dependencies.
* **Build Instructions:**
* Windows: `pyinstaller --onefile --noconsole main.py`
* macOS: `pyinstaller --windowed main.py`
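
The "Threading" requirement above is what keeps the window responsive while a large PDF is processed. In the PyQt6 app a `QThread` with signals would normally play this role; the sketch below swaps in `threading` and `queue` from the standard library so it runs without a GUI. The function name `run_in_background` and the `(status, payload)` message format are assumptions made for illustration.

```python
import queue
import threading
from typing import Callable, Iterable


def run_in_background(items: Iterable[str],
                      work: Callable[[str], str],
                      results: "queue.Queue[tuple[str, str]]") -> threading.Thread:
    """Process items on a worker thread, posting (status, payload) messages."""

    def loop() -> None:
        for item in items:
            try:
                results.put(("ok", work(item)))
            except Exception as exc:  # report the failure instead of dying silently
                results.put(("error", f"{item}: {exc}"))
        results.put(("done", ""))  # sentinel: tells the UI side the loop finished

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    messages: "queue.Queue[tuple[str, str]]" = queue.Queue()
    # str.upper stands in for the real per-image extraction call.
    run_in_background(["page1_img1", "page2_img1"], str.upper, messages)
    while True:
        status, payload = messages.get()  # a GUI would poll this from a timer
        if status == "done":
            break
        print(status, payload)
```

The worker never touches widgets directly; it only posts messages. The same separation applies with `QThread`, where the worker emits signals and the main thread updates the gallery.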
🔧 Tips & Best Practices
Image types in PDF
| Type | Extractable |
|---|---|
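| Embedded JPEG | Yes, original bytes |
| Lossless (Flate-compressed) image | Yes, lossless; usually saved as PNG |
| Vector graphics | Only by rendering to raster or exporting as SVG |
| Charts | As an image if embedded as raster; otherwise render like vector graphics |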
Quality considerations
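- Extracting an embedded image returns it at the quality stored in the PDF
- The tool that produced the PDF may have downsampled images on export, so the embedded copy can be lower resolution than the source file
- Vector → raster conversion needs an explicit resolution (DPI) setting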
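
To make the resolution point concrete: PDF page coordinates use 72 units per inch, so rendering at a target DPI means scaling by `dpi / 72`. The sketch below shows that arithmetic plus a rendering call, assuming PyMuPDF is installed and importable as `fitz`; the function names are illustrative.

```python
def zoom_for_dpi(dpi: float) -> float:
    """PDF user space is 72 units per inch, so zoom = dpi / 72."""
    return dpi / 72.0


def raster_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


def render_page(pdf_path: str, page_index: int, out_path: str,
                dpi: int = 300) -> None:
    """Rasterize one page (vector content included) to an image file."""
    import fitz  # PyMuPDF; imported lazily so the helpers above need no install

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]  # 0-based page index
        zoom = zoom_for_dpi(dpi)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_path)  # output format is inferred from the file extension
```

For example, `render_page("report.pdf", 0, "page1.png", dpi=300)` (placeholder file names) would rasterize the first page, vector charts included. Doubling the DPI roughly quadruples the pixel count, so choose the lowest setting that meets the target use.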
Use cases
- Asset recovery: Recover images from old designs
- Research: Extract charts for analysis
- Archiving: Backup images separately
Difficulty: ⭐ Beginner | Time: 3 minutes