π« The Pain Point
A client sends a Word report with 50 site photos inside. You need those photos as separate JPGs to upload to your system. Right-click -> Save as Picture⦠50 times? No thanks.
π Agentic Solution
Deep Extraction: Digs into the file structure (Zip/PDF internal) to retrieve the original assets.
Key Features:
- Original Quality: Retrieves the exact file that was inserted, not a compressed version.
- Bulk: Process an entire folder of reports.
βοΈ Phase 1: Commander (Quick Fix)
For extracting from a single file.
Prompt:
βI have
report.docx. Since docx is a zip file, extract all media images inside it to an βImagesβ folder. Or use a Python library to do it.β
Result: All images extracted instantly.
ποΈ Phase 2: Architect (Permanent Tool)
For Archivists/Designers.
Engineering Prompt:
**Role:** Python Document Developer
**Task:** Create an "Asset Extractor for Docs".
**Requirements:**
1. **GUI:**
* Select File Type (Word or PDF).
* Select Input Folder.
* "Extract Images" button.
2. **Logic:**
* **Word:** unzip the `.docx` and copy files from `word/media` (Fastest & Best quality).
* **PDF:** Use `fitz` (PyMuPDF) to iterate pages and `get_images()`.
* Save to Output Folder.
3. **Deliverables:** `doc_img_extract.py`, `run.bat` (Windows), `run.sh` (Mac).
π§ Prompt Decoding
- Word as Zip: A
.docxfile is literally a zipped folder of XMLs and Images. Treating it as a zip file is a βhackerβ trick that makes extraction instant and robust without needing MS Word installed.
π οΈ Instructions
- Copy Prompt -> Paste -> Run.
- Select Doc/PDF -> Extract.