📄

OCR and Documents

Extract text and structure from PDFs, scanned documents, DOCX files, and images, ready for summarization, search, or downstream analysis.

Author v1.0.0
Productivity & Tasks
npx clawhub@latest install ocr-docu
Current installs: 0
Version: v1.0.0

OCR and Documents gives your AI assistant the ability to read and extract usable text from a wide variety of document formats, including born-digital PDFs, scanned paper documents, DOCX files, and images. Whether you need raw text, structured markdown, or specific fields pulled from an invoice or report, this skill converts documents into clean output that other skills and workflows can act on.

How It Works

The skill selects the right extraction strategy based on the document type. Text-based PDFs are processed quickly using libraries like PyMuPDF or pdfminer. Scanned documents and image-heavy files are routed through an OCR pipeline (e.g. Tesseract or a compatible OCR service). DOCX files are parsed using python-docx. The extracted content is then normalized into plain text or structured markdown, ready for summarization, indexing, archival, or further analysis by downstream skills.
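
The routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's actual implementation: the function names, the strategy labels, the list of image suffixes, and the `MIN_CHARS_PER_PAGE` threshold are all hypothetical. The only library calls used are PyMuPDF's `fitz.open`, `Document.page_count`, and `Page.get_text`, and PyMuPDF is imported lazily so the routing logic runs without it.

```python
from pathlib import Path

# Hypothetical values for illustration; the skill's real rules are not documented.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
MIN_CHARS_PER_PAGE = 20  # below this, assume the PDF has no usable text layer


def choose_strategy(path: str, chars_per_page: float = 0.0) -> str:
    """Pick an extraction strategy from the file suffix and, for PDFs,
    from how much embedded text the file appears to contain."""
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        return "docx"          # parse with python-docx
    if suffix == ".pptx":
        return "pptx-skill"    # hand off to a dedicated PPTX skill
    if suffix in IMAGE_SUFFIXES:
        return "ocr"           # images always need OCR
    if suffix == ".pdf":
        # Born-digital PDFs carry a text layer; scans usually do not.
        return "pdf-text" if chars_per_page >= MIN_CHARS_PER_PAGE else "ocr"
    return "unsupported"


def pdf_chars_per_page(path: str) -> float:
    """Average number of embedded text characters per page, via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so routing works without it

    with fitz.open(path) as doc:
        if doc.page_count == 0:
            return 0.0
        total = sum(len(page.get_text().strip()) for page in doc)
        return total / doc.page_count
```

A caller would typically probe a PDF first, for example `choose_strategy(path, pdf_chars_per_page(path))`, and fall back to OCR when the probe finds little or no embedded text.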

Key Features

Digital PDF Extraction
Fast, accurate text extraction from born-digital PDFs using PyMuPDF or pdfminer.
OCR for Scanned Documents
Recognizes text in scanned PDFs and document photos via Tesseract or a compatible OCR service.
DOCX Parsing
Pulls content from Microsoft Word files using python-docx, preserving structure where possible.
Markdown or Plain Text Output
Exports extracted content in a normalized format suitable for summarization, search, or archival pipelines.
Smart Routing
Detects document type and applies the appropriate extraction method automatically; routes PowerPoint files to a dedicated PPTX skill.
Multilingual Trigger Support
Responds to extraction requests in English and Chinese (e.g. "从 PDF 提取文字" for "extract text from a PDF", "识别扫描件" for "recognize a scanned document").
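
As one possible reading of the "normalized" plain text and markdown output mentioned above, the sketch below cleans up raw extracted text and wraps per-page text in markdown headings. The function names and the per-page heading layout are assumptions made for illustration; the skill's actual output format is not specified here.

```python
import re
from typing import List


def normalize_text(raw: str) -> str:
    """Clean up common extraction noise in raw page text."""
    # Unify line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line break ("extrac-\ntion").
    # Caveat: this also joins genuine hyphenated compounds broken across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim whitespace at the start and end of every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def to_markdown(title: str, pages: List[str]) -> str:
    """Render a title and per-page text as simple structured markdown."""
    parts = [f"# {title}"]
    for number, page in enumerate(pages, start=1):
        parts.append(f"## Page {number}")
        parts.append(normalize_text(page))
    return "\n\n".join(parts) + "\n"
```

Keeping one heading per page makes it easy for a downstream summarization or indexing step to cite where a passage came from.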

Requirements

Python Environment
A local Python environment or equivalent tooling is required to run the extraction libraries.
PyMuPDF or pdfminer
Needed for digital PDF text extraction.
Tesseract or OCR Service
Required for scanned documents and image-heavy files; OCR quality depends on scan clarity, language, and layout.
python-docx
Required for parsing DOCX files.
Note on PPTX
PowerPoint files are not handled by this skill; route them to a dedicated PPTX skill instead.
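
Before running the skill, it can help to confirm that these pieces are present. The check below is a minimal sketch using only the Python standard library. It assumes the usual import names (`fitz` for PyMuPDF, `pdfminer` for pdfminer, `docx` for python-docx) and assumes Tesseract is installed as a `tesseract` executable on the PATH; a hosted OCR service would need a different check. The function names are hypothetical.

```python
import importlib.util
import shutil
from typing import Dict, List

# Assumed import names for each requirement listed above.
PYTHON_MODULES = {
    "PyMuPDF": "fitz",
    "pdfminer": "pdfminer",
    "python-docx": "docx",
}


def check_requirements() -> Dict[str, bool]:
    """Report which extraction dependencies appear to be available."""
    status = {
        name: importlib.util.find_spec(module) is not None
        for name, module in PYTHON_MODULES.items()
    }
    # Tesseract is an external binary, not a Python module.
    status["tesseract"] = shutil.which("tesseract") is not None
    return status


def missing_requirements(status: Dict[str, bool]) -> List[str]:
    """List unmet requirements; either PDF library satisfies the PDF one."""
    missing = []
    if not (status["PyMuPDF"] or status["pdfminer"]):
        missing.append("PyMuPDF or pdfminer")
    if not status["tesseract"]:
        missing.append("Tesseract (or a configured OCR service)")
    if not status["python-docx"]:
        missing.append("python-docx")
    return missing
```

Calling `missing_requirements(check_requirements())` returns an empty list when everything needed for PDF, OCR, and DOCX handling was found.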

Use Cases

Invoice Processing
OCR a scanned invoice and extract vendor name, amount, and due date automatically.
Document Summarization
Extract full text from a multi-page PDF and pass it to a summarization skill for page-by-page or section-level summaries.
Search Indexing
Convert a library of PDFs and DOCX files into plain text for indexing in a search or knowledge base system.
Archival Pipelines
Normalize legacy scanned documents into structured markdown for long-term storage and retrieval.
Research Assistance
Read and extract content from academic papers or reports so the assistant can answer questions or generate citations.
Multilingual Document Handling
Process documents and extraction requests in both English and Chinese workflows.
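
To make the invoice use case concrete, here is a rough sketch of pulling vendor, amount, and due date out of text that has already been extracted or OCR'd. The field labels, regular expressions, and sample invoice are invented for illustration; real invoices vary widely in layout, and the skill may well rely on the assistant's language model rather than fixed patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical label patterns; adjust to the invoices you actually receive.
FIELD_PATTERNS = {
    "vendor": re.compile(
        r"^(?:Vendor|From)\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE
    ),
    "amount": re.compile(
        r"^(?:Total|Amount Due)\s*:\s*\$?\s*([\d,]+\.\d{2})\s*$",
        re.IGNORECASE | re.MULTILINE,
    ),
    "due_date": re.compile(
        r"^Due Date\s*:\s*(\d{4}-\d{2}-\d{2})\s*$", re.IGNORECASE | re.MULTILINE
    ),
}


def extract_invoice_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when a field is absent."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up sample text standing in for OCR output.
SAMPLE_INVOICE = """Vendor: Acme Supplies Ltd
Invoice No: 1042
Total: $1,250.00
Due Date: 2024-07-15
"""

print(extract_invoice_fields(SAMPLE_INVOICE))
```

Returning `None` for a missing field, rather than guessing, lets a downstream step flag the invoice for manual review when OCR quality is poor.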

Installation

1. Run in your terminal:
   npx clawhub@latest install ocr-docu
or
2. Click the Install button at the top of this page for one-click setup.
