
OCR and Documents

Extract text and structure from PDFs, scanned documents, DOCX files, and images, ready for summarization, search, or downstream analysis.

Version v1.0.0

Productivity & Tasks


npx clawhub@latest install ocr-docu


OCR and Documents gives your AI assistant the ability to read and extract usable text from a wide variety of document formats, including born-digital PDFs, scanned paper documents, document photos, and DOCX files. Whether you need raw text, structured markdown, or specific fields pulled from an invoice or report, this skill converts documents into clean output that other skills and workflows can act on.

How It Works

The skill selects the right extraction strategy based on the document type. Text-based PDFs are processed quickly using libraries like PyMuPDF or pdfminer. Scanned documents and image-heavy files are routed through an OCR pipeline (e.g. Tesseract or a compatible OCR service). DOCX files are parsed using python-docx. The extracted content is then normalized into plain text or structured markdown, ready for summarization, indexing, archival, or further analysis by downstream skills.
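The strategy selection described above can be sketched as a small dispatch on file suffix. This is an illustration only: the function name `pick_strategy`, the route labels, and the list of image suffixes are assumptions, not the skill's actual internals.

```python
from pathlib import Path

# Hypothetical list of image suffixes that go straight to OCR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def pick_strategy(filename: str, has_text_layer: bool = True) -> str:
    """Return an (assumed) route label for a file, based on its suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # Born-digital PDFs carry a text layer; scanned ones need OCR.
        return "pdf-text" if has_text_layer else "ocr"
    if suffix in IMAGE_SUFFIXES:
        return "ocr"
    if suffix == ".docx":
        return "docx"
    if suffix == ".pptx":
        return "pptx-skill"  # handed off to another skill, not processed here
    raise ValueError(f"unsupported document type: {suffix or filename}")
```

Raising on unknown suffixes, rather than guessing, keeps unsupported files from silently producing empty output.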

Key Features

Digital PDF Extraction

Fast, accurate text extraction from born-digital PDFs using PyMuPDF or pdfminer.
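As a rough sketch of what text-layer extraction with PyMuPDF can look like (the helper names are illustrative, and the form-feed page separator is an assumption rather than the skill's documented output format):

```python
def join_pages(page_texts):
    """Join per-page text with a form feed so page boundaries survive."""
    return "\f".join(text.strip() for text in page_texts)


def extract_pdf_text(path):
    """Extract the embedded text of a born-digital PDF with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so join_pages works without it

    with fitz.open(path) as doc:
        # get_text() with no arguments returns the page's plain text.
        return join_pages(page.get_text() for page in doc)
```

Keeping a page separator in the output makes later page-by-page summarization possible without reopening the PDF.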

OCR for Scanned Documents

Recognizes text in scanned PDFs and document photos via Tesseract or a compatible OCR service.
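A minimal sketch of the OCR step, assuming the `pytesseract` wrapper and Pillow are installed alongside a Tesseract binary. The language table is an illustrative subset and the function names are hypothetical.

```python
# Tesseract traineddata names for a few short language codes (illustrative subset).
TESSERACT_LANGS = {"en": "eng", "zh": "chi_sim", "zh-hant": "chi_tra"}


def tesseract_lang(codes):
    """Build Tesseract's '+'-joined language string, e.g. 'eng+chi_sim'."""
    return "+".join(TESSERACT_LANGS[code.lower()] for code in codes)


def ocr_image(path, codes=("en",)):
    """Run OCR on a single image file and return the recognized text."""
    # Lazy imports: both packages are optional third-party dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=tesseract_lang(codes))
```

For a mixed English and Chinese scan, `ocr_image("scan.png", codes=("en", "zh"))` would ask Tesseract to load both language models, provided the matching traineddata files are installed.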

DOCX Parsing

Pulls content from Microsoft Word files using python-docx, preserving structure where possible.
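One way structure might be preserved is to map Word's built-in paragraph styles to markdown. The sketch below assumes the document uses the standard "Heading N" and "List Bullet" style names; it reads body paragraphs only, so tables are skipped. Function names are illustrative.

```python
import re


def paragraph_to_markdown(style_name, text):
    """Map one Word paragraph to a markdown line using its style name."""
    text = text.strip()
    match = re.fullmatch(r"Heading (\d)", style_name)
    if match and text:
        return "#" * int(match.group(1)) + " " + text
    if style_name.startswith("List Bullet") and text:
        return "- " + text
    return text


def docx_to_markdown(path):
    """Convert the body paragraphs of a DOCX file to markdown text."""
    from docx import Document  # python-docx, imported lazily

    document = Document(path)
    lines = (paragraph_to_markdown(p.style.name, p.text) for p in document.paragraphs)
    # Drop empty paragraphs and separate the rest with blank lines.
    return "\n\n".join(line for line in lines if line)
```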

Markdown or Plain Text Output

Exports extracted content in a normalized format suitable for summarization, search, or archival pipelines.
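The normalization step is not specified in detail here, so the following is only a plausible sketch using the standard library: it unifies Unicode forms, rejoins words hyphenated across line breaks, and tidies whitespace. The function name and the exact rules are assumptions.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean extracted text: unify Unicode, rejoin hyphenated words, tidy whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Two trade-offs are worth noting. NFKC folds ligatures such as "ﬁ" into plain letters but also converts full-width punctuation to ASCII, which may be unwanted for Chinese text; NFC is the gentler choice there. The hyphen rule is a heuristic and will also merge genuine compounds that happen to break at a line end.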

Smart Routing

Detects document type and applies the appropriate extraction method automatically; routes PowerPoint files to a dedicated PPTX skill.
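For PDFs, one common heuristic for telling scanned files from born-digital ones is to check how much text the embedded text layer yields per page. The sketch below assumes that approach; the function name and the 25-character threshold are illustrative, not documented values.

```python
def needs_ocr(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is scanned from the text its pages yield.

    page_texts: list of strings from a text-layer extractor, one per page.
    """
    if not page_texts:
        return True  # nothing extractable, so fall back to OCR
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    # Route to OCR when most pages have little or no embedded text.
    return sparse > len(page_texts) / 2
```

Using a majority rule instead of requiring every page to be empty tolerates scanned documents that carry a short digital cover page or stamp.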

Multilingual Trigger Support

Responds to extraction requests in English and Chinese, e.g. "从 PDF 提取文字" ("extract text from a PDF") or "识别扫描件" ("recognize a scanned document").

Requirements

Python Environment

A local Python environment or equivalent tooling is required to run the extraction libraries.

PyMuPDF or pdfminer

Needed for digital PDF text extraction.

Tesseract or OCR Service

Required for scanned documents and image-heavy files; OCR quality depends on scan clarity, language, and layout.

python-docx

Required for parsing DOCX files.

Note on PPTX

PowerPoint files are not handled by this skill; route them to a dedicated PPTX skill instead.

Use Cases

Invoice Processing

OCR a scanned invoice and extract vendor name, amount, and due date automatically.
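Once OCR has produced text, field extraction could be as simple as a few regular expressions. The patterns below are hypothetical and fit only one made-up invoice layout (labels such as "Vendor:", "Total:", and "Due Date:" with an ISO date); real invoices vary widely and usually need per-vendor rules or a model-based extractor.

```python
import re

# Hypothetical patterns for a single invoice layout; adjust per document source.
FIELD_PATTERNS = {
    "vendor": re.compile(r"^(?:Vendor|From)\s*:\s*(.+)$", re.I | re.M),
    "amount": re.compile(r"^(?:Total|Amount Due)\s*:\s*\$?\s*([\d,]+\.\d{2})\s*$", re.I | re.M),
    "due_date": re.compile(r"^Due Date\s*:\s*(\d{4}-\d{2}-\d{2})\s*$", re.I | re.M),
}


def extract_invoice_fields(text):
    """Return the fields found in OCR text; missing fields map to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields
```

Returning `None` for a missing field, rather than raising, lets a downstream step decide whether to retry OCR or ask the user.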

Document Summarization

Extract full text from a multi-page PDF and pass it to a summarization skill for page-by-page or section-level summaries.

Search Indexing

Convert a library of PDFs and DOCX files into plain text for indexing in a search or knowledge base system.
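A batch conversion along these lines might walk a folder and write one text file per supported document. In this sketch the extractor callables are supplied by the caller, and the function name and the output naming scheme (`name.pdf.txt`) are assumptions.

```python
from pathlib import Path


def convert_library(source_dir, output_dir, extractors):
    """Write one .txt file per supported document and return the paths written.

    extractors maps a lowercase suffix such as ".pdf" to a callable that takes
    a Path and returns the document's text.
    """
    source, output = Path(source_dir), Path(output_dir)
    written = []
    for path in sorted(source.rglob("*")):
        extract = extractors.get(path.suffix.lower())
        if extract is None or not path.is_file():
            continue
        # Append ".txt" to the full name so a.pdf and a.docx do not collide.
        target = output / path.relative_to(source).parent / (path.name + ".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(extract(path), encoding="utf-8")
        written.append(target)
    return written
```

The resulting text files mirror the source folder layout, so an indexer can map each hit back to its original document.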

Archival Pipelines

Normalize legacy scanned documents into structured markdown for long-term storage and retrieval.

Research Assistance

Read and extract content from academic papers or reports so the assistant can answer questions or generate citations.

Multilingual Document Handling

Process documents and extraction requests in both English and Chinese workflows.

Installation

Run in your terminal

npx clawhub@latest install ocr-docu

Alternatively, click the Install button at the top of this page for one-click setup.
