OCR and Documents

Extract text and structure from PDFs, scanned documents, DOCX files, and images — ready for summarization, search, or downstream analysis.

Version: v1.0.0

Productivity & Tasks

Current installs: 2

OCR and Documents gives your AI assistant the ability to read and extract usable text from a wide variety of document formats — including born-digital PDFs, scanned paper documents, and DOCX files. Whether you need raw text, structured markdown, or specific fields pulled from an invoice or report, this skill preprocesses documents into clean output that other skills and workflows can act on.

How It Works

The skill selects the right extraction strategy based on the document type. Text-based PDFs are processed quickly using libraries like PyMuPDF or pdfminer. Scanned documents and image-heavy files are routed through an OCR pipeline (e.g. Tesseract or a compatible OCR service). DOCX files are parsed using python-docx. The extracted content is then normalized into plain text or structured markdown, ready for summarization, indexing, archival, or further analysis by downstream skills.
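One way to decide between direct extraction and OCR for a PDF is to check how much embedded text each page actually carries. The sketch below is an illustration only: the function name `needs_ocr`, the 25-character threshold, and the "more than half the pages" rule are assumptions, not the skill's documented behavior.

```python
def needs_ocr(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is scanned, given the text extracted from each page.

    Hypothetical heuristic: a page is "sparse" when it has fewer than
    min_chars_per_page non-whitespace-trimmed characters. If more than half
    of the pages are sparse, the file is treated as scanned.
    """
    if not page_texts:
        # No pages with extractable text at all: fall back to OCR.
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


if __name__ == "__main__":
    digital = ["This page has a full paragraph of embedded text.", "So does this second page of the report."]
    scanned = ["", "  \n", "3"]
    print(needs_ocr(digital))  # False
    print(needs_ocr(scanned))  # True
```

A threshold like this is cheap to evaluate, but mixed documents (a digital report with scanned appendices) may be better served by making the decision per page rather than per file.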

Key Features

Digital PDF Extraction

Fast, accurate text extraction from born-digital PDFs using PyMuPDF or pdfminer.
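As a rough sketch of this step, assuming PyMuPDF is installed and importable as `fitz`, per-page text can be read with `page.get_text("text")`. The helper names `extract_pdf_pages` and `join_pages`, and the `[page N]` marker format, are hypothetical choices made for this example.

```python
def extract_pdf_pages(path):
    """Return a list with the embedded text of each page of a born-digital PDF."""
    import fitz  # PyMuPDF; imported lazily so the module loads without it

    with fitz.open(path) as doc:
        return [page.get_text("text") for page in doc]


def join_pages(pages):
    """Join page texts, prefixing each with a marker so page numbers survive."""
    return "\n\n".join(
        f"[page {number}]\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    )


if __name__ == "__main__":
    import sys

    # Usage: python extract_pdf.py report.pdf
    print(join_pages(extract_pdf_pages(sys.argv[1])))
```

Keeping page boundaries in the output makes it possible to summarize or cite by page later on.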

OCR for Scanned Documents

Recognizes text in scanned PDFs and document photos via Tesseract or a compatible OCR service.
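A minimal sketch of the OCR path, assuming the `pytesseract` wrapper and Pillow are installed alongside the Tesseract binary (the listing names Tesseract but not a specific Python wrapper, so that choice is an assumption). The `LANG_CODES` mapping and both function names are hypothetical.

```python
# Hypothetical mapping from short language tags to Tesseract traineddata names.
LANG_CODES = {"en": "eng", "zh": "chi_sim"}


def tesseract_langs(tags):
    """Build Tesseract's '+'-joined language string, e.g. 'eng+chi_sim'."""
    return "+".join(LANG_CODES[tag] for tag in tags)


def ocr_image(path, tags=("en",)):
    """Run OCR on a single image file and return the recognized text."""
    from PIL import Image  # Pillow, imported lazily
    import pytesseract     # thin wrapper around the tesseract binary

    with Image.open(path) as img:
        # Grayscale conversion is a common, simple preprocessing step.
        return pytesseract.image_to_string(img.convert("L"), lang=tesseract_langs(tags))


if __name__ == "__main__":
    import sys

    # Usage: python ocr_image.py scan.png
    print(ocr_image(sys.argv[1], tags=("en", "zh")))
```

Each requested language needs its traineddata file installed for Tesseract, otherwise the call fails at run time.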

DOCX Parsing

Pulls content from Microsoft Word files using python-docx, preserving structure where possible.
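To illustrate what "preserving structure" could look like, the sketch below maps python-docx paragraph style names to markdown. It assumes Word's default English style names ("Heading 1", "Title", "List Bullet"); the function names are hypothetical, and tables are not covered because `doc.paragraphs` only yields body paragraphs.

```python
def paragraph_to_markdown(style_name, text):
    """Convert one paragraph to a markdown line based on its style name."""
    text = text.strip()
    style_name = style_name or ""
    if not text:
        return ""
    if style_name == "Title":
        return "# " + text
    if style_name.startswith("Heading "):
        suffix = style_name.split()[-1]
        if suffix.isdigit():
            # Markdown supports heading levels 1 to 6.
            return "#" * min(int(suffix), 6) + " " + text
    if style_name.startswith("List Bullet"):
        return "- " + text
    return text


def docx_to_markdown(path):
    """Read a DOCX file and return its paragraphs as markdown."""
    from docx import Document  # python-docx, imported lazily

    doc = Document(path)
    lines = [paragraph_to_markdown(p.style.name, p.text) for p in doc.paragraphs]
    return "\n\n".join(line for line in lines if line)
```

Documents that use custom or localized style names would fall through to plain paragraphs, which is one reason structure can only be preserved "where possible".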

Markdown or Plain Text Output

Exports extracted content in a normalized format suitable for summarization, search, or archival pipelines.
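Normalization is not specified in detail here, so the following is only a sketch of typical cleanup steps: Unicode compatibility folding, line-ending unification, rejoining words hyphenated across line breaks, and whitespace collapsing. The function name `normalize_text` is hypothetical, and the de-hyphenation rule is a heuristic that can also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re
import unicodedata


def normalize_text(raw):
    """Clean extracted text into a predictable plain-text form."""
    # Fold ligatures and full-width forms (e.g. the "fi" ligature) to plain characters.
    text = unicodedata.normalize("NFKC", raw)
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim spaces around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_text("extrac-\ntion  of   text\r\n\r\n\r\n\r\nnext "))
```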

Smart Routing

Detects document type and applies the appropriate extraction method automatically; routes PowerPoint files to a dedicated PPTX skill.
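Type detection could rely on file signatures rather than extensions, since extensions are easy to get wrong. The sketch below is an assumption about how such routing might work, not the skill's actual logic: `classify`, `detect_kind`, and the returned labels are hypothetical names. DOCX and PPTX are both ZIP containers, so they are told apart by their member names.

```python
import zipfile


def classify(head, zip_members=()):
    """Classify a file from its first bytes and, for ZIP files, its member names."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"\x89PNG\r\n\x1a\n") or head.startswith(b"\xff\xd8\xff"):
        return "image"  # PNG or JPEG
    if head.startswith(b"PK"):
        if "word/document.xml" in zip_members:
            return "docx"
        if any(name.startswith("ppt/") for name in zip_members):
            return "pptx"  # not handled here; hand off to a PPTX skill
    return "unknown"


def detect_kind(path):
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    members = ()
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            members = tuple(zf.namelist())
    return classify(head, members)
```

A "pdf" result still needs a second decision (embedded text or OCR), because the signature alone does not reveal whether the pages are scanned images.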

Multilingual Trigger Support

Responds to extraction requests in English and Chinese (e.g. "从 PDF 提取文字", meaning "extract text from a PDF", and "识别扫描件", meaning "recognize a scanned document").
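Trigger matching might be sketched as below. Only the two Chinese phrases come from this page; the English patterns, the function name `is_extraction_request`, and the matching strategy are assumptions made for illustration.

```python
import re

# Assumed English triggers; word boundaries avoid matching "ocr" inside "mediocre".
EN_PATTERNS = [
    re.compile(r"\bocr\b"),
    re.compile(r"\bextract (the )?text\b"),
]

# Chinese phrases from the listing, stored lowercased and without spaces.
ZH_PHRASES = ["从pdf提取文字", "识别扫描件"]


def is_extraction_request(message):
    """Return True when the message looks like a request to extract text."""
    lowered = message.casefold()
    if any(pattern.search(lowered) for pattern in EN_PATTERNS):
        return True
    # Chinese text is often typed with or without spaces around "PDF".
    compact = re.sub(r"\s+", "", lowered)
    return any(phrase in compact for phrase in ZH_PHRASES)


if __name__ == "__main__":
    for sample in ["Please OCR this scan", "从 PDF 提取文字", "a mediocre result"]:
        print(sample, "->", is_extraction_request(sample))
```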

Requirements

Python Environment

A local Python environment or equivalent tooling is required to run the extraction libraries.

PyMuPDF or pdfminer

Needed for digital PDF text extraction.

Tesseract or OCR Service

Required for scanned documents and image-heavy files; OCR quality depends on scan clarity, language, and layout.

python-docx

Required for parsing DOCX files.

Note on PPTX

PowerPoint files are not handled by this skill; route them to a dedicated PPTX skill instead.
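Before running any extraction, it can help to verify that these dependencies are present. The sketch below uses only the standard library. The import names `fitz` (PyMuPDF), `pdfminer` (pdfminer.six), and `docx` (python-docx) are the usual ones for those packages, while the function name and the report keys are hypothetical.

```python
import importlib.util
import shutil


def check_requirements():
    """Report which extraction dependencies are available in this environment."""
    def has_module(name):
        return importlib.util.find_spec(name) is not None

    return {
        # Either PDF library is enough for born-digital PDFs.
        "pdf_backend": has_module("fitz") or has_module("pdfminer"),
        "python_docx": has_module("docx"),
        # Looks for the tesseract executable on PATH; a hosted OCR service
        # would need a different check.
        "tesseract_binary": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, available in check_requirements().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```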

Use Cases

Invoice Processing

OCR a scanned invoice and extract vendor name, amount, and due date automatically.
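Once OCR has produced text, field extraction could be as simple as a few labeled-line patterns. This is a sketch under strong assumptions: the labels ("Vendor", "Total", "Due date"), the pattern set, and the sample invoice text are all invented for illustration, and real invoices vary far more in layout.

```python
import re

# Hypothetical label patterns; each captures the value after the label.
FIELD_PATTERNS = {
    "vendor": re.compile(r"^(?:vendor|supplier)\s*[:\-]\s*(.+)$", re.I | re.M),
    "amount": re.compile(
        r"^(?:total|amount due|balance due)\s*[:\-]?\s*([$€£]?\s?\d[\d,]*(?:\.\d{2})?)",
        re.I | re.M,
    ),
    "due_date": re.compile(r"^due(?: date)?\s*[:\-]\s*(.+)$", re.I | re.M),
}


def extract_invoice_fields(text):
    """Return a dict of field name to matched value, or None when not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


if __name__ == "__main__":
    # Made-up sample text standing in for OCR output.
    sample = "Vendor: Acme Supplies\nInvoice #: 1001\nTotal: $1,250.00\nDue date: 2024-07-31\n"
    print(extract_invoice_fields(sample))
```

Returning `None` for missing fields, rather than guessing, lets a downstream step flag the invoice for manual review.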

Document Summarization

Extract full text from a multi-page PDF and pass it to a summarization skill for page-by-page or section-level summaries.
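For long PDFs, the extracted pages usually need to be grouped into chunks small enough for a summarizer. The sketch below packs consecutive pages under a character budget while keeping page numbers attached; `chunk_pages` and the 4000-character default are hypothetical, and a real summarization skill may count tokens instead.

```python
def chunk_pages(pages, max_chars=4000):
    """Group consecutive pages into chunks of at most max_chars characters.

    Each chunk is a list of (page_number, text) pairs. A single page longer
    than max_chars is kept whole in its own chunk rather than split.
    """
    chunks, current, size = [], [], 0
    for number, text in enumerate(pages, start=1):
        if current and size + len(text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append((number, text))
        size += len(text)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    pages = ["a" * 30, "b" * 30, "c" * 30]
    for chunk in chunk_pages(pages, max_chars=60):
        print([number for number, _ in chunk])
```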

Search Indexing

Convert a library of PDFs and DOCX files into plain text for indexing in a search or knowledge base system.
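To show why plain text is a convenient target for indexing, here is a toy inverted index built with the standard library. It is only a sketch: the function names and the sample documents are invented, and a production setup would more likely feed the text to a dedicated search engine. Note that the `\w+` tokenizer treats an unbroken run of Chinese characters as a single token, so Chinese text would need a proper word segmenter.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.casefold()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = re.findall(r"\w+", query.casefold())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    # Made-up extracted texts keyed by file name.
    docs = {"a.pdf": "Quarterly revenue report", "b.docx": "Revenue forecast notes"}
    index = build_index(docs)
    print(sorted(search(index, "revenue")))
    print(sorted(search(index, "revenue report")))
```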

Archival Pipelines

Normalize legacy scanned documents into structured markdown for long-term storage and retrieval.

Research Assistance

Read and extract content from academic papers or reports so the assistant can answer questions or generate citations.

Multilingual Document Handling

Process documents and handle extraction requests in both English and Chinese.

How to Install

Run the following command in your terminal:

npx clawhub@latest install ocr-docu

Alternatively, click the Install button at the top of this page for one-click setup.
