Extract text and structure from PDFs, scanned documents, DOCX files, and images — ready for summarization, search, or downstream analysis.
npx clawhub@latest install ocr-docu
OCR and Documents extracts usable text and structure from PDFs, scanned documents, and common office formats like DOCX. It combines fast text extraction for born-digital PDFs with OCR pipelines for image-heavy or scanned inputs, outputting clean plain text or structured markdown. Install this skill when you need to feed document content into summarization, search, indexing, or any downstream workflow that requires readable text.
Uses PyMuPDF or pdfminer to pull text directly from born-digital PDFs without OCR overhead, preserving layout as closely as possible.
Routes scanned PDFs and document photos through Tesseract or a compatible OCR service to recover text from image-based content.
Reads Microsoft Word files via python-docx, extracting paragraph and heading structure into clean text or markdown output.
Normalizes extracted content into either plain text or structured markdown, making output immediately consumable by summarization, indexing, or other downstream skills.
Automatically selects the appropriate extractor — direct text layer, OCR, or format-specific parser — based on the document type detected at runtime.
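The routing described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names, the `MIN_TEXT_CHARS` cutoff, the list of image suffixes, and the use of `pytesseract` and Pillow as the Tesseract front end are all assumptions made for the example. Third-party libraries are imported lazily so that each format only needs its own dependency.

```python
from pathlib import Path

# Assumed cutoff: a PDF page whose text layer has fewer characters than this
# is treated as scanned. The value is illustrative, not taken from the skill.
MIN_TEXT_CHARS = 20
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def choose_extractor(suffix: str, text_layer_chars: int = 0) -> str:
    """Return 'docx', 'ocr', or 'text' for a file suffix and text-layer size."""
    suffix = suffix.lower()
    if suffix == ".docx":
        return "docx"
    if suffix in IMAGE_SUFFIXES:
        return "ocr"
    if suffix == ".pdf":
        return "text" if text_layer_chars >= MIN_TEXT_CHARS else "ocr"
    raise ValueError(f"unsupported document type: {suffix!r}")


def paragraph_to_markdown(style_name: str, text: str) -> str:
    """Map a Word style name such as 'Heading 2' to a markdown line."""
    if style_name.startswith("Heading "):
        level = style_name.split()[-1]
        if level.isdigit():
            return "#" * int(level) + " " + text
    return text


def extract_docx(path: Path) -> str:
    import docx  # python-docx, imported lazily

    paragraphs = docx.Document(str(path)).paragraphs
    return "\n\n".join(paragraph_to_markdown(p.style.name, p.text) for p in paragraphs)


def ocr_image(image) -> str:
    import pytesseract  # assumed wrapper; also needs the tesseract binary

    return pytesseract.image_to_string(image)


def extract_pdf(path: Path) -> str:
    import fitz  # PyMuPDF
    from PIL import Image

    pages = []
    with fitz.open(str(path)) as doc:
        for page in doc:
            text = page.get_text()
            if choose_extractor(".pdf", len(text.strip())) == "ocr":
                # No usable text layer: render the page and OCR the image.
                pix = page.get_pixmap(dpi=300)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = ocr_image(image)
            pages.append(text)
    return "\n\n".join(pages)


def extract(path: str) -> str:
    """Pick an extractor from the file suffix and return the document text."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".pdf":
        return extract_pdf(p)  # decides text layer vs. OCR page by page
    if choose_extractor(suffix) == "docx":
        return extract_docx(p)
    from PIL import Image

    return ocr_image(Image.open(p))
```

Deciding per page rather than per file is a design choice in this sketch: it lets a PDF that mixes born-digital and scanned pages use the fast path wherever a text layer exists.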
OCR a scanned invoice or filled form and return structured fields such as vendor name, amount, and due date for downstream processing or storage.
Extract text from a multi-page PDF and hand each page's content to a summarization skill, enabling document-wide summaries without manual copy-paste.
Parse a Word document and convert its content to markdown, making it ready for a knowledge base, static site, or further editing in a text-based workflow.
Batch-extract text from a collection of mixed PDFs and DOCX files to produce clean, normalized text chunks suitable for vector or full-text search indexing.
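For the indexing use case, the normalization and chunking step might look something like the sketch below. It uses only the standard library; `normalize_text`, `chunk_text`, and the `max_chars` default are hypothetical names and values chosen for illustration, and the skill's own normalization rules may differ.

```python
import re


def normalize_text(raw: str) -> str:
    """Tidy whitespace in extracted text while keeping paragraph breaks."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line break. Caveat: this also
    # joins genuinely hyphenated words that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces next to newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack paragraphs into chunks of up to max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries rather than at fixed character offsets keeps sentences intact, which tends to matter for both embedding-based and full-text search.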
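Python environment (or equivalent tooling) plus one or more of the following libraries, depending on your document types: PyMuPDF or pdfminer for born-digital PDFs, Tesseract (or a compatible OCR service) for scanned PDFs and document photos, and python-docx for Microsoft Word files.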
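One way to check which of the optional libraries are present in the current environment is sketched below. The import names are assumptions about how each package is usually installed (PyMuPDF as `fitz`, pdfminer as `pdfminer`, python-docx as `docx`), and `pytesseract` is an assumed Python wrapper for Tesseract that the text above does not name.

```python
from importlib.util import find_spec

# Display label -> assumed import name. Adjust to match your environment.
OPTIONAL_LIBRARIES = {
    "PyMuPDF": "fitz",
    "pdfminer": "pdfminer",
    "python-docx": "docx",
    "pytesseract": "pytesseract",
}


def available_libraries(candidates: dict[str, str] = OPTIONAL_LIBRARIES) -> dict[str, bool]:
    """Report, for each label, whether its module can be found for import."""
    return {label: find_spec(module) is not None for label, module in candidates.items()}


if __name__ == "__main__":
    for label, present in available_libraries().items():
        print(f"{label}: {'installed' if present else 'missing'}")
```

Note that finding `pytesseract` only confirms the Python wrapper; the Tesseract binary itself has to be installed separately.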