- Python 100%
| src/abstract_pdfs | ||
| tests | ||
| .gitignore | ||
| PKG-INFO | ||
| pyproject.toml | ||
| README.md | ||
| setup.cfg | ||
| setup.py | ||
Part of the Abstract Intelligence Platform
This module is part of a unified system for transforming raw media into structured, searchable, and SEO-optimized data.
abstract_pdfs handles document ingestion and publishing:
- PDF → structured pages (text + images)
- metadata + manifest generation
- static HTML output (viewer + gallery)
Full system: https://github.com/AbstractEndeavors/abstract-intelligence
abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content
A structured pipeline for transforming PDFs into searchable, metadata-rich, web-ready content, combining OCR, page-level analysis, metadata generation, and static site scaffolding.
Designed for:
- large PDF collections
- SEO-driven content indexing
- document-to-web publishing pipelines
- structured ingestion of unstructured media
🔹 What This System Is
abstract_pdfs is not a PDF utility — it is a full document processing pipeline:
- ingests raw PDFs
- decomposes them into pages, images, and text
- extracts and generates metadata
- enriches content via NLP APIs
- builds structured outputs (JSON + HTML)
- generates navigable web content (galleries + viewers)
The result is a fully browsable, searchable document corpus.
🔹 Pipeline Overview
PDF Input
↓
Slice / Decompose (images + text per page)
↓
OCR + Text Extraction (layout-aware engines)
↓
Metadata Generation
├─ summaries
├─ keywords
├─ descriptions
↓
Manifest Creation (per-page + per-document)
↓
HTML Generation
├─ PDF viewer pages
├─ gallery index pages
↓
Static Site Output (SEO-ready)
flowchart TD
A[PDF Input]
B[DocumentPipeline]
C[SliceManager\nPage Images + Text + OCR]
D[Per-Page Assets\nThumbnails / Text / Info JSON]
E[Manifest Generation\nPage + Document Metadata]
F[NLP Enrichment\nSummaries + Keywords + Descriptions]
G[HTML Generation\nViewer Pages + Gallery Indexes]
H[Static Output\nSearchable / SEO-ready PDF Corpus]
A --> B --> C --> D --> E --> F --> G --> H
🔹 Core Capabilities
Document Decomposition
-
Splits PDFs into:
- page images
- extracted text
- structured page directories
-
Maintains consistent directory structure for downstream processing
Metadata & SEO Enrichment
-
Generates:
- summaries
- keywords
- descriptions
-
Integrates with NLP endpoints for:
- text analysis
- keyword refinement
- summarization
Example: page-level analysis via API calls
Manifest Generation
-
Produces structured JSON per page:
- metadata
- text
- image references
- SEO fields
-
Aggregates into document-level manifests
Static Site Generation
-
Generates:
- PDF viewer pages (page-by-page navigation)
- gallery index pages (directory browsing)
-
Automatically builds:
- thumbnails
- descriptions
- keyword tags
Example: dynamic card generation for directories
Path ↔ URL Mapping
-
Converts filesystem structure into web-accessible URLs
-
Maintains consistency between:
- local storage (
/srv/media/...) - public endpoints (
/pdfs/...)
- local storage (
Content Structuring
-
Page-level:
- text
- summary
- keywords
-
Document-level:
- aggregated metadata
- full-text indexing
🔹 Architecture
The system is composed of modular components:
-
DocumentPipeline
- orchestrates ingestion → processing → output
-
SliceManager
- handles PDF decomposition and OCR
-
Manifest Generators
- build structured JSON representations
-
HTML Generators
- render viewer and gallery pages
-
Metadata Utilities
- enrich content via external NLP services
Each stage is:
- independent
- composable
- replaceable
🔹 Key Design Decisions
Page-Level First
All processing happens per-page, enabling:
- granular indexing
- targeted metadata
- scalable processing
Structured Over Raw
Outputs are always:
- JSON manifests
- structured metadata
- normalized fields
Not just raw text dumps.
SEO as a First-Class Concern
Every page includes:
- meta tags
- OpenGraph / social metadata
- keyword tagging
- canonical URLs
Filesystem as Source of Truth
- directory structure = content hierarchy
- no database required
- easily deployable as static site
🔹 Why This Exists
Traditional PDF workflows:
- store documents as opaque blobs
- lack searchability
- lack metadata
- are not web-native
abstract_pdfs transforms PDFs into:
- structured, indexable content
- web-ready assets
- searchable knowledge bases
🔹 Example Use Cases
- PDF → website publishing pipelines
- document archives (research, legal, media)
- SEO-driven content platforms
- knowledge base generation
- preprocessing for LLM / search systems
🔹 Integration Context
This system integrates with:
- OCR pipelines (layout_ocr / abstract_ocr)
- NLP systems (abstract_hugpy)
- static hosting (Nginx / CDN)
- search indexing systems
🔹 Design Philosophy
- Documents are data, not files
- Structure before presentation
- Metadata is as important as content
- Static outputs scale better than dynamic systems