abstract_pdfs

No description

Python 100%

Find a file

solcatcher 731c1e1476 baseline: abstract_pdfs 0.0.39 (from PyPI sdist)		2026-06-26 20:04:33 -05:00
src/abstract_pdfs	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
tests	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
.gitignore	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
PKG-INFO	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
pyproject.toml	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
README.md	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
setup.cfg	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00
setup.py	baseline: abstract_pdfs 0.0.39 (from PyPI sdist)	2026-06-26 20:04:33 -05:00

README.md

Part of the Abstract Intelligence Platform

This module is part of a unified system for transforming raw media into structured, searchable, and SEO-optimized data.

abstract_pdfs handles document ingestion and publishing:

PDF → structured pages (text + images)
metadata + manifest generation
static HTML output (viewer + gallery)

Full system: https://github.com/AbstractEndeavors/abstract-intelligence

abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content

A structured pipeline for transforming PDFs into searchable, metadata-rich, web-ready content, combining OCR, page-level analysis, metadata generation, and static site scaffolding.

Designed for:

large PDF collections
SEO-driven content indexing
document-to-web publishing pipelines
structured ingestion of unstructured media

🔹 What This System Is

abstract_pdfs is not a PDF utility — it is a full document processing pipeline:

ingests raw PDFs
decomposes them into pages, images, and text
extracts and generates metadata
enriches content via NLP APIs
builds structured outputs (JSON + HTML)
generates navigable web content (galleries + viewers)

The result is a fully browsable, searchable document corpus.

🔹 Pipeline Overview

PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    ├─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    ├─ gallery index pages
    ↓
Static Site Output (SEO-ready)

flowchart TD
    A[PDF Input]
    B[DocumentPipeline]
    C[SliceManager\nPage Images + Text + OCR]
    D[Per-Page Assets\nThumbnails / Text / Info JSON]
    E[Manifest Generation\nPage + Document Metadata]
    F[NLP Enrichment\nSummaries + Keywords + Descriptions]
    G[HTML Generation\nViewer Pages + Gallery Indexes]
    H[Static Output\nSearchable / SEO-ready PDF Corpus]

    A --> B --> C --> D --> E --> F --> G --> H

🔹 Core Capabilities

Document Decomposition

Splits PDFs into:
- page images
- extracted text
- structured page directories
Maintains consistent directory structure for downstream processing

Metadata & SEO Enrichment

Generates:
- summaries
- keywords
- descriptions
Integrates with NLP endpoints for:
- text analysis
- keyword refinement
- summarization

Example: page-level analysis via API calls

Manifest Generation

Produces structured JSON per page:
- metadata
- text
- image references
- SEO fields
Aggregates into document-level manifests

Static Site Generation

Generates:
- PDF viewer pages (page-by-page navigation)
- gallery index pages (directory browsing)
Automatically builds:
- thumbnails
- descriptions
- keyword tags

Example: dynamic card generation for directories

Path ↔ URL Mapping

Converts filesystem structure into web-accessible URLs
Maintains consistency between:
- local storage (/srv/media/...)
- public endpoints (/pdfs/...)

Content Structuring

Page-level:
- text
- summary
- keywords
Document-level:
- aggregated metadata
- full-text indexing

🔹 Architecture

The system is composed of modular components:

DocumentPipeline
- orchestrates ingestion → processing → output
SliceManager
- handles PDF decomposition and OCR
Manifest Generators
- build structured JSON representations
HTML Generators
- render viewer and gallery pages
Metadata Utilities
- enrich content via external NLP services

Each stage is:

independent
composable
replaceable

🔹 Key Design Decisions

Page-Level First

All processing happens per-page, enabling:

granular indexing
targeted metadata
scalable processing

Structured Over Raw

Outputs are always:

JSON manifests
structured metadata
normalized fields

Not just raw text dumps.

SEO as a First-Class Concern

Every page includes:

meta tags
OpenGraph / social metadata
keyword tagging
canonical URLs

Filesystem as Source of Truth

directory structure = content hierarchy
no database required
easily deployable as static site

🔹 Why This Exists

Traditional PDF workflows:

store documents as opaque blobs
lack searchability
lack metadata
are not web-native

abstract_pdfs transforms PDFs into:

structured, indexable content
web-ready assets
searchable knowledge bases

🔹 Example Use Cases

PDF → website publishing pipelines
document archives (research, legal, media)
SEO-driven content platforms
knowledge base generation
preprocessing for LLM / search systems

🔹 Integration Context

This system integrates with:

OCR pipelines (layout_ocr / abstract_ocr)
NLP systems (abstract_hugpy)
static hosting (Nginx / CDN)
search indexing systems

🔹 Design Philosophy

Documents are data, not files
Structure before presentation
Metadata is as important as content
Static outputs scale better than dynamic systems