media_intelligence — Abstract Intelligence Platform
A unified, layered facade that turns raw media — PDFs, images, and video — into structured, searchable, SEO-ready data. It does not reimplement any engine: it selects the best function of each sibling package and exposes it behind one clean, lazy API, plus an orchestrated pipeline.
Raw Media (PDF / Image / Video / URL)
│
▼
ingest (webtools) → extract (ocr / pdfs / videos) → structure (typed metadata) → enrich (hugpy) → persist (FS / DB) → publish (react / nginx)
Layers → canonical owners
| Layer | Owner package | What it does |
|---|---|---|
| ingest | abstract_webtools | scrape pages, download video (yt-dlp/ffmpeg) |
| ocr | abstract_ocr | layout-aware, multi-engine OCR |
| documents | abstract_pdfs | PDF decomposition + manifests + HTML |
| video | abstract_videos | registry pipeline: download/frames/transcribe |
| transcribe | hugpy (→ abstract_ocr fallback) | Whisper speech-to-text |
| enrich | hugpy | summaries, keywords, vision captioning, SEO |
| persist | filesystem (DB-pluggable) | typed JSON/JSONB manifests |
| publish | abstract_react + abstract_nginx | SEO/OG metadata + static HTML |
Overlapping capabilities are resolved to one owner (Whisper → hugpy;
video download → webtools; summarize/keywords → hugpy).
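The "one owner per capability" rule, including the transcribe fallback from the table, can be pictured as a small priority lookup. This is only a sketch: `CAPABILITY_OWNERS`, `resolve_owner`, and the `is_installed` callback are hypothetical names for illustration, not part of the package's documented API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical priority table: the first entry is the canonical owner,
# later entries are fallbacks. Only the transcribe fallback is taken
# from the README table; the capability keys are illustrative.
CAPABILITY_OWNERS: Dict[str, List[str]] = {
    "transcribe": ["hugpy", "abstract_ocr"],
    "video_download": ["abstract_webtools"],
    "summarize": ["hugpy"],
    "keywords": ["hugpy"],
}


def resolve_owner(
    capability: str, is_installed: Callable[[str], bool]
) -> Optional[str]:
    """Return the first installed owner for a capability, or None."""
    for package in CAPABILITY_OWNERS.get(capability, []):
        if is_installed(package):
            return package
    return None
```

Passing `is_installed` in as a callback keeps the lookup independent of how installation is detected, which also makes it easy to exercise without any sibling package present.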
Install
media_intelligence is just this src/ facade — it contains none of the
engines. Each layer's owner is its own PyPI package, declared as an optional
extra, so you install only what you use:
pip install media_intelligence # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]" # just those layers
pip install "media_intelligence[all]" # the full platform
The package has no required third-party dependencies: importing it is cheap
(~20 ms) and pulls none of the backing packages. Each sibling is imported
lazily, only when its layer is actually called; a missing one raises a clear
MissingDependency naming the extra to install.
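The lazy-import behaviour described above could look roughly like the following. It is a minimal sketch under stated assumptions: the `LazyLayer` class, its constructor arguments, and the error message format are invented here for illustration; only the `MissingDependency` name comes from this README, and its real base class is not documented here.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class MissingDependency(ImportError):
    """Illustrative stand-in: raised when a layer's package is absent."""


class LazyLayer:
    """Proxy that imports the backing package on first attribute access."""

    def __init__(self, layer: str, package: str, extra: str) -> None:
        self._layer = layer
        self._package = package
        self._extra = extra
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._package)
            except ImportError as exc:
                # Name the extra so the fix is obvious from the message.
                raise MissingDependency(
                    f"layer '{self._layer}' needs '{self._package}'; install "
                    f'it with: pip install "media_intelligence[{self._extra}]"'
                ) from exc
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Called only for names not found on the proxy itself,
        # i.e. the functions that live in the backing package.
        return getattr(self._load(), name)
```

Constructing a `LazyLayer` costs nothing beyond storing three strings; the import (and any failure) happens only when a function on the layer is first looked up.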
Check what's usable in the current environment without importing any backing package:
import media_intelligence as mi
mi.available() # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich") # True / False
Usage
Direct namespace access
import media_intelligence as mi
text = mi.ocr.image_to_text("page.png")
kw = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")
Orchestrated pipeline (idempotent + resumable)
from media_intelligence import MediaPipeline
pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)
# ... or simply:
pipe.run()
The pipeline autodetects media kind, dispatches each stage accordingly, skips
stages already satisfied (idempotent), and rehydrates from a prior manifest on
re-run (resumable). Results land in out_root/<media_id>/manifest.json.
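The idempotent and resumable behaviour can be illustrated with a toy pipeline that records completed stages in a manifest and reloads it on construction. `TinyPipeline`, its `stage` method, and the `done` key are assumptions made for this sketch; they are not the actual `MediaPipeline` internals.

```python
import json
from pathlib import Path
from typing import Callable


class TinyPipeline:
    """Toy stand-in showing skip-if-done and rehydrate-on-restart."""

    def __init__(self, media_id: str, out_root: str) -> None:
        self.manifest_path = Path(out_root) / media_id / "manifest.json"
        # Resumable: rehydrate state from a prior manifest if one exists.
        if self.manifest_path.exists():
            self.state = json.loads(
                self.manifest_path.read_text(encoding="utf-8")
            )
        else:
            self.state = {"media_id": media_id, "done": {}}

    def stage(self, name: str, work: Callable[[], dict]) -> "TinyPipeline":
        # Idempotent: a stage with a recorded result is skipped.
        if name not in self.state["done"]:
            self.state["done"][name] = work()
            self._save()
        return self  # returning self allows chained calls

    def _save(self) -> None:
        self.manifest_path.parent.mkdir(parents=True, exist_ok=True)
        self.manifest_path.write_text(
            json.dumps(self.state, indent=2), encoding="utf-8"
        )
```

Saving after every stage means an interrupted run loses at most the stage that was in flight; the next run picks up from the last recorded result.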
Persistence (DB-pluggable, two records)
Each item is persisted as two records so indexing stays cheap while aggregation stays simple:
- manifest.json (lean index): ids, counts, text_chars, summary, keywords, SEO, asset pointers. (The JSONB metadata row.)
- document.json (canonical content): full text, pages/segments, transcript. The single source of truth for search / aggregation / LLM datasets: one read per item, no re-stitching of per-owner on-disk files.
store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest) # lean index
store.save_document(item.media_id, document) # full body
doc = store.load_document(item.media_id) # aggregation reads this
# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...) # planned (abstract_database)
# -> metadata in JSONB, body text in a full-text-indexed column
MediaPipeline.persist() writes both records. On re-run, the body is rehydrated from
document.json, so the extract and enrich stages are skipped (no re-OCR or re-transcription).
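A filesystem store with the two-record layout might be sketched as below. The `Store` protocol, the `SketchFileStore` class, and `load_manifest` are hypothetical names for illustration; the method names `save_manifest`, `save_document`, and `load_document` and the `<root>/<media_id>/` layout follow the README, but the real `FileStore` may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol


class Store(Protocol):
    """Interface a filesystem or database backend would both satisfy."""

    def save_manifest(self, media_id: str, manifest: Dict[str, Any]) -> None: ...
    def save_document(self, media_id: str, document: Dict[str, Any]) -> None: ...
    def load_document(self, media_id: str) -> Dict[str, Any]: ...


class SketchFileStore:
    """Writes <root>/<media_id>/manifest.json and document.json."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _write(self, media_id: str, name: str, data: Dict[str, Any]) -> None:
        path = self.root / media_id / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

    def _read(self, media_id: str, name: str) -> Dict[str, Any]:
        path = self.root / media_id / name
        return json.loads(path.read_text(encoding="utf-8"))

    def save_manifest(self, media_id: str, manifest: Dict[str, Any]) -> None:
        self._write(media_id, "manifest.json", manifest)  # lean index

    def save_document(self, media_id: str, document: Dict[str, Any]) -> None:
        self._write(media_id, "document.json", document)  # full body

    def load_manifest(self, media_id: str) -> Dict[str, Any]:
        return self._read(media_id, "manifest.json")

    def load_document(self, media_id: str) -> Dict[str, Any]:
        return self._read(media_id, "document.json")
```

Keeping the large body out of manifest.json means listing or indexing many items only touches small files, while aggregation reads exactly one document.json per item.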