solcatcher 643eaafe71 baseline: abstract_webtools 0.1.6.430 (from PyPI sdist)		2026-06-26 20:04:36 -05:00

Abstract WebTools

Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.

Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher configuration, user‑agent rotation, retries, HTML parsing, link extraction, crawling, headless browsers, and media downloading — behind a set of small, composable managers. The managers share a single URL → request → soup pipeline, so a page is fetched once and reused everywhere downstream instead of being re‑fetched by every layer.

Author: putkoff (Abstract Endeavors)
Source: https://github.com/AbstractEndeavors/abstract_webtools
Python: 3.8+
License: MIT

Why
Install
Quick start
Architecture: the manager chain
The managers
Common recipes
Design notes
Testing
Contributing

Why

Most scraping code re‑implements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.

Abstract WebTools factors each concern into a manager and threads shared instances through the chain. Pass an existing req_mgr (or source code) into any higher‑level manager and it is reused as‑is — no rebuild, no second network request.

Install

pip install abstract_webtools

Optional extras:

pip install "abstract_webtools[drivers]"   # selenium + webdriver-manager
pip install "abstract_webtools[media]"     # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]"       # PyQt/PySimpleGUI helpers

Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media features pull in selenium / playwright / yt-dlp as needed.

Quick start

from abstract_webtools import get_soup, get_source, linkManager

# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)

# Just the raw HTML
html = get_source("https://example.com")

# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)

Architecture: the manager chain

The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:

urlManager        normalize / validate / vary URLs
   └─ requestManager   sessions, retries, TLS, UA  ── networkManager ┐
        └─ soupManager       BeautifulSoup parsing                   ├─ userAgentManager
             ├─ linkManager       link / image extraction            ├─ cipherManager
             └─ crawlManager       site crawling / sitemaps          └─ sslManager + tlsAdapter

Every layer has a matching factory function that detects and reuses an existing instance:

Factory	Returns	Reuses when given
`get_url_mgr(url=, url_mgr=)`	`urlManager`	`url_mgr`
`get_req_mgr(url=, url_mgr=, source_code=, req_mgr=)`	`requestManager`	`req_mgr`
`get_source(...)`	HTML string	`source_code` / `req_mgr`
`get_soup_mgr(...)`	`soupManager`	`soup_mgr` / `req_mgr`
`get_soup(...)`	`BeautifulSoup`	`soup` / `soup_mgr` / `source_code`
`get_crawl_mgr(...)`	`crawlManager`	`req_mgr` / `url_mgr`
`get_managed_session(...)`	`requests.Session`	`req_mgr`

Because every factory short‑circuits on an instance you pass in, the whole chain is built once and shared:

from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager

req = get_req_mgr("https://example.com")        # fetches once
soup_mgr = get_soup_mgr(req_mgr=req)            # no re-fetch
links = linkManager(req_mgr=req)                # no re-fetch
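
The short-circuit behavior described above can be sketched without the library. This is a minimal illustration, not the package's implementation: `RequestManager`, the `fetch` parameter and `fake_fetch` are hypothetical stand-ins so the example runs offline and the number of fetches can be counted.

```python
from typing import Callable, List, Optional


class RequestManager:
    """Hypothetical stand-in for a request manager (not the library's class)."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self.source_code = fetch(url)  # the single "network" call happens here


def get_req_mgr(
    url: Optional[str] = None,
    req_mgr: Optional[RequestManager] = None,
    fetch: Optional[Callable[[str], str]] = None,
) -> RequestManager:
    # Short-circuit: an instance that is passed in is returned untouched.
    if req_mgr is not None:
        return req_mgr
    if url is None or fetch is None:
        raise ValueError("need either req_mgr, or both url and fetch")
    return RequestManager(url, fetch)


calls: List[str] = []


def fake_fetch(url: str) -> str:
    # Records every fetch so reuse can be verified.
    calls.append(url)
    return "<html><title>demo</title></html>"


first = get_req_mgr("https://example.com", fetch=fake_fetch)
second = get_req_mgr(req_mgr=first)  # reused, no second fetch
print(second is first, len(calls))
```

The identity check is the whole point of the pattern: a downstream layer that receives `first` gets the same object back, so its cached source is shared rather than rebuilt.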

The managers

Manager	Responsibility
urlManager	Parse, validate, normalize and generate URL variants.
requestManager	`requests.Session` with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback.
networkManager	Mounts the TLS adapter and wires proxies/cookies/UA into the session.
userAgentManager	Realistic user agents and per‑URL headers (random or pinned by OS/browser).
cipherManager	Cipher‑suite strings for TLS.
sslManager / tlsAdapter	SSL context + `HTTPAdapter` for fine‑grained TLS control.
soupManager	BeautifulSoup parsing, meta/link extraction, attribute discovery.
linkManager	Internal/image link extraction with desired/undesired filters.
crawlManager	Recursive crawling, sitemap generation, domain link discovery.
middleManager	`UnifiedWebManager` — one lazy facade over the whole chain.
usurpManager	Full‑site mirror: pages + assets + styles, references rewritten for offline use.
videoDownloader	Video/media download via `yt-dlp` / `m3u8`, wired to the managed session/UA.
seleneumManager / playwriteManager	Headless‑browser source fetching for JS‑rendered pages.
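
As a rough illustration of what "generate URL variants" in the urlManager row could look like, the sketch below uses only `urllib.parse`. Which variants the library actually produces is not stated here, so the scheme and `www.` combinations, and the `url_variants` helper itself, are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return scheme / www. combinations for a URL (hypothetical helper)."""
    # Assume https when no scheme is given, so urlsplit finds the host.
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("https", "http"):
        for netloc in (bare, "www." + bare):
            candidate = urlunsplit(
                (scheme, netloc, parts.path or "/", parts.query, "")
            )
            if candidate not in variants:
                variants.append(candidate)
    return variants


for variant in url_variants("example.com/docs"):
    print(variant)
```

A caller could try each variant in order until one responds, which is one plausible reason to generate them.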

Common recipes

Get a page's source / soup

from abstract_webtools import get_source, get_soup, get_soup_mgr

html = get_source("https://example.com")
soup = get_soup("https://example.com")

# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)

# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))

Extract links

from abstract_webtools import linkManager

lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],      # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain())                  # unique domains found
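
To make the desired/undesired filtering concrete, here is a standard-library sketch of the idea. It assumes substring matching (as the comment on `link_attr_value_desired` suggests); `LinkCollector` and `filter_links` are hypothetical names, not part of abstract_webtools.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute <a href> and <img src> URLs from a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def filter_links(links, desired=(), undesired=()):
    """Keep links containing any desired token and no undesired token."""
    kept = []
    for link in links:
        if desired and not any(token in link for token in desired):
            continue
        if any(token in link for token in undesired):
            continue
        if link not in kept:  # de-duplicate while preserving order
            kept.append(link)
    return kept


html = """
<a href="/blog/first-post">First</a>
<a href="/about">About</a>
<a href="https://cdn.example.net/blog/feed.xml">Feed</a>
<img src="/static/logo.png">
"""
collector = LinkCollector("https://example.com")
collector.feed(html)
blog_links = filter_links(collector.links, desired=["/blog/"], undesired=[".xml"])
domains = sorted({urlsplit(link).netloc for link in collector.links})
print(blog_links)
print(domains)
```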

Crawl a site

from abstract_webtools import get_crawl_mgr, get_domain_crawl

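# Crawl manager for the site (reuses a req_mgr / url_mgr when one is passed)
crawl = get_crawl_mgr("https://example.com")
# Domain link discovery, bounded by max_depth
domain_links = get_domain_crawl("https://example.com", max_depth=3)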

One shared context: `UnifiedWebManager`

UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code, soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.

from abstract_webtools import UnifiedWebManager

web = UnifiedWebManager("https://example.com")
web.url_mgr      # built on demand
web.source_code  # fetched once
web.soup         # parsed once
web.link_mgr     # shares the same chain — no re-fetch
web.crawl_mgr

# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
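
The "built on demand, fetched once" behavior can be approximated with `functools.cached_property` (available from Python 3.8). This is a sketch of the lazy-facade idea only: `LazyWebContext`, its `fetch` parameter and `fetch_count` are hypothetical, and the title parser stands in for a real soup.

```python
from functools import cached_property
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Tiny stand-in for a soup: extracts only the <title> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LazyWebContext:
    """Hypothetical facade: each attribute is computed once, on first access."""

    def __init__(self, url=None, source_code=None, fetch=None):
        self.url = url
        self._seed_source = source_code
        self._fetch = fetch
        self.fetch_count = 0

    @cached_property
    def source_code(self):
        if self._seed_source is not None:
            return self._seed_source  # caller supplied HTML: zero network
        if self.url is None or self._fetch is None:
            raise ValueError("need source_code, or both url and fetch")
        self.fetch_count += 1
        return self._fetch(self.url)

    @cached_property
    def title(self):
        parser = _TitleParser()
        parser.feed(self.source_code)
        return parser.title


web = LazyWebContext(
    "https://example.com",
    fetch=lambda url: "<html><title>Demo</title></html>",
)
print(web.title, web.title, web.fetch_count)

offline = LazyWebContext(source_code="<title>Offline</title>")
print(offline.title, offline.fetch_count)
```

Nothing is fetched at construction time; the first attribute access triggers the work and later accesses read the cached value.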

A managed `requests.Session`

Need a plain session, but configured with a real user agent, ciphers, the TLS adapter and proxies? Ask the stack for one — it never fetches just to build it, and reuses an existing req_mgr's session when given:

from abstract_webtools import get_managed_session

session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
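
For readers curious what "ciphers" and "the TLS adapter" amount to underneath, the sketch below builds an SSL context with an explicit cipher string using only the standard library. With requests, a context like this is typically handed to urllib3 through an `HTTPAdapter` subclass; that step is omitted here. The `build_ssl_context` helper and the cipher string are illustrative assumptions, not the values the library uses.

```python
import ssl


def build_ssl_context(ciphers="ECDHE+AESGCM:ECDHE+CHACHA20"):
    """Return a verifying SSL context restricted to the given cipher string."""
    context = ssl.create_default_context()  # certificate + hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Raises ssl.SSLError if the string selects no ciphers at all.
    context.set_ciphers(ciphers)
    return context


context = build_ssl_context()
cipher_names = [cipher["name"] for cipher in context.get_ciphers()]
print(len(cipher_names), "cipher suites enabled")
```

Keeping certificate verification enabled while narrowing the cipher list is the safer direction: the goal is compatibility with picky servers, not weaker security.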

Mirror an entire site (`usurpManager`)

usurpManager saves a working offline copy of a site — pages and styles intact. By default it recursively captures the whole site: every same‑domain page link and all referenced media. It follows CSS url(...) / @import (including @font-face and cross‑domain CDN fonts), handles srcset, inline style="" and <style> blocks, downloads scripts/images/linked files, and rewrites every reference to a relative local path so the result renders straight from file://.

from abstract_webtools import usurpit

# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")

Or drive it directly for more control:

from abstract_webtools import usurpManager, get_req_mgr

req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                      # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,                   # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,      # pull CDN css/fonts so styles work (default)
)
summary = site.main()

The crawl is breadth‑first and unlimited‑depth by default (max_depth=None); a visited set ensures each URL is processed at most once, which keeps the crawl finite and loop‑free. Pass an integer max_depth to bound it.
Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on‑origin).
A single url → local path map keeps references consistent and shared assets are fetched exactly once.
For heavily JS‑rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
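
The notes above (breadth-first traversal, a visited set, an optional depth cap, and a single url → local path map) can be sketched over an in-memory link graph. This is an illustration under stated assumptions, not usurpManager's code: `SITE`, `local_path` and `crawl` are hypothetical, and the exact meaning of `max_depth` (link hops from the start page) is assumed.

```python
from collections import deque
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical site: page URL -> links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/a/deep"],
    "https://example.com/b": ["https://example.com/a", "https://cdn.example.net/x"],
    "https://example.com/a/deep": [],
}


def local_path(url):
    """Map a URL to a relative file path (assumed naming scheme)."""
    parts = urlsplit(url)
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"
    elif not PurePosixPath(path).suffix:
        path += ".html"
    return parts.netloc + path


def crawl(start, max_depth=None):
    """Breadth-first crawl of same-host pages; returns the url -> path map."""
    origin = urlsplit(start).netloc
    visited = {start}
    url_to_path = {start: local_path(start)}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # depth cap reached: do not expand this page
        for link in SITE.get(url, []):
            if urlsplit(link).netloc != origin or link in visited:
                continue  # off-origin page, or already queued once
            visited.add(link)
            url_to_path[link] = local_path(link)
            queue.append((link, depth + 1))
    return url_to_path


for url, path in crawl("https://example.com/").items():
    print(url, "->", path)
```

The graph contains cycles (`/` links to `/a`, which links back), yet the crawl ends because a URL enters the queue only once. Passing `max_depth=1` stops after the pages linked directly from the start page.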

Download video / media

from abstract_webtools import get_video_info, downloadvideo

info = get_video_info("https://www.youtube.com/watch?v=...")   # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")

The downloader pulls its user agent (and proxy) from the shared request stack and threads them into yt-dlp, so downloads use the same identity as the rest of your scrape. You can inject an existing req_mgr / ua_mgr:

from abstract_webtools import VideoDownloader, get_req_mgr

req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
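
Threading the shared identity into yt-dlp comes down to building its options dictionary from values the request stack already holds. The sketch below shows one way to do that; `build_ytdlp_options` is a hypothetical helper, and how the library itself assembles these options is not shown here. `outtmpl`, `http_headers` and `proxy` are standard yt-dlp options.

```python
import os


def build_ytdlp_options(user_agent, download_directory, proxy=None):
    """Build a yt-dlp options dict that reuses an existing UA and proxy."""
    options = {
        # Output template: <directory>/<video title>.<extension>
        "outtmpl": os.path.join(download_directory, "%(title)s.%(ext)s"),
        # Sent with every request yt-dlp makes.
        "http_headers": {"User-Agent": user_agent},
    }
    if proxy:
        options["proxy"] = proxy
    return options


opts = build_ytdlp_options(
    "MyBot/1.0", "videos", proxy="http://127.0.0.1:8080"
)
print(opts["http_headers"], opts["proxy"])
```

A dictionary like this would be passed to `yt_dlp.YoutubeDL(opts)`; the import is left out so the sketch runs without the media extra installed.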

Design notes

Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused — including by usurpManager and videoDownloader.
Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
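
A common way to keep heavy dependencies optional is to attempt the import and defer the failure until the feature is actually used. The sketch below shows that general pattern; `optional_import` and `require` are hypothetical helpers, and the package's own guards may be written differently.

```python
import importlib


def optional_import(module_name):
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return None


def require(module, extra):
    """Fail with an actionable message only when the feature is used."""
    if module is None:
        raise RuntimeError(
            f'missing optional dependency; try: pip install "abstract_webtools[{extra}]"'
        )
    return module


json_module = optional_import("json")  # stdlib, always present
missing = optional_import("definitely_not_installed_pkg_xyz")
print(json_module is not None, missing is None)
```

Importing the core package then never fails because of a missing extra; only the call that needs the extra raises, and the message names the install command.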

Testing

The repo ships dependency‑light regression tests (only requests + beautifulsoup4 required) that load the real modules under a controlled namespace and assert the no‑refetch behavior and the site mirror:

python tests/test_manager_chain.py        # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py    # managed session for video + usurp
python tests/test_usurp_mirror.py         # full-site mirror with styles

Contributing

Issues and PRs welcome at AbstractEndeavors/abstract_webtools. Please keep new functionality threaded through the shared manager chain (accept and reuse url_mgr / req_mgr / source_code) rather than re‑fetching, and add a dependency‑light test where practical.
