src/abstract_webtools baseline: abstract_webtools 0.1.6.430 (from PyPI sdist) 2026-06-26 20:04:36 -05:00

Abstract WebTools

Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.

Abstract WebTools wraps the messy parts of web access (HTTP sessions, TLS/cipher configuration, user-agent rotation, retries, HTML parsing, link extraction, crawling, headless browsers, and media downloading) behind a set of small, composable managers. The managers share a single URL → request → soup pipeline, so a page is fetched once and reused everywhere downstream instead of being refetched by every layer.



Why

Most scraping code reimplements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.

Abstract WebTools factors each concern into a manager and threads shared instances through the chain. Pass an existing req_mgr (or source code) into any higher-level manager and it is reused as-is: no rebuild, no second network request.


Install

pip install abstract_webtools

Optional extras:

pip install "abstract_webtools[drivers]"   # selenium + webdriver-manager
pip install "abstract_webtools[media]"     # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]"       # PyQt/PySimpleGUI helpers

Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media features pull in selenium / playwright / yt-dlp as needed.


Quick start

from abstract_webtools import get_soup, get_source, linkManager

# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)

# Just the raw HTML
html = get_source("https://example.com")

# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)

Architecture: the manager chain

The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:

urlManager        normalize / validate / vary URLs
   └─ requestManager   sessions, retries, TLS, UA  ── networkManager ┐
        └─ soupManager       BeautifulSoup parsing                   ├─ userAgentManager
             ├─ linkManager       link / image extraction            ├─ cipherManager
             └─ crawlManager       site crawling / sitemaps          └─ sslManager + tlsAdapter

Every layer has a matching factory function that detects and reuses an existing instance:

| Factory | Returns | Reuses when given |
| --- | --- | --- |
| get_url_mgr(url=, url_mgr=) | urlManager | url_mgr |
| get_req_mgr(url=, url_mgr=, source_code=, req_mgr=) | requestManager | req_mgr |
| get_source(...) | HTML string | source_code / req_mgr |
| get_soup_mgr(...) | soupManager | soup_mgr / req_mgr |
| get_soup(...) | BeautifulSoup | soup / soup_mgr / source_code |
| get_crawl_mgr(...) | crawlManager | req_mgr / url_mgr |
| get_managed_session(...) | requests.Session | req_mgr |

Because every factory short-circuits on an instance you pass in, the whole chain is built once and shared:

from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager

req = get_req_mgr("https://example.com")        # fetches once
soup_mgr = get_soup_mgr(req_mgr=req)            # no re-fetch
links = linkManager(req_mgr=req)                # no re-fetch
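
The short-circuit behavior described above can be illustrated without the library itself. The following is a minimal sketch, not the package's implementation: FakeRequestManager, get_fake_req_mgr, and counting_fetch are hypothetical names, and the fetch is simulated so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeRequestManager:
    """Hypothetical stand-in for a request manager (not the library's class)."""
    url: Optional[str] = None
    fetch: Optional[Callable[[str], str]] = None
    source_code: Optional[str] = None

    def __post_init__(self) -> None:
        # Fetch only when the caller did not supply the source already.
        if self.source_code is None and self.fetch is not None:
            self.source_code = self.fetch(self.url)


def get_fake_req_mgr(url=None, source_code=None, req_mgr=None, fetch=None):
    """Return req_mgr unchanged when given; otherwise build a new manager."""
    if req_mgr is not None:
        return req_mgr  # short-circuit: no rebuild, no fetch
    return FakeRequestManager(url=url, fetch=fetch, source_code=source_code)


calls = []


def counting_fetch(url: str) -> str:
    """Simulated network call that records every URL it is asked for."""
    calls.append(url)
    return f"<html><title>{url}</title></html>"


req = get_fake_req_mgr("https://example.com", fetch=counting_fetch)
same = get_fake_req_mgr(req_mgr=req)  # reused as-is
offline = get_fake_req_mgr(source_code="<html></html>", fetch=counting_fetch)
```

After these three calls, `calls` holds a single URL: the second call returned the existing instance, and the third was built from supplied source code.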

The managers

| Manager | Responsibility |
| --- | --- |
| urlManager | Parse, validate, normalize and generate URL variants. |
| requestManager | requests.Session with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback. |
| networkManager | Mounts the TLS adapter and wires proxies/cookies/UA into the session. |
| userAgentManager | Realistic user agents and per-URL headers (random or pinned by OS/browser). |
| cipherManager | Cipher-suite strings for TLS. |
| sslManager / tlsAdapter | SSL context + HTTPAdapter for fine-grained TLS control. |
| soupManager | BeautifulSoup parsing, meta/link extraction, attribute discovery. |
| linkManager | Internal/image link extraction with desired/undesired filters. |
| crawlManager | Recursive crawling, sitemap generation, domain link discovery. |
| middleManager | UnifiedWebManager: one lazy facade over the whole chain. |
| usurpManager | Full-site mirror: pages + assets + styles, references rewritten for offline use. |
| videoDownloader | Video/media download via yt-dlp / m3u8, wired to the managed session/UA. |
| seleneumManager / playwriteManager | Headless-browser source fetching for JS-rendered pages. |
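
To make "normalize and generate URL variants" concrete, here is a rough sketch built only on the standard library. It is an assumption about what such normalization could look like, not urlManager's actual logic; normalize_url and url_variants are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str, default_scheme: str = "https") -> str:
    """Add a scheme if missing, lower-case scheme and host, drop the fragment."""
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def url_variants(url: str) -> list:
    """Generate https/http and bare/www variants of a URL."""
    parts = urlsplit(normalize_url(url))
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    variants = []
    for scheme in ("https", "http"):
        for netloc in (host, f"www.{host}"):
            variants.append(
                urlunsplit((scheme, netloc, parts.path, parts.query, ""))
            )
    return variants


normalized = normalize_url("Example.COM/Docs#top")
variants = url_variants("https://www.example.com")
```

A caller could try each variant in turn until one responds, which is one plausible reason for generating them.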

Common recipes

Get a page's source / soup

from abstract_webtools import get_source, get_soup, get_soup_mgr

html = get_source("https://example.com")
soup = get_soup("https://example.com")

# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)

# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))

Extract and filter links

from abstract_webtools import linkManager

lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],      # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain())                  # unique domains found
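
For intuition about desired/undesired filtering, the sketch below collects links with the standard library's HTMLParser and filters them by substring. It assumes substring matching, which may differ from linkManager's real rules; LinkCollector and filter_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> and src values from <img>."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def filter_links(links, desired=(), undesired=()):
    """Keep unique links containing a desired token and no undesired token."""
    kept = []
    for link in links:
        if desired and not any(token in link for token in desired):
            continue
        if any(token in link for token in undesired):
            continue
        if link not in kept:
            kept.append(link)
    return kept


html = (
    '<a href="/blog/a">A</a><a href="/about">B</a><a href="/blog/a">dup</a>'
    '<a href="https://cdn.example.net/blog/x.pdf">C</a>'
    '<img src="/img/logo.png">'
)
collector = LinkCollector("https://example.com")
collector.feed(html)
blog_links = filter_links(collector.links, desired=["/blog/"], undesired=[".pdf"])
domains = sorted({urlsplit(link).netloc for link in collector.links})
```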

Crawl a site

from abstract_webtools import get_crawl_mgr, get_domain_crawl

crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
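
A depth-limited crawl is commonly written as a breadth-first traversal with a visited set. The sketch below shows that general technique over an in-memory link graph; it is not necessarily how crawlManager or get_domain_crawl work internally, and crawl_breadth_first, get_links, and the site dictionary are hypothetical.

```python
from collections import deque


def crawl_breadth_first(start, get_links, max_depth=None):
    """Visit pages level by level; max_depth=None means no depth limit."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # at the cap: record the page but do not expand it
        for link in get_links(url):
            if link not in visited:  # the visited set prevents loops
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny link graph with cycles, standing in for real pages.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": ["/a", "/d"],
    "/d": [],
}
full = crawl_breadth_first("/", lambda url: site.get(url, []))
shallow = crawl_breadth_first("/", lambda url: site.get(url, []), max_depth=1)
```

Even though the graph contains cycles, each page appears once in the result because links are marked visited when they are queued.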

One shared context: UnifiedWebManager

UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code, soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.

from abstract_webtools import UnifiedWebManager

web = UnifiedWebManager("https://example.com")
web.url_mgr      # built on demand
web.source_code  # fetched once
web.soup         # parsed once
web.link_mgr     # shares the same chain — no re-fetch
web.crawl_mgr

# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
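
The lazy build-and-cache idea can be sketched with functools.cached_property. This is an illustration of the pattern only, under the assumption that each attribute is computed on first access and then stored; LazyPage, TitleParser, and fake_fetch are hypothetical and do not mirror UnifiedWebManager's code.

```python
from functools import cached_property
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Minimal parser that captures the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class LazyPage:
    """Hypothetical facade: each attribute is built on demand, then cached."""

    def __init__(self, url=None, source_code=None, fetch=None):
        self.url = url
        self._fetch = fetch
        if source_code is not None:
            # Pre-seed the cache so the property below never calls fetch.
            self.__dict__["source_code"] = source_code

    @cached_property
    def source_code(self):
        if self._fetch is None:
            raise ValueError("no source_code and no fetch callable supplied")
        return self._fetch(self.url)

    @cached_property
    def title(self):
        parser = TitleParser()
        parser.feed(self.source_code)
        return parser.title


calls = []


def fake_fetch(url):
    """Simulated fetch that records how often it is called."""
    calls.append(url)
    return "<html><head><title>Example</title></head></html>"


page = LazyPage("https://example.com", fetch=fake_fetch)
first = page.title   # triggers the one and only fetch
second = page.title  # served from the cache
offline = LazyPage(source_code="<title>Local</title>")
```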

A managed requests.Session

Need a plain session, but configured with a real user agent, ciphers, the TLS adapter and proxies? Ask the stack for one — it never fetches just to build it, and reuses an existing req_mgr's session when given:

from abstract_webtools import get_managed_session

session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
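
For readers curious what "configured with a user agent, ciphers and TLS settings" involves, here is a standard-library analogue. It deliberately uses ssl and urllib.request rather than requests, so it is not how get_managed_session is built; build_opener_with_identity and the cipher string are assumptions chosen for illustration. Building the opener performs no network request.

```python
import ssl
import urllib.request


def build_opener_with_identity(
    user_agent: str,
    ciphers: str = "ECDHE+AESGCM:ECDHE+CHACHA20",
):
    """Build a URL opener with a pinned user agent and restricted TLS ciphers."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.set_ciphers(ciphers)  # OpenSSL cipher-list syntax
    handler = urllib.request.HTTPSHandler(context=context)
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, context


opener, context = build_opener_with_identity("MyBot/1.0")
# opener.open("https://example.com") would now use these settings.
```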

Mirror an entire site (usurpManager)

usurpManager saves a working offline copy of a site, with pages and styles intact. By default it recursively captures the whole site: every same-domain page link and all referenced media. It follows CSS url(...) / @import (including @font-face and cross-domain CDN fonts), handles srcset, inline style="" and <style> blocks, downloads scripts/images/linked files, and rewrites every reference to a relative local path so the result renders straight from file://.

from abstract_webtools import usurpit

# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")

Or drive it directly for more control:

from abstract_webtools import usurpManager, get_req_mgr

req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                      # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,                   # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,      # pull CDN css/fonts so styles work (default)
)
summary = site.main()

  • The crawl is breadth-first and unlimited-depth by default (max_depth=None); the visited set keeps it finite and loop-free. Pass an integer max_depth to bound it.
  • Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on-origin).
  • A single url → local path map keeps references consistent, and shared assets are fetched exactly once.
  • For heavily JS-rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
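
The url → local path map and the reference rewriting mentioned above can be sketched as follows. This is a simplified illustration, not usurpManager's code: local_path_for, rewrite_css, and the host/path directory layout are assumptions, query strings are ignored, and only CSS url(...) references are handled.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Matches url(...), with or without quotes around the reference.
CSS_URL = re.compile(r"url\(\s*(['\"]?)(.*?)\1\s*\)")


def local_path_for(url: str) -> str:
    """Map an absolute URL to host/path under the mirror root."""
    parts = urlsplit(url)
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"
    return posixpath.join(parts.netloc, path.lstrip("/"))


def rewrite_css(css: str, css_url: str, url_map: dict) -> str:
    """Rewrite url(...) references relative to the stylesheet's local copy."""
    css_local = url_map.setdefault(css_url, local_path_for(css_url))
    css_dir = posixpath.dirname(css_local)

    def replace(match):
        ref = match.group(2)
        if ref.startswith("data:"):
            return match.group(0)  # leave inline data URIs untouched
        absolute = urljoin(css_url, ref)
        # One shared map: the same asset always resolves to the same path.
        local = url_map.setdefault(absolute, local_path_for(absolute))
        return f"url({posixpath.relpath(local, css_dir)})"

    return CSS_URL.sub(replace, css)


url_map = {}
css = (
    "body{background:url('../img/bg.png')} "
    "@font-face{src:url(https://cdn.example.net/f/a.woff2)}"
)
rewritten = rewrite_css(css, "https://example.com/static/css/site.css", url_map)
```

Because every reference goes through the same url_map, a downloader can iterate over its keys and fetch each asset once, regardless of how many pages or stylesheets point at it.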

Download video / media

from abstract_webtools import get_video_info, downloadvideo

info = get_video_info("https://www.youtube.com/watch?v=...")   # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")

The downloader pulls its user agent (and proxy) from the shared request stack and threads them into yt-dlp, so downloads use the same identity as the rest of your scrape. You can inject an existing req_mgr / ua_mgr:

from abstract_webtools import VideoDownloader, get_req_mgr

req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
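
One way a user agent and proxy could be threaded into yt-dlp is through its options dictionary. The sketch below only builds that dictionary and does not import yt-dlp; build_ytdlp_options is a hypothetical helper, and whether VideoDownloader assembles its options this way is an assumption.

```python
def build_ytdlp_options(user_agent, download_directory, proxy=None):
    """Build a yt-dlp options dict that carries a shared identity."""
    options = {
        # Custom headers sent with yt-dlp's HTTP requests.
        "http_headers": {"User-Agent": user_agent},
        # Output filename template inside the download directory.
        "outtmpl": f"{download_directory}/%(title)s.%(ext)s",
    }
    if proxy:
        options["proxy"] = proxy
    return options


options = build_ytdlp_options(
    "MyBot/1.0", "videos", proxy="http://127.0.0.1:8080"
)
# With yt-dlp installed, the dict could be used like this:
#   import yt_dlp
#   with yt_dlp.YoutubeDL(options) as ydl:
#       ydl.download(["https://example.com/video.mp4"])
```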

Design notes

  • Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
  • One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused — including by usurpManager and videoDownloader.
  • Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
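
Defensive importing of optional dependencies usually follows a pattern like the one below. It is a generic sketch rather than the package's own code; optional_import and require are hypothetical helpers, and the extra name comes from the Install section.

```python
import importlib


def optional_import(module_name: str):
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module, feature: str, extra: str):
    """Fail with an actionable message only when the feature is used."""
    if module is None:
        raise RuntimeError(
            f"{feature} needs an optional dependency; "
            f'try: pip install "abstract_webtools[{extra}]"'
        )
    return module


# Importing at module level never fails, even without selenium installed.
selenium = optional_import("selenium")


def fetch_rendered_source(url: str):
    """Hypothetical entry point that needs the browser extra."""
    require(selenium, "Headless-browser fetching", "drivers")
    raise NotImplementedError("browser automation omitted in this sketch")
```

The error surfaces at call time rather than import time, so users who never touch the browser features are unaffected.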

Testing

The repo ships dependency-light regression tests (only requests + beautifulsoup4 required) that load the real modules under a controlled namespace and assert the no-refetch behavior and the site mirror:

python tests/test_manager_chain.py        # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py    # managed session for video + usurp
python tests/test_usurp_mirror.py         # full-site mirror with styles

Contributing

Issues and PRs welcome at AbstractEndeavors/abstract_webtools. Please keep new functionality threaded through the shared manager chain (accept and reuse url_mgr / req_mgr / source_code) rather than refetching, and add a dependency-light test where practical.