Abstract WebTools
Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.
Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher
configuration, user‑agent rotation, retries, HTML parsing, link extraction,
crawling, headless browsers, and media downloading — behind a set of small,
composable managers. The managers share a single URL → request → soup
pipeline, so a page is fetched once and reused everywhere downstream instead
of being re‑fetched by every layer.
- Author: putkoff (Abstract Endeavors)
- Source: https://github.com/AbstractEndeavors/abstract_webtools
- Python: 3.8+
- License: MIT
Table of contents
- Why
- Install
- Quick start
- Architecture: the manager chain
- The managers
- Common recipes
- Design notes
- Testing
- Contributing
Why
Most scraping code re‑implements the same plumbing on every project: building a session, picking a user agent, tuning TLS so a server doesn't reject you, handling retries, parsing HTML, then doing it all again for the next step.
Abstract WebTools factors each concern into a manager and threads shared
instances through the chain. Pass an existing req_mgr (or source code) into
any higher‑level manager and it is reused as‑is — no rebuild, no second network
request.
Install
pip install abstract_webtools
Optional extras:
pip install "abstract_webtools[drivers]" # selenium + webdriver-manager
pip install "abstract_webtools[media]" # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]" # PyQt/PySimpleGUI helpers
Core runtime deps: requests, urllib3, beautifulsoup4. Browser and media
features pull in selenium / playwright / yt-dlp as needed.
Quick start
from abstract_webtools import get_soup, get_source, linkManager
# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)
# Just the raw HTML
html = get_source("https://example.com")
# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)
Architecture: the manager chain
The core managers form a layered pipeline. Each layer accepts the layer(s) below it and reuses them when provided:
urlManager                  normalize / validate / vary URLs
└─ requestManager           sessions, retries, TLS, UA ── networkManager ┐
   └─ soupManager           BeautifulSoup parsing                        ├─ userAgentManager
      ├─ linkManager        link / image extraction                      ├─ cipherManager
      └─ crawlManager       site crawling / sitemaps                     └─ sslManager + tlsAdapter
Every layer has a matching factory function that detects and reuses an existing instance:
| Factory | Returns | Reuses when given |
|---|---|---|
| get_url_mgr(url=, url_mgr=) | urlManager | url_mgr |
| get_req_mgr(url=, url_mgr=, source_code=, req_mgr=) | requestManager | req_mgr |
| get_source(...) | HTML string | source_code / req_mgr |
| get_soup_mgr(...) | soupManager | soup_mgr / req_mgr |
| get_soup(...) | BeautifulSoup | soup / soup_mgr / source_code |
| get_crawl_mgr(...) | crawlManager | req_mgr / url_mgr |
| get_managed_session(...) | requests.Session | req_mgr |
Because every factory short‑circuits on an instance you pass in, the whole chain is built once and shared:
from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager
req = get_req_mgr("https://example.com") # fetches once
soup_mgr = get_soup_mgr(req_mgr=req) # no re-fetch
links = linkManager(req_mgr=req) # no re-fetch
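The short-circuit behavior can be pictured with a small stand-alone sketch. This is not the library's code: `FakeRequestManager`, `get_req_mgr_sketch` and the fetch counter are hypothetical names invented for illustration, and the "fetch" is simulated rather than sent over the network.

```python
class FakeRequestManager:
    """Hypothetical stand-in for a request manager; counts simulated fetches."""

    fetch_count = 0

    def __init__(self, url):
        self.url = url
        # Simulate the single network request made at construction time.
        FakeRequestManager.fetch_count += 1
        self.source_code = f"<html><title>{url}</title></html>"


def get_req_mgr_sketch(url=None, req_mgr=None):
    """Return the instance passed in, or build one if none was given."""
    if req_mgr is not None:
        return req_mgr  # short-circuit: no rebuild, no second fetch
    return FakeRequestManager(url)


req = get_req_mgr_sketch("https://example.com")  # simulated fetch happens here
same = get_req_mgr_sketch(req_mgr=req)           # existing instance reused as-is
```

Because the factory returns the object it was handed, every downstream layer ends up holding the same instance and the fetch counter stays at one.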
The managers
| Manager | Responsibility |
|---|---|
| urlManager | Parse, validate, normalize and generate URL variants. |
| requestManager | requests.Session with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback. |
| networkManager | Mounts the TLS adapter and wires proxies/cookies/UA into the session. |
| userAgentManager | Realistic user agents and per‑URL headers (random or pinned by OS/browser). |
| cipherManager | Cipher‑suite strings for TLS. |
| sslManager / tlsAdapter | SSL context + HTTPAdapter for fine‑grained TLS control. |
| soupManager | BeautifulSoup parsing, meta/link extraction, attribute discovery. |
| linkManager | Internal/image link extraction with desired/undesired filters. |
| crawlManager | Recursive crawling, sitemap generation, domain link discovery. |
| middleManager | UnifiedWebManager — one lazy facade over the whole chain. |
| usurpManager | Full‑site mirror: pages + assets + styles, references rewritten for offline use. |
| videoDownloader | Video/media download via yt-dlp / m3u8, wired to the managed session/UA. |
| seleneumManager / playwriteManager | Headless‑browser source fetching for JS‑rendered pages. |
Common recipes
Get a page's source / soup
from abstract_webtools import get_source, get_soup, get_soup_mgr
html = get_source("https://example.com")
soup = get_soup("https://example.com")
# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)
# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))
Extract links
from abstract_webtools import linkManager
lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],  # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain()) # unique domains found
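To make the "desired" filter concrete, here is a standard-library sketch of the same idea: collect hrefs, keep those containing a wanted substring, and list the unique domains. It assumes substring matching, as the comment in the recipe suggests. `LinkCollector` and `filter_links` are hypothetical helpers, not part of abstract_webtools, and the library's own matching rules may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def filter_links(links, desired):
    """Keep links that contain at least one of the desired substrings."""
    return [link for link in links if any(part in link for part in desired)]


html = (
    '<a href="/blog/post-1">one</a>'
    '<a href="/about">two</a>'
    '<a href="https://other.org/blog/x">three</a>'
)
collector = LinkCollector("https://example.com")
collector.feed(html)

blog_links = filter_links(collector.links, ["/blog/"])
domains = sorted({urlparse(link).netloc for link in collector.links})
```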
Crawl a site
from abstract_webtools import get_crawl_mgr, get_domain_crawl
crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
One shared context: UnifiedWebManager
UnifiedWebManager lazily builds and caches url_mgr, req_mgr, source_code,
soup_mgr, soup, plus link_mgr / crawl_mgr — all over a single fetch.
from abstract_webtools import UnifiedWebManager
web = UnifiedWebManager("https://example.com")
web.url_mgr # built on demand
web.source_code # fetched once
web.soup # parsed once
web.link_mgr # shares the same chain — no re-fetch
web.crawl_mgr
# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
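The lazy build-and-cache idea can be sketched with `functools.cached_property`. This is an illustration under stated assumptions, not how UnifiedWebManager is implemented: `LazyWebContext`, its injected `fetch` callable and the crude `title` property are all hypothetical.

```python
from functools import cached_property


class LazyWebContext:
    """Hypothetical lazy facade: each stage is computed on first access, then cached."""

    def __init__(self, url=None, source_code=None, fetch=None):
        self.url = url
        self._given_source = source_code
        self._fetch = fetch  # callable(url) -> str, injected so the sketch needs no network
        self.fetch_calls = 0

    @cached_property
    def source_code(self):
        if self._given_source is not None:
            return self._given_source  # started from existing HTML: zero fetches
        self.fetch_calls += 1
        return self._fetch(self.url)

    @cached_property
    def title(self):
        # Crude title extraction; stands in for the parsed "soup" stage.
        html = self.source_code
        start = html.find("<title>")
        end = html.find("</title>")
        if start == -1 or end == -1:
            return None
        return html[start + len("<title>"):end]


web = LazyWebContext(
    "https://example.com",
    fetch=lambda url: "<html><title>Example</title></html>",
)
first = web.title          # triggers the one simulated fetch
second = web.source_code   # served from the cache
offline = LazyWebContext(source_code="<html><title>Local</title></html>")
```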
A managed requests.Session
Need a plain session, but configured with a real user agent, ciphers, the TLS
adapter and proxies? Ask the stack for one — it never fetches just to build it,
and reuses an existing req_mgr's session when given:
from abstract_webtools import get_managed_session
session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
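For a sense of what "configured but not yet fetched" means, the following standard-library analogue builds a TLS context and a pinned user agent without sending anything. It deliberately uses `ssl` and `urllib.request` rather than the package's own managers, and the cipher string and `MyBot/1.0` value are example choices, not the library's defaults.

```python
import ssl
import urllib.request

# Build a TLS context with an explicit minimum version and cipher preference.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")  # OpenSSL cipher-string syntax

# Attach the context and a pinned user agent to a reusable opener.
opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
opener.addheaders = [("User-Agent", "MyBot/1.0")]

# Nothing has been fetched yet; a request would only happen on opener.open(url).
```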
Mirror an entire site (usurpManager)
usurpManager saves a working offline copy of a site — pages and styles
intact. By default it recursively captures the whole site: every
same‑domain page link and all referenced media. It follows CSS url(...) /
@import (including @font-face and cross‑domain CDN fonts), handles srcset,
inline style="" and <style> blocks, downloads scripts/images/linked files,
and rewrites every reference to a relative local path so the result renders
straight from file://.
from abstract_webtools import usurpit
# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")
Or drive it directly for more control:
from abstract_webtools import usurpManager, get_req_mgr
req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                  # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,               # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,  # pull CDN css/fonts so styles work (default)
)
summary = site.main()
- The crawl is breadth‑first and unlimited‑depth by default (max_depth=None); the visited‑set keeps it finite/loop‑free. Pass an integer max_depth to bound it.
- Pages are mirrored within the origin host; referenced assets may come from CDNs (set mirror_external_assets=False to stay strictly on‑origin).
- A single url → local path map keeps references consistent and shared assets are fetched exactly once.
- For heavily JS‑rendered sites, fetch the rendered HTML first via seleneumManager / playwriteManager.
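The crawl properties listed above (breadth-first order, a visited set, an optional depth cap, and one url → local path map) can be sketched over an in-memory link graph. This is a simplified model, not usurpManager's implementation: the `SITE` graph, `local_path` naming scheme and `crawl` function are hypothetical, and no files or network requests are involved.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory site: page URL -> links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/a/deep"],
    "https://example.com/b": ["https://example.com/a", "https://cdn.example.net/x"],
    "https://example.com/a/deep": ["https://example.com/"],
}


def local_path(url):
    """Map a URL to a relative file path (illustrative naming scheme only)."""
    path = urlparse(url).path.strip("/")
    return (path or "index") + ".html"


def crawl(start, max_depth=None):
    """Breadth-first walk of same-host pages; returns {url: local_path}."""
    host = urlparse(start).netloc
    url_to_path = {start: local_path(start)}  # doubles as the visited set
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in SITE.get(url, []):
            if urlparse(link).netloc != host or link in url_to_path:
                continue  # off-origin page, or already visited
            url_to_path[link] = local_path(link)
            queue.append((link, depth + 1))
    return url_to_path


full = crawl("https://example.com/")
shallow = crawl("https://example.com/", max_depth=1)
```

Even though the pages link back to each other, the visited check means each URL is queued once, so the unlimited-depth walk still terminates.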
Download video / media
from abstract_webtools import get_video_info, downloadvideo
info = get_video_info("https://www.youtube.com/watch?v=...") # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")
The downloader pulls its user agent (and proxy) from the shared request stack
and threads them into yt-dlp, so downloads use the same identity as the rest
of your scrape. You can inject an existing req_mgr / ua_mgr:
from abstract_webtools import VideoDownloader, get_req_mgr
req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
download_directory="videos")
Design notes
- Reuse over rebuild. Every factory and constructor honors an instance you pass in. Supplying source_code or a req_mgr means zero extra network requests downstream.
- One session, fully configured. TLS ciphers, SSL context, the HTTP adapter, user agent, proxies and cookies are assembled once by the request stack and reused, including by usurpManager and videoDownloader.
- Optional heavy deps stay optional. Browser/media/GUI extras are imported defensively so the core package imports without them.
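One common way to import optional extras defensively is shown below. It is a generic pattern offered as a sketch, not a copy of the package's import code; `optional_import` and `fetch_rendered` are hypothetical names, and `selenium` is used only as an example of an optional dependency.

```python
import importlib


def optional_import(module_name):
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "selenium" is used here only as an example of an optional extra.
selenium = optional_import("selenium")


def fetch_rendered(url):
    """Fail at call time, not import time, when the extra is missing."""
    if selenium is None:
        raise RuntimeError("install the browser extra to use fetch_rendered()")
    raise NotImplementedError("browser automation omitted from this sketch")
```

With this shape, importing the module always succeeds, and the missing dependency only surfaces when a feature that needs it is actually called.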
Testing
The repo ships dependency‑light regression tests (only requests +
beautifulsoup4 required) that load the real modules under a controlled
namespace and assert the no‑refetch behavior and the site mirror:
python tests/test_manager_chain.py # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py # managed session for video + usurp
python tests/test_usurp_mirror.py # full-site mirror with styles
Contributing
Issues and PRs welcome at
AbstractEndeavors/abstract_webtools.
Please keep new functionality threaded through the shared manager chain (accept
and reuse url_mgr / req_mgr / source_code) rather than re‑fetching, and add
a dependency‑light test where practical.