# Scrapling-main

**Repository Path**: jiang_dn/scrapling-main

## Basic Information

- **Project Name**: Scrapling-main
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: dev
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-28
- **Last Updated**: 2026-03-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Effortless Web Scraping for the Modern Web

Selection Methods · Fetchers · Spiders · Proxy Rotation · CLI · MCP

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. Its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation in just a few lines of Python. One library, zero compromises.

It delivers blazing-fast crawls with real-time stats and streaming output. Built by Web Scrapers for Web Scrapers and regular users alike, it has something for everyone.

```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch website under the radar!
products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True)  # Later, if the website structure changes, pass `adaptive=True` to find them!
```

Or scale up to full crawls:

```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


MySpider().start()
```
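To make the idea of "relocating elements after a page changes" more concrete, here is a purely conceptual sketch using only the Python standard library. It is not Scrapling's actual algorithm: the dict-based element representation, the `fingerprint` and `relocate` helpers, and the `0.5` threshold are all hypothetical and chosen only for illustration. The sketch saves a description of an element, then picks the most similar candidate on a redesigned page.

```python
from difflib import SequenceMatcher
from typing import Optional


def fingerprint(element: dict) -> str:
    """Flatten tag, attributes, and text into one comparable string (hypothetical format)."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element.get("attrs", {}).items()))
    return f"{element['tag']} {attrs} {element.get('text', '')}"


def relocate(saved: dict, candidates: list, threshold: float = 0.5) -> Optional[dict]:
    """Return the candidate most similar to the saved element, or None if none beats the threshold."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = SequenceMatcher(None, fingerprint(saved), fingerprint(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best


# The element as it looked when it was first saved (illustrative data).
saved = {"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget $10"}

# The same page after a redesign: the tag and class name have changed.
new_page = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "article", "attrs": {"class": "product-card"}, "text": "Blue Widget $10"},
]

match = relocate(saved, new_page)
print(match)
```

Here the old selector `.product` would no longer match anything, but the similarity comparison still points at the renamed `article` element. A real implementation would presumably weigh many more signals (position in the tree, siblings, parent attributes), which this sketch deliberately leaves out.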

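# Platinum Sponsors

- Scrapling handles Cloudflare Turnstile. For enterprise-grade protection, Hyper Solutions provides API endpoints that generate valid anti-bot tokens for Akamai, DataDome, Kasada, and Incapsula. Simple API calls, no browser automation required.
- Hey, we built BirdProxies because proxies shouldn't be complicated or overpriced. Fast residential and ISP proxies in 195+ locations, fair pricing, and real support. Try the FlappyBird game on our landing page to win free data!
- Evomi: residential proxies from $0.49/GB. Offers a scraping browser with fully spoofed Chromium, residential IPs, automatic CAPTCHA solving, and anti-bot bypass, plus a Scraper API for hassle-free results. MCP and N8N integrations are also available.
- TikHub.io provides 900+ stable APIs across 16+ platforms, including TikTok, X, YouTube, and Instagram, with 40M+ datasets. It also offers discounted AI models (Claude, GPT, GEMINI, and more) at up to 71% off.
- Nsocks provides fast residential and ISP proxies for developers and scrapers, with global IP coverage, high anonymity, smart rotation, and reliable performance for automation and data extraction. Use Xcrawl to simplify large-scale web crawling.
- Close your laptop and your scrapers keep running. PetroSky VPS offers cloud servers built for nonstop automation: Windows and Linux machines with full control, from €6.99/month.
- Read a full review of Scrapling (November 2025) on The Web Scraping Club, the #1 newsletter dedicated to Web Scraping.
- Proxy-Seller provides reliable proxy infrastructure for Web Scraping, offering IPv4, IPv6, ISP, residential, and mobile proxies with stable performance, broad geographic coverage, and flexible plans for enterprise-scale data collection.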

*Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646).*

# Sponsors

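*Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suits you!*

---

## Key Features

### Spiders: A Full Crawling Framework

- 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request` / `Response` objects.
- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
- 🔄 **Multi-Session Support**: A unified interface for HTTP requests and stealthy headless browsers; route requests to different sessions by ID within a single spider.
- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
- 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()`, with real-time stats. Ideal for UIs, pipelines, and long-running crawls.
- 🛡️ **Blocked Request Detection**: Automatically detects and retries blocked requests, with support for custom logic.
- 📦 **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()` respectively.

### Advanced Website Fetching with Session Support

- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can impersonate a browser's TLS fingerprint and headers, and supports HTTP/3.
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, which supports Playwright's Chromium and Google Chrome.
- **Anti-bot Bypass**: `StealthyFetcher` provides advanced stealth capabilities and fingerprint spoofing. It can automatically bypass all types of Cloudflare Turnstile / Interstitial.
- **Session Management**: The `FetcherSession`, `StealthySession`, and `DynamicSession` classes persist cookies and state across requests.
- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies. It works with all session types and supports per-request proxy overrides.
- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async Support**: Complete async support across all fetchers, with dedicated async session classes.

### Adaptive Scraping & AI Integration

- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- 🔍 **Find Similar Elements**: Automatically locate elements similar to ones you have already found.
- 🤖 **MCP Server for Use with AI**: A built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server offers powerful custom capabilities that use Scrapling to extract the targeted content before passing it to the AI (Claude/Cursor/etc.), which speeds up operations and reduces cost by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))

### High-Performance & Battle-Tested Architecture

- 🚀 **Lightning Fast**: Optimized performance that outperforms most Python scraping libraries.
- 🔋 **Memory Efficient**: Optimized data structures and lazy loading keep the memory footprint minimal.
- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
- 🏗️ **Battle Tested**: Scrapling not only has 92% test coverage and full type-hint coverage, it has also been used daily by hundreds of Web Scrapers over the past year.

### Developer / Web Scraper Friendly Experience

- 🎯 **Interactive Web Scraping Shell**: An optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up the development of Web Scraping scripts, such as converting curl requests to Scrapling requests and viewing request results in your browser.
- 🚀 **Use It Directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
- 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation.
- 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
- 📝 **Auto Selector Generation**: Generate robust CSS / XPath selectors for any element.
- 🔌 **Familiar API**: Similar to Scrapy / BeautifulSoup, with the same pseudo-elements used in Scrapy / Parsel.
- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** on every change.
- 🔋 **Ready-to-Use Docker Image**: With each release, a Docker image containing all browsers is automatically built and pushed.

## Getting Started

Here is a quick look at what Scrapling can do, without diving into the details.

### Basic Usage

HTTP requests with session support:

```python
from scrapling.fetchers import Fetcher, FetcherSession

with FetcherSession(impersonate='chrome') as session:  # Use latest version of Chrome's TLS fingerprint
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()

# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
```

Advanced stealth mode:

```python
from scrapling.fetchers import StealthyFetcher, StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
    data = page.css('#padded_content a').getall()

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()
```

Full browser automation:

```python
from scrapling.fetchers import DynamicFetcher, DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector if you prefer it

# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()
```

### Spiders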
Build full crawlers with concurrent requests, multiple session types, and pause/resume:

```python
from scrapling.spiders import Spider, Request, Response


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])


result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```

Use multiple session types in a single spider:

```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            # Route protected pages through the stealth session
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)  # explicit callback
```

Pause and resume long crawls with checkpoints by running the spider like this:

```python
QuotesSpider(crawldir="./crawl_data").start()
```

Press Ctrl+C to pause gracefully; progress is saved automatically. When you start the spider again later with the same `crawldir`, it resumes from where it stopped.

### Advanced Parsing & Navigation

```python
from scrapling.fetchers import Fetcher

# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')

# Get quotes with multiple selection methods
quotes = page.css('.quote')  # CSS selector
quotes = page.xpath('//div[@class="quote"]')  # XPath
quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Find element by text content
quotes = page.find_by_text('quote', tag='div')

# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent

# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()
```

If you don't want to fetch websites, you can use the parser directly, like this:

```python
from scrapling.parser import Selector

page = Selector("...")
```

And it works exactly the same way!

### Async Session Management Examples

```python
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession


async def main():
    # `FetcherSession` is context-aware and can work in both sync/async patterns
    async with FetcherSession(http3=True) as session:
        page1 = await session.get('https://quotes.toscrape.com/')
        page2 = await session.get('https://quotes.toscrape.com/', impersonate='firefox135')

    # Async session usage
    async with AsyncStealthySession(max_pages=2) as session:
        tasks = []
        urls = ['https://example.com/page1', 'https://example.com/page2']

        for url in urls:
            task = session.fetch(url)
            tasks.append(task)

        print(session.get_pool_stats())  # Optional - The status of the browser tabs pool (busy/free/error)
        results = await asyncio.gather(*tasks)
        print(session.get_pool_stats())


# `async with` is only valid inside a coroutine, so run everything through asyncio
asyncio.run(main())
```

## CLI & Interactive Shell

Scrapling includes a powerful command-line interface:

[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)

Launch the interactive Web Scraping shell:

```bash
scrapling shell
```

Extract pages to a file directly, without programming (by default, this extracts the content inside the `body` tag). If the output file ends with `.txt`, the text content of the target is extracted; if it ends with `.md`, the output is a Markdown representation of the HTML content; if it ends with `.html`, the output is the HTML content itself.

```bash
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
```

> [!NOTE]
> There are many additional features, including the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. See the full documentation [here](https://scrapling.readthedocs.io/en/latest/).

## Performance Benchmarks

Scrapling isn't just powerful, it's also remarkably fast. The benchmarks below compare Scrapling's parser with the latest versions of other popular libraries.

### Text Extraction Speed Test (5000 nested elements)

| # | Library           | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling         | 2.02      | 1.0x         |
| 2 | Parsel/Scrapy     | 2.04      | 1.01x        |
| 3 | Raw Lxml          | 2.54      | 1.257x       |
| 4 | PyQuery           | 24.17     | ~12x         |
| 5 | Selectolax        | 82.63     | ~41x         |
| 6 | MechanicalSoup    | 1549.71   | ~767.1x      |
| 7 | BS4 with Lxml     | 1584.31   | ~784.3x      |
| 8 | BS4 with html5lib | 3391.91   | ~1679.1x     |

### Element Similarity & Text Search Performance

Scrapling's adaptive element finding significantly outperforms the alternative:

| Library     | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling   | 2.39      | 1.0x         |
| AutoScraper | 12.45     | 5.209x       |

> All benchmarks are averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for the methodology.

## Installation

Scrapling requires Python 3.10 or higher:

```bash
pip install scrapling
```

This installation includes only the parser engine and its dependencies, without any fetchers or command-line dependencies.

### Optional Dependencies

1. If you plan to use any of the extra features below, the fetchers, or their classes, first install the fetchers' dependencies and their browser dependencies as follows:

    ```bash
    pip install "scrapling[fetchers]"

    scrapling install           # normal install
    scrapling install --force   # force reinstall
    ```

    This downloads all browsers, along with their system dependencies and fingerprint-manipulation dependencies.

    Alternatively, you can install them from code instead of running a command, like this:

    ```python
    from scrapling.cli import install

    install([], standalone_mode=False)           # normal install
    install(["--force"], standalone_mode=False)  # force reinstall
    ```

2. Extra features:
    - Install the MCP server feature:

        ```bash
        pip install "scrapling[ai]"
        ```

    - Install the shell features (the Web Scraping shell and the `extract` command):

        ```bash
        pip install "scrapling[shell]"
        ```

    - Install everything:

        ```bash
        pip install "scrapling[all]"
        ```

    Remember that after installing any of these extras, you need to install the browser dependencies with `scrapling install` (if you haven't already).

### Docker

You can also pull a Docker image with all extras and browsers from DockerHub:

```bash
docker pull pyd4vinci/scrapling
```

Or download it from the GitHub registry:

```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```

This image is built and pushed automatically by GitHub Actions from the repository's main branch.

## Contributing

Contributions are welcome! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.

## Disclaimer

> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and their robots.txt files.

## 🎓 Citations

If you have used our library for research purposes, please cite us with the following reference:

```text
@misc{scrapling,
  author = {Karim Shoair},
  title = {Scrapling},
  year = {2024},
  url = {https://github.com/D4Vinci/Scrapling},
  note = {An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!}
}
```

## License

This work is licensed under the BSD-3-Clause License.

## Acknowledgments

This project includes code adapted from:

- Parsel (BSD License): used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule

---

Designed and crafted with ❤️ by Karim Shoair