
A Guide to Web Scraping with Scrapling in Python

2026-05-20 09:28:04 Author: 枫叶v.

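Scrapling is a Python web scraping framework that supports static pages, dynamic pages, and multi-page crawlers. It combines strengths of requests, BeautifulSoup, Scrapy, and Playwright behind a compact, consistent API, and it suits quick prototypes and simple scraping jobs. This article walks through how to use it.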

Scrapling can fetch static pages and JavaScript-rendered pages, and it can also be used to write multi-page crawlers. Its API style feels like a blend of requests, BeautifulSoup, Scrapy, and Playwright.

Project: D4Vinci/Scrapling
Official documentation: Scrapling Docs

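When to use it

Scrapling is mainly suited to these tasks:

Fetching plain HTML pages
Extracting page content with CSS / XPath selectors
Fetching pages after JavaScript rendering
Writing multi-page crawlers
Handling structured data such as pagination, detail pages, tables, and product lists
Using browser mode for more complex sites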

In short:

Plain pages: Fetcher
Dynamic pages: DynamicFetcher
Pages with heavier protection: StealthyFetcher
Multi-page crawlers: Spider

Installation

Scrapling requires Python 3.10+.

Basic install:

pip install scrapling

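To fetch JavaScript-rendered pages, install the fetchers dependencies and then run scrapling install to set up the browsers they rely on. Depending on the Scrapling version, the base package may ship only the parser, in which case the Fetcher classes used in the demos below need this extra as well: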

pip install "scrapling[fetchers]"
scrapling install

Demo 1: Fetching a plain page

from scrapling.fetchers import Fetcher

page = Fetcher.get("https://quotes.toscrape.com/")

for quote in page.css(".quote"):
    text = quote.css(".text::text").get()
    author = quote.css(".author::text").get()

    print({
        "text": text,
        "author": author,
    })

The selectors here work much like Scrapy's:

page.css("h1::text").get()
page.css("a::attr(href)").getall()
page.xpath("//h1/text()").get()

Demo 2: Following pagination

from scrapling.fetchers import Fetcher

url = "https://quotes.toscrape.com/"

while url:
    page = Fetcher.get(url)

    for quote in page.css(".quote"):
        print({
            "text": quote.css(".text::text").get(),
            "author": quote.css(".author::text").get(),
        })

    next_href = page.css(".next a::attr(href)").get()
    url = page.urljoin(next_href) if next_href else None

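page.urljoin() converts a relative link into an absolute URL. On the last page there is no .next link, so next_href is None and the loop ends.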
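
To see what this resolution does with the kinds of href values that show up in pagination links, here is a minimal sketch using the standard library's urllib.parse.urljoin. It assumes page.urljoin() follows the same standard URL-resolution rules with the page's own URL as the base; the sample hrefs are illustrative.

```python
from urllib.parse import urljoin

# A root-relative href replaces the whole path of the base URL.
next_url = urljoin("https://quotes.toscrape.com/page/1/", "/page/2/")

# A plain relative href is resolved against the base URL's directory.
book_url = urljoin("https://books.toscrape.com/index.html", "catalogue/page-2.html")

# An absolute href is returned unchanged.
external = urljoin("https://quotes.toscrape.com/page/1/", "https://example.com/about")

print(next_url)
print(book_url)
print(external)
```

Storing absolute URLs in the output keeps the data usable outside the page it was scraped from.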

Demo 3: Scraping a product list

from scrapling.fetchers import Fetcher

page = Fetcher.get("https://books.toscrape.com/")

books = []

for item in page.css("article.product_pod"):
    books.append({
        "title": item.css("h3 a::attr(title)").get(),
        "price": item.css(".price_color::text").get(),
        "stock": item.css(".availability::text").getall()[-1].strip(),
        "url": page.urljoin(item.css("h3 a::attr(href)").get()),
    })

print(books)
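
The scraped price values are strings with a currency symbol, so a follow-up step usually converts them before sorting or summing. Below is a minimal sketch, assuming rows shaped like the dictionaries built above; the sample rows and the helper names parse_price and to_csv are hypothetical and not part of Scrapling.

```python
import csv
import io
from decimal import Decimal


def parse_price(raw: str) -> Decimal:
    """Strip whitespace and the pound sign, return an exact decimal value."""
    return Decimal(raw.strip().lstrip("£"))


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows in the same shape as the `books` list.
sample = [
    {"title": "Book A", "price": "£51.77", "stock": "In stock"},
    {"title": "Book B", "price": "£13.99", "stock": "In stock"},
]

# Replace each price string with a Decimal, leaving the other fields as-is.
cleaned = [{**row, "price": parse_price(row["price"])} for row in sample]
total = sum(row["price"] for row in cleaned)
csv_text = to_csv(cleaned)

print(total)
print(csv_text)
```

Decimal is used instead of float so that sums of prices stay exact.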

Demo 4: Finding elements by text and regex

from scrapling.fetchers import Fetcher

page = Fetcher.get("https://books.toscrape.com/index.html")

book = page.find_by_text("Tipping the Velvet")
print(book.text)
print(page.urljoin(book.attrib["href"]))

price = page.find_by_regex(r"£[\d\.]+")
print(price.text)

These APIs are handy for quickly locating text on a page, so you do not always have to hand-write a complex CSS selector.
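
Before passing a pattern to find_by_regex, it can help to try it on plain strings with the standard re module. The sketch below uses made-up sample text. It shows that the character class in the demo's pattern also accepts a trailing period, and offers a stricter alternative as one possible choice.

```python
import re

# Pattern from the demo: a pound sign followed by any run of digits and dots.
loose = re.compile(r"£[\d\.]+")

# Stricter variant: digits, then optionally a dot and one or two decimals.
strict = re.compile(r"£\d+(?:\.\d{1,2})?")

# Made-up text where a price is followed directly by a sentence-ending period.
sample = "Was £60.00. Now £51.77 (in stock)"

loose_matches = loose.findall(sample)
strict_matches = strict.findall(sample)

print(loose_matches)
print(strict_matches)
```

On the demo page the difference does not matter, because each price sits alone in its own element.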

Demo 5: Fetching a JavaScript-rendered page

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://quotes.toscrape.com/js/",
    headless=True,
    network_idle=True,
)

for quote in page.css(".quote"):
    print({
        "text": quote.css(".text::text").get(),
        "author": quote.css(".author::text").get(),
    })

If the page content is rendered by front-end JavaScript, a plain Fetcher may not see it. Use DynamicFetcher in that case.
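
The reason a plain HTTP fetch can come back empty is visible in the raw markup: the server sends the data inside a script, and the elements are built only when that script runs in a browser. Below is a minimal sketch using only the standard library; the HTML string and the ClassCounter helper are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical static HTML as a server might send it: the quotes exist only
# as data inside a <script>, and no quote elements are present yet.
STATIC_HTML = """
<html><body>
  <div id="quotes"></div>
  <script>
    var data = [{"text": "Sample quote", "author": "Someone"}];
    // at runtime the script would build the quote elements from `data`
  </script>
</body></html>
"""


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name: str):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


counter = ClassCounter("quote")
counter.feed(STATIC_HTML)

# No element with class "quote" exists until a browser executes the script.
print(counter.count)
```

A browser-based fetcher runs the script first, so the same selector then finds the rendered elements.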

Demo 6: Waiting for an element to appear

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch(
    "https://quotes.toscrape.com/js-delayed/",
    headless=True,
    wait_selector=".quote",
    wait_selector_state="visible",
)

quotes = page.css(".quote .text::text").getall()
print(quotes)

This approach suits pages that load slowly, where the fetch has to wait for a specific element to appear.

Demo 7: Writing a Spider

For a production crawler, Spider is the recommended choice.

from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(""),
                "author": quote.css(".author::text").get(""),
                "tags": quote.css(".tag::text").getall(),
            }

        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

result = QuotesSpider().start()

print(f"Scraped {len(result.items)} items")
result.items.to_json("quotes.json")

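The structure of a Spider is straightforward:

name: the spider's name
start_urls: the starting pages
parse: parses the response and yields data or new requests
response.follow: follows the link to the next page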
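
The callback model listed above, where parse yields either data or a follow-up request, can be illustrated without any network access. The following is a toy sketch and not Scrapling's scheduler: PAGES, Request, and crawl are hypothetical stand-ins, and the real Spider runs parse asynchronously with concurrency.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterator

# Hypothetical in-memory "site": URL -> (items on the page, next URL or None).
PAGES = {
    "/page/1/": (["quote 1", "quote 2"], "/page/2/"),
    "/page/2/": (["quote 3"], None),
}


@dataclass
class Request:
    url: str
    callback: Callable


def parse(url: str) -> Iterator:
    """Yield one dict per item, then a Request for the next page if any."""
    items, next_url = PAGES[url]
    for text in items:
        yield {"text": text, "source": url}
    if next_url:
        yield Request(next_url, callback=parse)


def crawl(start_url: str) -> list[dict]:
    """Drain a request queue, separating yielded items from new requests."""
    queue = deque([Request(start_url, callback=parse)])
    seen = {start_url}
    collected = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, Request):
                if result.url not in seen:  # skip URLs already scheduled
                    seen.add(result.url)
                    queue.append(result)
            else:
                collected.append(result)
    return collected


items = crawl("/page/1/")
print(len(items))
```

Because parse only yields values and never fetches anything itself, the scheduling policy can change without touching the parsing code.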

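Choosing between Fetcher, DynamicFetcher, and Spider

Scenario	Recommended
Plain HTML pages	`Fetcher`
Cookies / sessions needed	`FetcherSession`
JS-rendered pages	`DynamicFetcher`
Browser interaction needed	`DynamicFetcher + page_action`
Multi-page crawlers	`Spider`
Sites with heavier anti-scraping measures	`StealthyFetcher`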

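Summary

Scrapling's main strength is its consistent API: whether you use a plain request, a dynamic page, or a Spider, the resulting page object can be parsed in much the same way: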

page.css(...)
page.xpath(...)
page.find_by_text(...)
page.find_by_regex(...)

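If you have used requests, BeautifulSoup, Scrapy, or Playwright before, Scrapling should be quick to pick up.

It suits a gradual path from small scripts to a full crawler project: start with Fetcher.get() for a simple demo, then move to Spider for pagination, concurrency, and data export.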

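Notes

When scraping with Scrapling, comply with the target site's terms of service, its robots.txt, and applicable laws and regulations. Prefer data that is public and permitted to access, and limit your request rate to avoid putting load on the target site.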
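
The robots.txt rules and the request rate can both be checked in code before a request goes out. Below is a minimal sketch with the standard library's urllib.robotparser; the robots.txt content, the user agent name, and the polite_delay helper are hypothetical, and in practice the rules would be loaded from the target site's own /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would read the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-scraper"  # hypothetical user agent name


def polite_delay(default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if it sets one."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = polite_delay()

print(allowed, blocked, delay)
```

A scraping loop could call can_fetch before each request and sleep for polite_delay() seconds between requests.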
