crawlee-python

<h1 align="center"> <a href="https://crawlee.dev"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/apify/crawlee-python/master/website/static/img/crawlee-dark.svg?sanitize=true"> <img alt="Crawlee" src="https://yellow-cdn.veclightyear.com/835a84d5/9008a41e-2308-4c1a-a20b-744cec5aec14.svg?sanitize=true" width="500"> </picture> </a> <br> <small>A web scraping and browser automation library</small> </h1> <p align=center> <a href="https://badge.fury.io/py/crawlee" rel="nofollow"> <img src="https://yellow-cdn.veclightyear.com/835a84d5/3e9a7e7b-c331-4852-8f62-0fa8b4b98bb6.svg" alt="PyPI version" style="max-width: 100%;"> </a> <a href="https://pypi.org/project/crawlee/" rel="nofollow"> <img src="https://img.shields.io/pypi/dm/crawlee" alt="PyPI - Downloads" style="max-width: 100%;"> </a> <a href="https://pypi.org/project/crawlee/" rel="nofollow"> <img src="https://img.shields.io/pypi/pyversions/crawlee" alt="PyPI - Python Version" style="max-width: 100%;"> </a> <a href="https://discord.gg/jyEM2PRvMU" rel="nofollow"> <img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord" style="max-width: 100%;"> </a> </p> <h1 align="center"> <a href="https://apify.com/resources/crawlee-for-python-webinar"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://pbs.twimg.com/card_img/1817941805385560065/U9LeYTKD?format=png&name=small"> <img alt="Crawlee" src="https://pbs.twimg.com/card_img/1817941805385560065/U9LeYTKD?format=png&name=small" width="500"> </picture> </a> <br> <small> Learn more about Crawlee for Python from the creators of Crawlee. Join us on August 5 at 9 AM (US Eastern Time). <a href="https://apify.com/resources/crawlee-for-python-webinar">Sign up now!</a></small> </h1>

Crawlee covers your crawling and scraping end to end and helps you build reliable scrapers quickly.

🚀 Crawlee for Python is now open to early adopters!

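Even with the default configuration, your crawlers will appear almost human-like and fly under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, scrape data, and store it persistently in machine-readable formats, without having to worry about the technical details. If the default settings do not meet your needs, rich configuration options let you tweak almost any aspect of Crawlee to suit your project.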

👉 View the full documentation, guides, and examples on the Crawlee project website 👈

We also have a TypeScript implementation of Crawlee, which you can explore and use in your projects. For more information, see Crawlee for JS/TS on GitHub.

## Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

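Crawlee is available as the `crawlee` PyPI package. The core functionality is included in the base package, with additional features available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command: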

```sh
pip install 'crawlee[all]'
```

Then, install the Playwright dependencies:

```sh
playwright install
```

Verify that Crawlee is successfully installed:

```sh
python -c 'import crawlee; print(crawlee.__version__)'
```

For detailed installation instructions, see the Setting up documentation page.
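Because extras are optional, it can be useful to check which pieces are actually present in an environment. The following is a minimal sketch using only the standard library. The helper names `check_crawlee_install` and `crawlee_version` are hypothetical, and the module names `bs4` and `playwright` are assumptions about what the `beautifulsoup` and `playwright` extras pull in.

```python
from __future__ import annotations

from importlib import metadata, util


def check_crawlee_install() -> dict[str, bool]:
    """Report which relevant top-level modules are importable here."""
    # Assumption: the beautifulsoup extra provides `bs4` and the
    # playwright extra provides `playwright`.
    modules = ['crawlee', 'bs4', 'playwright']
    return {name: util.find_spec(name) is not None for name in modules}


def crawlee_version() -> str | None:
    """Return the installed crawlee distribution version, or None if absent."""
    try:
        return metadata.version('crawlee')
    except metadata.PackageNotFoundError:
        return None


if __name__ == '__main__':
    print('crawlee version:', crawlee_version())
    for name, available in check_crawlee_install().items():
        print(f'{name}: {"available" if available else "missing"}')
```

The sketch only checks that the modules can be found; it does not confirm that the Playwright browser binaries were downloaded by `playwright install`.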

### With Crawlee CLI

The quickest way to get started with Crawlee is to use the Crawlee CLI and select one of the prepared templates. First, ensure you have Pipx installed:

```sh
pipx --help
```

Then, run the CLI and choose from the available templates:

```sh
pipx run crawlee create my-crawler
```

If you already have `crawlee` installed, you can run the CLI directly:

```sh
crawlee create my-crawler
```

## Examples

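Here are some practical examples to help you get started with the different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for a specific use case, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run creates a `storage/` directory in your current working directory.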
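Once a crawler has run, the scraped items can be read back from that `storage/` directory. The sketch below assumes the default dataset is written as one JSON file per item under `storage/datasets/default/`, and that bookkeeping files start with `__`. Both the layout and the helper name `load_dataset_items` are assumptions for illustration, so check the storage documentation for the authoritative structure.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_dataset_items(storage_dir: str | Path = 'storage') -> list[Any]:
    """Load every item of the default dataset from a local storage directory."""
    # Assumed layout: <storage_dir>/datasets/default/<sequence>.json
    dataset_dir = Path(storage_dir) / 'datasets' / 'default'
    items: list[Any] = []
    # Sorting by file name keeps items in the order they were written,
    # assuming zero-padded sequence numbers.
    for path in sorted(dataset_dir.glob('*.json')):
        # Assumption: files starting with '__' hold metadata, not items.
        if path.name.startswith('__'):
            continue
        items.append(json.loads(path.read_text(encoding='utf-8')))
    return items


if __name__ == '__main__':
    for item in load_dataset_items():
        print(item)
```

If the directory does not exist yet, the function simply returns an empty list.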

### BeautifulSoupCrawler

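The BeautifulSoupCrawler downloads web pages using an HTTP library and provides the parsed HTML content to the user. It uses HTTPX for HTTP communication and BeautifulSoup for parsing HTML. It is a good fit for projects that need efficient extraction of data from HTML content. Because it does not use a browser, this crawler performs very well. However, if you need to execute client-side JavaScript to get your content, this approach is not enough and you will need to use the PlaywrightCrawler. Also, if you want to use this crawler, make sure you install `crawlee` with the `beautifulsoup` extra.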

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it to crawl all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
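To make the handler above more concrete, the following standard-library sketch shows roughly what extracting a title and discovering links from downloaded HTML involves. It is not how Crawlee implements `context.soup` or `enqueue_links()`. The class `TitleAndLinkParser` and the function `extract` are hypothetical names, and the sketch skips fetching, deduplication, and any link filtering.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and absolute link targets from an HTML string."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.title: str | None = None
        self.links: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag: str) -> None:
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data: str) -> None:
        if self._in_title:
            self.title = (self.title or '') + data


def extract(html: str, base_url: str) -> dict:
    """Return the same kind of record the handler pushes, plus discovered links."""
    parser = TitleAndLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return {'url': base_url, 'title': parser.title, 'links': parser.links}
```

A crawler would then push the `url` and `title` fields to a dataset and add the discovered links to its request queue, stopping once the request limit is reached.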

### PlaywrightCrawler

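The PlaywrightCrawler uses a headless browser to download web pages and provides an API for data extraction. It is built on Playwright, an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript to generate content, and at tasks that require interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the BeautifulSoupCrawler. Also, if you want to use this crawler, make sure you install `crawlee` with the `playwright` extra.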

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it to crawl all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())