awesome-crawler

awesome-crawler

多语言网络爬虫框架和工具大全

awesome-crawler项目汇集了各种编程语言的网络爬虫框架和工具。涵盖Python、Java、C#、JavaScript、PHP等主流语言,包含Scrapy、Nutch、crawler4j等知名框架,以及HTML解析、分布式爬取等功能库。项目为网络数据采集开发者提供了全面的技术参考,有助于选择合适的爬虫解决方案。

网络爬虫Web CrawlerScrapy开源项目多语言支持Github

Awesome-crawler Awesome

A collection of awesome web crawler,spider and resources in different languages.

Contents

Python

  • Scrapy - A fast high-level screen scraping and web crawling framework.
  • pyspider - A powerful spider system.
  • CoCrawler - A versatile web crawler built using modern tools and concurrency.
  • cola - A distributed crawling framework.
  • Demiurge - PyQuery-based scraping micro-framework.
  • Scrapely - A pure-python HTML screen-scraping library.
  • feedparser - Universal feed parser.
  • you-get - Dumb downloader that scrapes the web.
  • MechanicalSoup - A Python library for automating interaction with websites.
  • portia - Visual scraping for Scrapy.
  • crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
  • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
  • MSpider - A simple ,easy spider using gevent and js render.
  • brownant - A lightweight web data extracting framework.
  • PSpider - A simple spider frame in Python3.
  • Gain - Web crawling framework based on asyncio for everyone.
  • sukhoi - Minimalist and powerful Web Crawler.
  • spidy - The simple, easy to use command line web crawler.
  • newspaper - News, full-text, and article metadata extraction in Python 3
  • aspider - An async web scraping micro-framework based on asyncio.

Java

  • ACHE Crawler - An easy to use web crawler for domain-specific search.
  • Apache Nutch - Highly extensible, highly scalable web crawler for production environment.
    • anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
  • Crawler4j - Simple and lightweight web crawler.
  • JSoup - Scrapes, parses, manipulates and cleans HTML.
  • websphinx - Website-Specific Processors for HTML information extraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • Gecco - A easy to use lightweight web crawler
  • WebCollector - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
  • Webmagic - A scalable crawler framework.
  • Spiderman - A scalable ,extensible, multi-threaded web crawler.
    • Spiderman2 - A distributed web crawler framework,support js render.
  • Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
  • SeimiCrawler - An agile, distributed crawler framework.
  • StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
  • Spark-Crawler - Evolving Apache Nutch to run on Spark.
  • webBee - A DFS web spider.
  • spider-flow - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
  • Norconex Web Crawler - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.

C#

  • ccrawler - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can separate between the web page depending on their content.
  • SimpleCrawler - Simple spider base on mutithreading, regluar expression.
  • DotnetSpider - This is a cross platfrom, ligth spider develop by C#.
  • Abot - C# web crawler built for speed and flexibility.
  • Hawk - Advanced Crawler and ETL tool written in C#/WPF.
  • SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
  • Infinity Crawler - A simple but powerful web crawler library in C#.

JavaScript

  • scraperjs - A complete and versatile web scraper.
  • scrape-it - A Node.js scraper for humans.
  • simplecrawler - Event driven web crawler.
  • node-crawler - Node-crawler has clean,simple api.
  • js-crawler - Web crawler for Node.JS, both HTTP and HTTPS are supported.
  • webster - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
  • x-ray - Web scraper with pagination and crawler support.
  • node-osmosis - HTML/XML parser and web scraper for Node.js.
  • web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension.
  • supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
  • headless-chrome-crawler - Headless Chrome crawls with jQuery support
  • Squidwarc - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
  • crawlee - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

PHP

  • Goutte - A screen scraping and web crawling library for PHP.
  • dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • QueryList - The progressive PHP crawler framework.
  • pspider - Parallel web crawler written in PHP.
  • php-spider - A configurable and extensible PHP web spider.
  • spatie/crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
  • crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.
  • PHPScraper - PHPScraper is a scraper & crawler built for simplicity.

C++

C

  • httrack - Copy websites to your computer.

Ruby

  • Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
  • upton - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • RubyRetriever - RubyRetriever is a Web Crawler, Scraper & File Harvester.
  • Spidr - Spider a site, multiple domains, certain links or infinitely.
  • Cobweb - Web crawler with very flexible crawling options, standalone or using sidekiq.
  • mechanize - Automated web interaction & crawling.

Rust

  • spider - The fastest web crawler and indexer.
  • crawler - A gRPC web indexer turbo charged for performance.

R

  • rvest - Simple web scraping for R.

Erlang

  • ebot - A scalable, distribuited and highly configurable web cawler.

Perl

  • web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

  • pholcus - A distributed, high concurrency and powerful web crawler.
  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
  • go_spider - An awesome Go concurrent Crawler(spider) framework.
  • dht - BitTorrent DHT Protocol && DHT Spider.
  • ants-go - A open source, distributed, restful crawler engine in golang.
  • scrape - A simple, higher level interface for Go web scraping.
  • creeper - The Next Generation Crawler Framework (Go).
  • colly - Fast and Elegant Scraping Framework for Gophers.
  • ferret - Declarative web scraping.
  • Dataflow kit - Extract structured data from web pages. Web sites scraping.
  • Hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

Scala

  • crawler - Scala DSL for web crawling.
  • scrala - Scala crawler(spider) framework, inspired by scrapy.
  • ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and

编辑推荐精选

扣子-AI办公

扣子-AI办公

职场AI,就用扣子

AI办公助手,复杂任务高效处理。办公效率低?扣子空间AI助手支持播客生成、PPT制作、网页开发及报告写作,覆盖科研、商业、舆情等领域的专家Agent 7x24小时响应,生活工作无缝切换,提升50%效率!

堆友

堆友

多风格AI绘画神器

堆友平台由阿里巴巴设计团队创建,作为一款AI驱动的设计工具,专为设计师提供一站式增长服务。功能覆盖海量3D素材、AI绘画、实时渲染以及专业抠图,显著提升设计品质和效率。平台不仅提供工具,还是一个促进创意交流和个人发展的空间,界面友好,适合所有级别的设计师和创意工作者。

图像生成热门AI工具AI图像AI反应堆AI工具箱AI绘画GOAI艺术字堆友相机
码上飞

码上飞

零代码AI应用开发平台

零代码AI应用开发平台,用户只需一句话简单描述需求,AI能自动生成小程序、APP或H5网页应用,无需编写代码。

Vora

Vora

免费创建高清无水印Sora视频

Vora是一个免费创建高清无水印Sora视频的AI工具

Refly.AI

Refly.AI

最适合小白的AI自动化工作流平台

无需编码,轻松生成可复用、可变现的AI自动化工作流

酷表ChatExcel

酷表ChatExcel

大模型驱动的Excel数据处理工具

基于大模型交互的表格处理系统,允许用户通过对话方式完成数据整理和可视化分析。系统采用机器学习算法解析用户指令,自动执行排序、公式计算和数据透视等操作,支持多种文件格式导入导出。数据处理响应速度保持在0.8秒以内,支持超过100万行数据的即时分析。

AI工具使用教程AI营销产品酷表ChatExcelAI智能客服
TRAE编程

TRAE编程

AI辅助编程,代码自动修复

Trae是一种自适应的集成开发环境(IDE),通过自动化和多元协作改变开发流程。利用Trae,团队能够更快速、精确地编写和部署代码,从而提高编程效率和项目交付速度。Trae具备上下文感知和代码自动完成功能,是提升开发效率的理想工具。

热门AI工具生产力协作转型TraeAI IDE
AIWritePaper论文写作

AIWritePaper论文写作

AI论文写作指导平台

AIWritePaper论文写作是一站式AI论文写作辅助工具,简化了选题、文献检索至论文撰写的整个过程。通过简单设定,平台可快速生成高质量论文大纲和全文,配合图表、参考文献等一应俱全,同时提供开题报告和答辩PPT等增值服务,保障数据安全,有效提升写作效率和论文质量。

数据安全AI助手热门AI工具AI辅助写作AI论文工具论文写作智能生成大纲
博思AIPPT

博思AIPPT

AI一键生成PPT,就用博思AIPPT!

博思AIPPT,新一代的AI生成PPT平台,支持智能生成PPT、AI美化PPT、文本&链接生成PPT、导入Word/PDF/Markdown文档生成PPT等,内置海量精美PPT模板,涵盖商务、教育、科技等不同风格,同时针对每个页面提供多种版式,一键自适应切换,完美适配各种办公场景。

热门AI工具AI办公办公工具智能排版AI生成PPT博思AIPPT海量精品模板AI创作
潮际好麦

潮际好麦

AI赋能电商视觉革命,一站式智能商拍平台

潮际好麦深耕服装行业,是国内AI试衣效果最好的软件。使用先进AIGC能力为电商卖家批量提供优质的、低成本的商拍图。合作品牌有Shein、Lazada、安踏、百丽等65个国内外头部品牌,以及国内10万+淘宝、天猫、京东等主流平台的品牌商家,为卖家节省将近85%的出图成本,提升约3倍出图效率,让品牌能够快速上架。

下拉加载更多