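This is lesson 9 of the Python web crawler series: synchronous and asynchronous (coroutine) implementations of a multi-task crawler.

This lesson matters because the concurrency model you pick largely determines how fast your crawler can run.

🕷️ Python Web Crawler (9): Multi-tasking with Synchronous and Asynchronous (Coroutine) Code

1. Review of Basic Concepts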

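Type	Characteristics	Typical modules	Pros	Cons
Synchronous (single thread)	Tasks run one after another	`requests`	Simple and easy to follow	Slow
Multithreading	Several threads run tasks concurrently	`threading`	Better concurrency for I/O	The GIL prevents parallel CPU work
Multiprocessing	Several processes run in parallel	`multiprocessing`	Can use multiple CPU cores	High memory overhead
Async I/O (coroutines)	I/O tasks run concurrently inside one thread	`asyncio`, `aiohttp`	Very high concurrency, lightweight	Harder to program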

2. Synchronous Crawler (Example)

import requests
import time

urls = [
    'https://example.com',
    'https://www.python.org',
    'https://www.wikipedia.org'
]

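def fetch(url):
    print(f"Downloading: {url}")
    # A timeout keeps one stalled server from hanging the whole crawl.
    response = requests.get(url, timeout=10)
    # response.content is bytes; len(response.text) would count characters.
    print(f"Done: {url} -> {len(response.content)} bytes")

start = time.time()

for url in urls:
    fetch(url)

print(f"Synchronous crawl took {time.time() - start:.2f} s")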

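Each download has to wait for the previous one to finish, so the total time is the sum of all the individual request times.

3. Multithreaded Crawler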

import requests
import threading
import time

urls = [
    'https://example.com',
    'https://www.python.org',
    'https://www.wikipedia.org'
]

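def fetch(url):
    print(f"Starting download: {url}")
    response = requests.get(url, timeout=10)
    # response.content is bytes; len(response.text) would count characters.
    print(f"Done: {url} -> {len(response.content)} bytes")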

start = time.time()

threads = []
for url in urls:
    t = threading.Thread(target=fetch, args=(url,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

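print(f"Multithreaded crawl took {time.time() - start:.2f} s")

✅ Multithreading noticeably speeds up network-I/O-bound tasks, because a thread releases the GIL while it waits on the network.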
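Starting one thread per URL works for three URLs but does not scale to thousands. A thread pool is a common alternative. The sketch below is only an illustration: `fake_fetch`, the page names, and the 0.3 s latency are made up, and `time.sleep` stands in for `requests.get` so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page names, used only for this illustration.
PAGES = ["page-a", "page-b", "page-c", "page-d"]
FAKE_LATENCY = 0.3  # assumed seconds per simulated download


def fake_fetch(name):
    """Stand-in for requests.get(): block for the simulated latency."""
    time.sleep(FAKE_LATENCY)
    return f"{name}: done"


def crawl_with_pool(names, max_workers=2):
    # The pool never runs more than max_workers downloads at once,
    # and map() returns results in the same order as the input.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, names))


start = time.perf_counter()
results = crawl_with_pool(PAGES, max_workers=2)
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f} s")  # roughly two batches of 0.3 s, not four
```

With `max_workers=2` the four simulated downloads run in two batches, so the run takes about half as long as a sequential loop would.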

4. Asynchronous Coroutine Crawler (asyncio + aiohttp)

Asynchronous I/O is at the core of most modern high-concurrency crawlers.

import aiohttp
import asyncio
import time

urls = [
    'https://example.com',
    'https://www.python.org',
    'https://www.wikipedia.org'
]

async def fetch(session, url):
    print(f"Starting download: {url}")
    async with session.get(url) as resp:
        text = await resp.text()
        # text is a decoded str, so len() counts characters, not bytes.
        print(f"Done: {url} -> {len(text)} characters")

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

start = time.time()
asyncio.run(main())
print(f"Async crawl took {time.time() - start:.2f} s")

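✅ Key points:

All requests are issued without waiting for earlier ones to finish;
while one task waits on I/O, the event loop runs other tasks, so the single thread is not blocked;
it is usually lighter than multithreading, and often faster once there are thousands of requests.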
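The effect of these points can be checked without touching the network. The sketch below is an illustration under assumptions: `fake_fetch` and the 0.3 s latency are invented, and `asyncio.sleep` stands in for the time a real request spends waiting on a socket.

```python
import asyncio
import time


async def fake_fetch(name, latency):
    # asyncio.sleep() yields to the event loop, like waiting on a socket.
    await asyncio.sleep(latency)
    return name


async def crawl():
    jobs = [("a", 0.3), ("b", 0.3), ("c", 0.3)]  # hypothetical pages and latencies
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fake_fetch(n, d) for n, d in jobs))


start = time.perf_counter()
order = asyncio.run(crawl())
elapsed = time.perf_counter() - start

print(order)
print(f"elapsed: {elapsed:.2f} s")  # close to the longest wait, not the sum
```

Three 0.3 s waits finish in roughly 0.3 s overall, because the waits overlap inside one thread.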

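5. Suggested Architecture for a Hybrid Crawler

Layer	Technology	Role
Scheduling	`asyncio`	Controls how many coroutine tasks run
Download	`aiohttp`	Asynchronous network requests
Parsing	`BeautifulSoup` / `lxml`	Parses HTML
Storage	`asyncpg` / `aiomysql`	Asynchronous database inserts
Control	`asyncio.Queue`	Task queue management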
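One way these layers might fit together is a fixed set of workers pulling URLs from an `asyncio.Queue`. This is a minimal sketch, not a full architecture: `download` and `parse` are hypothetical stand-ins for `aiohttp` and `BeautifulSoup`, and a plain list plays the role of the database.

```python
import asyncio


async def download(url):
    # Hypothetical stand-in for an aiohttp request.
    await asyncio.sleep(0.05)
    return f"<html><title>{url}</title></html>"


def parse(html):
    # Stand-in for BeautifulSoup / lxml: pull out the <title> text.
    return html.split("<title>")[1].split("</title>")[0]


async def worker(queue, store):
    while True:
        url = await queue.get()             # control layer: take the next task
        try:
            html = await download(url)      # download layer
            store.append(parse(html))       # parsing and storage layers
        finally:
            queue.task_done()


async def run_pipeline(urls, n_workers=3):
    queue = asyncio.Queue()
    store = []
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, store)) for _ in range(n_workers)]
    await queue.join()                      # wait until every URL is processed
    for w in workers:
        w.cancel()                          # workers loop forever, so stop them
    await asyncio.gather(*workers, return_exceptions=True)
    return store


titles = asyncio.run(run_pipeline([f"page{i}" for i in range(6)]))
print(sorted(titles))
```

The number of workers caps concurrency here, so this design needs no separate semaphore.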

6. Concurrency Control Example (Limiting Simultaneous Connections)

import aiohttp
import asyncio
import time

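urls = [f"https://example.com/page{i}" for i in range(20)]

async def fetch(session, semaphore, url):
    async with semaphore:
        print(f"Downloading: {url}")
        async with session.get(url) as resp:
            await asyncio.sleep(0.2)  # simulated processing time
            print(f"Done: {url}, status={resp.status}")

async def main():
    # Create the semaphore inside the running event loop. Before Python 3.10,
    # a module-level Semaphore could bind to a different loop than the one
    # asyncio.run() creates.
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent downloads
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        await asyncio.gather(*tasks)

start = time.time()
asyncio.run(main())
print(f"Async crawl took {time.time() - start:.2f} s")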

⚙️ asyncio.Semaphore caps the number of downloads in flight at once, which lowers the risk of being blocked by the site.
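To see that the semaphore really enforces the cap, the sketch below counts how many jobs are inside the `async with` block at the same time. The helper names (`limited_job`, `measure_peak`) and the 0.02 s simulated download are assumptions made for this illustration.

```python
import asyncio


async def limited_job(sem, state):
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.02)   # simulated download
        state["active"] -= 1


async def measure_peak(n_jobs=20, limit=5):
    sem = asyncio.Semaphore(limit)  # created inside the running loop
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(limited_job(sem, state) for _ in range(n_jobs)))
    return state["peak"]


peak = asyncio.run(measure_peak())
print(peak)  # never exceeds the limit of 5
```

For connection-level limits, `aiohttp.TCPConnector` also accepts a `limit` argument, which can be used alongside or instead of a semaphore.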

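7. Summary Comparison

Mode	Concurrency	CPU usage	Programming complexity	Suitable for
Synchronous	❌	Low	⭐	Small-scale crawling
Multithreading	✅	Medium	⭐⭐	Medium-scale crawling
Async (coroutines)	🚀🚀	Low	⭐⭐⭐⭐	Large-scale, high-concurrency crawlers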

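8. Practical Advice

Bulk downloads of web pages or API data → prefer aiohttp + asyncio
Heavy computation or slow parsing → combine asyncio with ThreadPoolExecutor
Sites with strong anti-crawling measures → lower the concurrency and add random delays
Writing to a database or to files → use asynchronous I/O interfaces (for example aiofiles)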
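Two of these suggestions, random delays and offloading slow parsing to an executor, can be sketched together. Everything named here (`polite_fetch`, `slow_parse`, the delay bounds) is hypothetical, and the fetch is simulated so the example runs offline.

```python
import asyncio
import random
from concurrent.futures import ThreadPoolExecutor


def slow_parse(html):
    # Hypothetical blocking parser; counting words stands in for real work.
    return len(html.split())


async def polite_fetch(url, min_delay=0.01, max_delay=0.05):
    # Random pause before each request so traffic is less uniform.
    await asyncio.sleep(random.uniform(min_delay, max_delay))
    return f"content of {url}"      # stand-in for the real response body


async def fetch_and_parse(url, pool):
    html = await polite_fetch(url)
    loop = asyncio.get_running_loop()
    # Run the blocking parser in the pool so the event loop stays responsive.
    return await loop.run_in_executor(pool, slow_parse, html)


async def crawl(urls):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(*(fetch_and_parse(u, pool) for u in urls))


counts = asyncio.run(crawl(["a", "b", "c"]))
print(counts)  # [3, 3, 3]
```

A thread pool keeps the event loop free, but pure-Python CPU work is still serialized by the GIL. For genuinely CPU-heavy parsing, a `ProcessPoolExecutor` passed to `run_in_executor` may be a better fit.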

lichongyang