Forum Data Crawler: Scraping Forum Posts and User Behavior Data for Analysis

lichongyang

5 months ago

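Below is a complete tutorial, "Forum Data Crawler: End-to-End Collection and Analysis from Posts to User Behavior Data", built on Python (Requests + BeautifulSoup / Selenium) with Pandas and visualization for the analysis side. It covers the overall architecture, handling anti-scraping measures, database modeling, and the analysis workflow. The code can serve as the starting point for a real project once the URLs and CSS selectors are adapted to the target forum.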

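✅ Forum Data Crawler: Scraping Posts + User Behavior Data (Detailed Tutorial)

Applies to: Discuz / phpBB / Baidu Tieba / simple forum systems
Tech stack: Python, Requests, BeautifulSoup, lxml, Selenium (optional), Pandas, MySQL/MongoDB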

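🔥 I. Project Goals

By the end you will be able to do the following:

✔ 1. Scrape the list of forum boards

Board name
Board URL
Number of threads
Time of the latest post

✔ 2. Scrape thread lists (paginated)

Thread title
Thread URL
Author
Reply count / view count
Publication time

✔ 3. Scrape thread details (multi-page)

Content of each reply
Replying user
Reply time
Like count (if available)

✔ 4. User behavior data

User profile
Number of threads posted
Number of replies
Active hours
User behavior paths (only if the forum exposes them)

✔ 5. Data storage

MySQL: structured data
MongoDB: semi-structured text content

✔ 6. Analysis

User activity
Most active users
Thread popularity
Keyword extraction (NLP)
Sentiment analysis (positive/negative)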

🚀 II. Overall Project Architecture

crawler/
 ├── main.py
 ├── config.py
 ├── forum_spider.py
 ├── detail_spider.py
 ├── user_spider.py
 ├── utils/
 │    ├── headers.py
 │    ├── proxy.py
 │    ├── db.py
 │    └── logger.py
 └── data/
      ├── posts.csv
      ├── replies.csv
      └── users.csv

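🔧 III. Preparation

1. Install dependencies

pip install requests beautifulsoup4 lxml selenium pandas pymysql

2. Optional: use ChromeDriver

Use it for forums that render content dynamically (for example JavaScript-driven pagination or a CAPTCHA page):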

from selenium import webdriver
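
As a minimal sketch of how Selenium could stand in for `requests.get` on JavaScript-heavy pages, the helpers below start a headless Chrome session and return the rendered HTML. The function names `make_driver` and `fetch_rendered_html` are hypothetical, and the sketch assumes Chrome and a matching driver are available on the machine.

```python
def make_driver():
    """Start a headless Chrome session.

    Selenium is imported lazily so that fetch_rendered_html() below can be
    used with any object that offers get() and page_source.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a browser window
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url):
    """Load the URL in the browser and return the current DOM as HTML."""
    driver.get(url)  # blocks until the page's load event fires
    return driver.page_source


# Typical usage (needs a local Chrome installation):
#   driver = make_driver()
#   try:
#       html = fetch_rendered_html(driver, "https://example-forum.com")
#   finally:
#       driver.quit()
```

Content that the page inserts after its load event may still be missing from `page_source`; in that case an explicit wait on a known element is usually needed before reading the HTML.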

🕸 IV. Scraping Workflow (with Code Examples)

Step 1: Scrape the board list

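import requests
from bs4 import BeautifulSoup

# Placeholder domain; replace it with the forum you are scraping.
base_url = "https://example-forum.com"

def get_forum_sections():
    """Return the name and URL of every board linked from the home page."""
    resp = requests.get(base_url, timeout=10)
    resp.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "lxml")

    sections = []
    # ".forum-section" is a placeholder selector for the board links (<a> tags).
    for sec in soup.select(".forum-section"):
        sections.append({
            "name": sec.text.strip(),
            "url": base_url + sec.get("href")
        })
    return sections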
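
Plain concatenation of `base_url` and `href` only works when every `href` is a root-relative path. A hedged alternative is `urllib.parse.urljoin` from the standard library, which also copes with absolute and page-relative links. The helper name `absolute_url` and the sample paths are illustrative assumptions, not part of any forum's API.

```python
from urllib.parse import urljoin

BASE_URL = "https://example-forum.com"  # placeholder domain from the tutorial


def absolute_url(href, base=BASE_URL):
    """Resolve an href found in the page against the forum's base URL."""
    return urljoin(base, href)


# Root-relative, page-relative and already-absolute links all resolve cleanly.
print(absolute_url("/forum-2-1.html"))
print(absolute_url("forum.php?fid=2"))
print(absolute_url("https://cdn.example.org/a"))
```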

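Step 2: Scrape thread lists (with pagination)

def get_thread_list(section_url):
    """Collect every thread in a board, following pages until one is empty."""
    threads = []

    page = 1
    while True:
        # Assumes section_url already has a query string (e.g. "?fid=2").
        url = f"{section_url}&page={page}"
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "lxml")

        # Placeholder selectors; adapt them to the forum's markup.
        items = soup.select(".thread-item")
        if not items:
            break

        for item in items:
            threads.append({
                "title": item.select_one(".title").text.strip(),
                "url": base_url + item.select_one(".title a").get("href"),
                "author": item.select_one(".author").text.strip(),
                "reply_count": item.select_one(".reply").text.strip()
            })

        page += 1

    return threads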
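
The loop above appends the page number as an extra query parameter, which only produces a valid URL when the board URL already contains a `?`. A more defensive sketch builds the URL with `urllib.parse`, so it works with or without an existing query string. The helper name `with_page` and the parameter name `page` are assumptions; some forums encode the page in the path instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="page"):
    """Return url with its page query parameter set to the given page number."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous page value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page("https://example-forum.com/forum.php?fid=2", 3))
print(with_page("https://example-forum.com/forum", 1))
```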

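Step 3: Scrape thread details (multi-page replies)

def get_thread_detail(thread_url):
    """Collect all replies of a thread, following pages until one is empty."""
    page = 1
    replies = []

    while True:
        # Assumes thread_url already has a query string (e.g. "?tid=10").
        url = f"{thread_url}&page={page}"
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "lxml")

        # Placeholder selectors; adapt them to the forum's markup.
        reply_items = soup.select(".reply-item")
        if not reply_items:
            break

        for r in reply_items:
            replies.append({
                "user": r.select_one(".username").text.strip(),
                "content": r.select_one(".reply-content").text.strip(),
                "time": r.select_one(".reply-time").text.strip()
            })

        page += 1

    return replies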
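
Both pagination loops stop only when a page comes back empty. If a forum serves its last page again for out-of-range page numbers, that condition is never met and the loop runs forever. The sketch below adds two guards: a page cap and a check for repeated content. `crawl_pages`, `fetch_page`, and the default of 500 pages are hypothetical names and values chosen for illustration.

```python
def crawl_pages(fetch_page, max_pages=500):
    """Collect items page by page.

    fetch_page(page) must return the list of items parsed from that page.
    Crawling stops on an empty page, on a page identical to the previous
    one, or after max_pages pages.
    """
    collected = []
    previous = None
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items or items == previous:
            break
        collected.extend(items)
        previous = items
    return collected


# Simulated forum that keeps returning its last page for any page >= 2.
pages = {1: ["reply-1", "reply-2"], 2: ["reply-3"]}
print(crawl_pages(lambda p: pages.get(p, ["reply-3"])))
```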

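Step 4: Scrape user data

def get_user_info(user_url):
    """Read basic profile fields from a user's profile page."""
    html = requests.get(user_url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")

    # Placeholder selectors; the values are raw text and still need parsing.
    return {
        "name": soup.select_one(".username").text.strip(),
        "posts": soup.select_one(".post-count").text.strip(),
        "replies": soup.select_one(".reply-count").text.strip(),
        "register_time": soup.select_one(".register-time").text.strip()
    }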
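
The count fields scraped above are raw display text, so they need to be converted before they can be stored in integer columns or sorted numerically. The sketch below is one possible normalizer; `parse_count` is a hypothetical helper, and the supported formats (thousands separators, a text label around the number, the Chinese unit 万 for ten thousand) are assumptions about what a forum might display.

```python
import re


def parse_count(text):
    """Convert display text such as '1,234', 'Posts: 56' or '1.2万' to an int.

    Returns None when the text is missing or contains no number.
    """
    if text is None:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(万)?", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    if match.group(2):  # 万 means "ten thousand"
        number *= 10_000
    return int(round(number))


print(parse_count("1,234"), parse_count("Posts: 56"), parse_count("1.2万"))
```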

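🗄 V. Database Schema (MySQL)

user

Field	Type	Description
id	int	User ID
username	varchar	Display name
posts	int	Number of threads posted
replies	int	Number of replies

post

Field	Type	Description
id	int	Thread ID
title	varchar	Title
author	varchar	Author
reply_count	int	Number of replies

reply

Field	Type	Description
id	int	Reply ID
post_id	int	Parent thread
user	varchar	Username
content	text	Content
time	datetime	Timestamp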
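
To make the three tables concrete, here is a minimal sketch that creates them and stores one reply. It deliberately uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a database server; with `pymysql` the connection code and the placeholder style (`%s` instead of `?`) would differ. Column lengths, defaults, and the helper names `init_db` and `save_reply` are assumptions.

```python
import sqlite3

# DDL mirroring the user / post / reply tables described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS user (
    id        INTEGER PRIMARY KEY,
    username  VARCHAR(64) NOT NULL,
    posts     INTEGER DEFAULT 0,
    replies   INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS post (
    id           INTEGER PRIMARY KEY,
    title        VARCHAR(255) NOT NULL,
    author       VARCHAR(64),
    reply_count  INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS reply (
    id       INTEGER PRIMARY KEY,
    post_id  INTEGER NOT NULL REFERENCES post(id),
    user     VARCHAR(64),
    content  TEXT,
    time     DATETIME
);
"""


def init_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_reply(conn, post_id, reply):
    """Insert one reply dict with the keys user, content and time."""
    conn.execute(
        "INSERT INTO reply (post_id, user, content, time) VALUES (?, ?, ?, ?)",
        (post_id, reply["user"], reply["content"], reply["time"]),
    )
    conn.commit()
```

Parameterized queries are used instead of string formatting so that scraped text containing quotes cannot break or alter the SQL statement.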

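📊 VI. Data Analysis (Pandas)

1. User activity

import pandas as pd
users = pd.read_csv("data/users.csv")  # path per the project layout above
# Scraped counts may be text; convert so the sort is numeric.
users["posts"] = pd.to_numeric(users["posts"], errors="coerce")
print(users.sort_values("posts", ascending=False).head())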
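
When profile pages do not show per-user totals, activity can also be derived from the reply records themselves. The dependency-free sketch below counts replies per user and per hour of day, which roughly corresponds to a `groupby` in Pandas. `activity_summary` is a hypothetical helper, and the timestamp format is an assumption that has to match what the forum displays.

```python
from collections import Counter
from datetime import datetime


def activity_summary(replies, time_format="%Y-%m-%d %H:%M:%S"):
    """Return (replies per user, replies per hour of day) as two Counters."""
    per_user = Counter(r["user"] for r in replies)
    per_hour = Counter(
        datetime.strptime(r["time"], time_format).hour for r in replies
    )
    return per_user, per_hour


sample = [
    {"user": "alice", "time": "2024-03-01 09:15:00"},
    {"user": "bob", "time": "2024-03-01 09:40:00"},
    {"user": "alice", "time": "2024-03-01 21:05:00"},
]
per_user, per_hour = activity_summary(sample)
print(per_user.most_common(1))  # most active user
print(per_hour.most_common(1))  # busiest hour of the day
```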

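2. Thread popularity analysis

posts = pd.read_csv("data/posts.csv")
# Placeholder score: with a single factor the ranking equals sorting by reply_count.
posts["heat"] = pd.to_numeric(posts["reply_count"], errors="coerce") * 2
print(posts.sort_values("heat", ascending=False).head())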

3. NLP sentiment analysis

(You can plug in the sentiment model you are already building in your project.)
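
Until a trained model is wired in, a lexicon-based score can serve as a rough baseline for checking the pipeline end to end. The sketch below is only that: the word lists are tiny hypothetical examples, the tokenizer handles English text only, and Chinese content would first need a word segmenter.

```python
import re

# Tiny illustrative lexicons; a real project would use a curated resource.
POSITIVE = {"good", "great", "helpful", "thanks", "love"}
NEGATIVE = {"bad", "broken", "useless", "hate", "spam"}


def sentiment_score(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / all hits."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


print(sentiment_score("Great guide, thanks!"))
print(sentiment_score("This is broken and useless"))
```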

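🛡 VII. Anti-Scraping Measures and Countermeasures

Anti-scraping measure	Countermeasure
User-Agent filtering	Send a realistic browser User-Agent
Cookie validation	Reuse one session so cookies persist
Rate limiting	time.sleep / random delays
Login required	Selenium + automated login
IP restrictions	Proxy pool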
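
The random-delay row of the table can be sketched as a small wrapper that pauses before every request and retries with a growing delay when a request fails. `polite_fetch` and its default delays are hypothetical; the wrapper takes the actual download function as an argument, so it works with `requests`, a `requests.Session`, or Selenium alike.

```python
import random
import time


def polite_fetch(fetch, url, retries=3, min_delay=1.0, max_delay=3.0,
                 sleep=time.sleep):
    """Call fetch(url) with a random pause before each attempt.

    The pause doubles on every retry. The last error is re-raised once all
    attempts have failed.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        sleep(random.uniform(min_delay, max_delay) * (2 ** attempt))
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: network errors vary by client
            last_error = exc
    raise last_error


# Example with a requests session that keeps cookies and a fixed User-Agent:
#   session = requests.Session()
#   session.headers.update({"User-Agent": "Mozilla/5.0 ..."})
#   html = polite_fetch(lambda u: session.get(u, timeout=10).text, url)
```

Passing `sleep` as a parameter keeps the function easy to test, because a test can record the delays instead of actually waiting.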

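🎁 VIII. Ready-to-Run Main Program (Integrated Version)

# Assumes all three functions live in forum_spider.py; adjust the import
# if get_thread_detail is placed in detail_spider.py instead.
from forum_spider import get_forum_sections, get_thread_list, get_thread_detail

if __name__ == "__main__":
    sections = get_forum_sections()

    # Walk every board, then every thread, and fetch all of its replies.
    for sec in sections:
        threads = get_thread_list(sec["url"])
        for t in threads:
            replies = get_thread_detail(t["url"])
            print("Fetched thread:", t["title"], "replies:", len(replies))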
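
The main program only prints what it finds, while the analysis section reads CSV files. One way to bridge the two is to write the collected dictionaries to the files shown in the project layout. The sketch below uses the standard-library `csv` module; `write_rows` is a hypothetical helper and the column list is taken from the thread dictionaries built in Step 2.

```python
import csv


def write_rows(path, rows, fieldnames):
    """Write a list of dicts to a CSV file with a header row.

    Keys that are not listed in fieldnames are ignored.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


# Example: persist the threads of one board.
#   threads = get_thread_list(sec["url"])
#   write_rows("data/posts.csv", threads,
#              ["title", "url", "author", "reply_count"])
```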