I'd like to open-source a distributed web crawler system I built recently. Would anyone be interested? - V2EX
    guoguobaba · 18 hours 33 minutes ago · 600 views

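    WebRPA: a distributed web crawler

    Introduction

    webrpa is a distributed web crawler system built on FastAPI + fastadmin. Crawl jobs are started through a web API, which makes it usable for process automation or automated data extraction. It has two parts: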

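    • manager: provides the web service, implementing both a web server and a websocket server. Users send requests through the web API, and the manager forwards each query over websocket to the appropriate worker, which executes it.
    • worker: connects to the manager over websocket, receives the manager's websocket requests, and executes the crawl tasks. Workers can be deployed in a distributed fashion across networks.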
    graph LR
        client-->manager-->worker1
        manager-->worker2
        manager-->workers[worker...]
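    To make the request path concrete, here is a minimal sketch of the manager-to-worker forwarding described above. It is not webrpa's code: an `asyncio.Queue` stands in for each websocket connection, and the names `Manager`, `worker_loop`, and the message fields `id`, `payload`, and `result` are assumptions made for illustration.

```python
import asyncio
import itertools
import json


class Manager:
    """Toy stand-in for the manager: one outbound queue per connected worker."""

    def __init__(self):
        self.workers = {}   # worker_id -> asyncio.Queue (stands in for a websocket)
        self.pending = {}   # request_id -> Future resolved when the worker replies
        self._ids = itertools.count(1)

    def register(self, worker_id):
        """Called when a worker connects; returns the queue it should read from."""
        queue = asyncio.Queue()
        self.workers[worker_id] = queue
        return queue

    async def submit(self, worker_id, payload):
        """Forward a client request to one worker and wait for its reply."""
        request_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self.pending[request_id] = future
        message = json.dumps({"id": request_id, "payload": payload})
        await self.workers[worker_id].put(message)
        return await future

    def on_reply(self, raw):
        """Match a worker's reply to the request that is waiting for it."""
        reply = json.loads(raw)
        self.pending.pop(reply["id"]).set_result(reply["result"])


async def worker_loop(worker_id, inbox, manager):
    """Toy worker: reads forwarded requests and reports a fake result."""
    while True:
        request = json.loads(await inbox.get())
        result = {"worker": worker_id, "echo": request["payload"]}
        manager.on_reply(json.dumps({"id": request["id"], "result": result}))


async def demo():
    manager = Manager()
    tasks = [
        asyncio.create_task(worker_loop(name, manager.register(name), manager))
        for name in ("worker1", "worker2")
    ]
    first = await manager.submit("worker1", {"name": "szreorc"})
    second = await manager.submit("worker2", {"name": "szreorc"})
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

    In a real deployment the queues would be websocket connections and the worker would drive a browser instead of echoing the payload. The request-id bookkeeping is the part that carries over.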

    The main features are:

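    • Customizable crawl flows: the operations performed on a page are specified in JSON to automate a process, and the flow itself is defined with Markdown flowchart syntax, so it can jump to different steps under different conditions.
    • Automated data extraction: page data can be processed with JS to return structured data, and a chosen region of the page can be screenshotted automatically.
    • Distributed deployment across networks: workers can be deployed across networks for load balancing, and dedicated workers can be assigned to specific query types.
    • Proxy pool support: custom proxy pools are supported.
    • undetected chromedriver: supports headless browsers and bypassing sites' anti-crawling detection. Queries can combine selenium and requests.
    • Custom captcha engine: supports automatic recognition of image captchas and other captcha types, plus click actions, with random jitter added to avoid antibot detection.
    • Caching and automatic retry: results can be cached, and failed requests are retried automatically during idle periods.
    • Auto scaling: supports k8s deployment with one-click scale-out.
    • Permissions and auditing: a per-data-source permission model gives different users different permissions on different data sources, and audit data is provided.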
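
    As an illustration of the load-balancing item, the sketch below shows one way a manager could choose a worker: round-robin over general workers, with dedicated workers preferred for the query types they are registered for. The post does not say how webrpa actually balances load, so the `WorkerPool` class and the round-robin policy are assumptions.

```python
from collections import defaultdict
from itertools import cycle


class WorkerPool:
    """Hypothetical worker registry with per-query-type dedicated workers."""

    def __init__(self):
        self._general = []                    # workers that accept any query
        self._dedicated = defaultdict(list)   # query_type -> dedicated workers
        self._cursors = {}                    # rotation state per candidate group

    def add(self, worker_id, query_types=None):
        """Register a worker; query_types=None means it accepts any query."""
        if query_types is None:
            self._general.append(worker_id)
        else:
            for query_type in query_types:
                self._dedicated[query_type].append(worker_id)
        self._cursors.clear()  # membership changed, rebuild rotations lazily

    def pick(self, query_type):
        """Return the next worker for this query type, round-robin."""
        has_dedicated = query_type in self._dedicated
        candidates = self._dedicated[query_type] if has_dedicated else self._general
        if not candidates:
            raise LookupError(f"no worker available for {query_type!r}")
        key = query_type if has_dedicated else None
        if key not in self._cursors:
            self._cursors[key] = cycle(list(candidates))
        return next(self._cursors[key])


pool = WorkerPool()
pool.add("worker1")
pool.add("worker2")
pool.add("worker3", query_types=["szreorc"])
```

    With this setup, `pool.pick("szreorc")` always returns the dedicated `worker3`, while any other query type alternates between `worker1` and `worker2`.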

    TODO

    Integrate browser use so that an LLM can create data-crawling services automatically.

    An example: crawling a website

    {
      "name": "szreorc",
      "desc": "Shenzhen real estate query",
      "driver": "firefox",
      "url": "",
      "debug": true,
      "window_size": "1920x1080",
      "action_timeout": 5,
      "wait_redirect": true,
      "wait_redirect_interval": 2,
      "identifier": "{username}-{BuildingName}-{UNIT_NO}",
      "credential": "{username}",
      "actions": {
        "1": { "desc": "Confirm login", "action": "check_variable", "options": {"script": "return window.location.href;", "target": "^https://pnr.sz.gov.cn/d-ghrer/reroosp/ytcf"} },
        "10": { "desc": "Log in with username and password", "action": "click", "timeout": 2, "target": ["xpath", "//a[contains(@class, 'login-tab') and normalize-space(text())='账号密码']"] },
        "11": { "desc": "Enter username", "action": "input_text", "target": ["xpath", "//input[@type='text' and @placeholder='请输入账号']"], "param": "username" },
        "12": { "desc": "Increment counter", "action": "variable", "options": {"variable": "counter1", "operator": "+"} },
        "13": { "desc": "Check counter", "action": "variable", "stop_on_fail": true, "options": {"variable": "counter1", "operator": "<", "target": 2, "sleep": 2000} },
        "14": { "desc": "Enter password", "action": "input_text", "target": ["xpath", "//input[@type='password' and @placeholder='请输入密码']"], "param": "password" },
        "15": { "desc": "Recognize captcha", "action": "decode_captcha_code", "target": ["xpath", "//div[contains(@class, 'captcha-body') and @title='点击刷新']"], "options": {"code_type": 11} },
        "16": { "desc": "Enter captcha", "action": "input_text", "target": ["xpath", "//div[contains(@class, 'account_verifying')] //input[@type='text']"] },
        "17": { "desc": "Click login", "action": "click", "target": ["xpath", "//button[contains(@class, 'gd-btn-primary') and contains(@class, 'gd-btn') and @type='button']//span[starts-with(text(), '登录 ')]"] },
        "18": { "desc": "Continue login", "action": "click", "target": ["xpath", "//button[.//span[contains(text(), '继续登录')]]"] },
        "20": { "desc": "Confirm selection", "action": "click", "timeout": 10, "stop_on_fail": true, "fail_message": "login failed", "options": {"set_credential": true}, "target": ["class name", "jinruxuzhi-checkbox"] },
        "21": { "desc": "Confirm selection, next step", "action": "click", "target": ["class name", "jinruxuzhi-buttonOk"] },
        "30": { "desc": "Expand query type", "action": "click", "options": {"sleep": 2}, "target": ["xpath", "//input[@type='text' and @placeholder='请选择']"] },
        "31": { "desc": "Wait for dropdown menu", "action": "wait_element", "options": {"visible": true}, "target": ["css selector", "div.el-select-dropdown.el-popper"] },
        "32": { "desc": "Select query type", "action": "click", "target": ["xpath", "//li[contains(@class, 'el-select-dropdown__item') and span[text()='楼名及栋名']]"] },
        "33": { "desc": "Enter query text", "action": "input_text", "target": ["xpath", "//input[@type='text' and @placeholder='请输入内容']"], "param": "BuildingName" },
        "34": { "desc": "Click search", "action": "click", "target": ["class name", "el-icon-search"] },
        "35": { "desc": "Click screenshot target", "action": "click", "timeout": 20, "stop_on_fail": true, "fail_message": "search failed", "target": ["xpath", "//div[contains(@class, 'el-dialog__wrapper')]//div[contains(@class, 'el-tabs__item') and normalize-space(text())='楼宇']"] },
        "40": { "desc": "Get data", "action": "get_data", "options": {"script": "var table = document.querySelector(\"#pane-1 table.is-bordered.el-descriptions--mini\");\nvar fields = [\"土地坐落\", \"楼名及栋名\", \"房屋类型\", \"房屋性质\", \"房屋用途\"];\nvar result = {};\nif (table) {\n var rows = table.querySelectorAll(\"tr.el-descriptions-row\");\n rows.forEach(function(row) {\n var label = row.querySelector(\"th.el-descriptions-item__label\").innerText.trim();\n var content = row.querySelector(\"td.el-descriptions-item__content\").innerText.trim();\n if (fields.includes(label)) {\n result[label] = content;\n }\n });\n console.log(JSON.stringify(result));\n} else {\n console.log(\"Table not found.\");\n};\nreturn result;\n"} },
        "41": { "desc": "Click screenshot target", "action": "click", "target": ["xpath", "//div[contains(@class, 'el-dialog__wrapper')]//div[contains(@class, 'el-tabs__item') and normalize-space(text())='房屋']"] },
        "42": { "desc": "Open the unit dropdown", "action": "click", "target": ["css selector", "#pane-2 input.el-input__inner"] },
        "43": { "desc": "Select the unit", "action": "click", "target": ["xpath", "//li[contains(@class, 'el-select-dropdown__item')]//span[text()='{UNIT_NO}']"], "param": "UNIT_NO" },
        "44": { "desc": "Screenshot", "action": "screenshot", "target": ["class name", "el-dialog__wrapper"], "options": {"visible": true} },
        "45": { "desc": "Get data", "action": "get_data", "options": {"script": "var table = document.querySelector(\"#pane-2 table.is-bordered.el-descriptions--mini\");\nvar fields = [\"房号\", \"所在楼层\", \"建筑面积\", \"使用年限\", \"存在抵押\", \"存在查封\", \"存在异议\", \"存在居住权\"];\nvar result = {};\nif (table) {\n var rows = table.querySelectorAll(\"tr.el-descriptions-row\");\n rows.forEach(function(row) {\n var label = row.querySelector(\"th.el-descriptions-item__label\").innerText.trim();\n var content = row.querySelector(\"td.el-descriptions-item__content\").innerText.trim();\n if (fields.includes(label)) {\n result[label] = content;\n }\n });\n console.log(JSON.stringify(result));\n} else {\n console.log(\"Table not found.\");\n};\nreturn result;\n"} }
      },
      "processes": "start->1\n1(no)->10->11\n11(no)->12->13\n13(yes)->10\n11(yes)->14->15->16->17->18->20->21->30->31->32->33->34->35->40->41->42->43->44->45->end\n1(yes)->20",
      "result": ["screenshot", "data"]
    }
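
    The `processes` string is the least obvious part of the example, so here is a rough sketch of how it could be parsed and walked. This is my reading of the format, not webrpa's parser: I assume each line is a chain of action ids joined by `->`, that a `(yes)` or `(no)` suffix picks the branch taken when that action succeeds or fails, and that the flow stops when no edge matches. The helper names `parse_processes` and `walk` are hypothetical.

```python
import re

# A node token is an action id with an optional (yes)/(no) branch suffix.
TOKEN = re.compile(r"^(\w+)(?:\((yes|no)\))?$")

PROCESSES = (
    "start->1\n1(no)->10->11\n11(no)->12->13\n13(yes)->10\n"
    "11(yes)->14->15->16->17->18->20->21->30->31->32->33->34->35"
    "->40->41->42->43->44->45->end\n1(yes)->20"
)


def split_token(token):
    """Split '11(yes)' into ('11', 'yes'); a plain '10' gives ('10', None)."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError(f"unrecognized node: {token!r}")
    return match.group(1), match.group(2)


def parse_processes(spec):
    """Turn the chained lines into {node: {branch: next_node}}."""
    graph = {}
    for line in spec.splitlines():
        tokens = [part.strip() for part in line.split("->") if part.strip()]
        for source, destination in zip(tokens, tokens[1:]):
            node, branch = split_token(source)
            target, _ = split_token(destination)
            graph.setdefault(node, {})[branch] = target
    return graph


def walk(graph, succeeded, start="start", limit=100):
    """Follow edges from start, asking succeeded(node) only at branching nodes."""
    path = []
    node = start
    while node is not None and node != "end" and len(path) < limit:
        path.append(node)
        edges = graph.get(node, {})
        if "yes" in edges or "no" in edges:
            branch = "yes" if succeeded(node) else "no"
            node = edges.get(branch, edges.get(None))
        else:
            node = edges.get(None)
    return path


graph = parse_processes(PROCESSES)
# Already logged in: action 1 succeeds, so the login steps 10-18 are skipped.
logged_in_path = walk(graph, succeeded=lambda node: True)
# Not logged in: action 1 fails, so the flow goes through the login steps.
fresh_path = walk(graph, succeeded=lambda node: node != "1")
```

    Read this way, `1(yes)->20` is the shortcut for a session that is already logged in, and `11(no)->12->13` followed by `13(yes)->10` is a retry loop bounded by `counter1`.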
    1 reply    2025-11-07 16:50:44 +08:00
    #1 hamwong · 18 hours 24 minutes ago
    How do you handle more complex human verification, such as slider captchas?