How can a single spider-flow crawler scrape several pages of the same type in sequence? - V2EX

How can a single spider-flow crawler scrape several pages of the same type in sequence?

    tiRolin · 2023-08-07 16:00:44 +08:00 · 1296 views

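    I need to use the spider-flow framework to scrape the following three pages: https://price.21food.cn/product/939.html https://price.21food.cn/product/1505.html https://price.21food.cn/product/196.html

    I have already built a working crawler for one of these URLs. The three pages differ only in their data, so a single crawler should be able to handle all of them. With Selenium I solved this by building a collection of URLs and iterating over it with a for loop, but I am finding the same thing hard to do in spider-flow.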
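For readers who have not seen the for-loop approach mentioned above, here is a minimal sketch of the pattern in Python. It is not the poster's Selenium code: it uses only the standard library, and the names `fetch_html` and `crawl_all`, the 10-second timeout, and the UTF-8 decoding are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen

URLS = [
    "https://price.21food.cn/product/939.html",
    "https://price.21food.cn/product/1505.html",
    "https://price.21food.cn/product/196.html",
]


def fetch_html(url: str) -> str:
    """Download one page (timeout and encoding are assumed values)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_all(urls: Iterable[str],
              fetch: Callable[[str], str] = fetch_html) -> Dict[str, str]:
    """Run the same fetch step once per URL and collect the pages by URL."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"skipping {url}: {exc}")
    return pages
```

Passing the fetch step in as a parameter keeps the loop separate from the download logic, so the same loop can be exercised with a stub instead of real network calls.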

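    My idea is to define a collection of URLs first and then crawl them in a loop, so I built the flow shown below.

    [Screenshot: the spider-flow flow]

    The first node is a define-variable node that declares urlList, a collection of the three addresses: ["https://price.21food.cn/product/939.html","https://price.21food.cn/product/1505.html","https://price.21food.cn/product/196.html"]

    The second node is a loop. It defines an index variable, urlIndex, and takes its loop count from urlList (${list.length(urlList)} in the XML below).

    The third node defines the variable url with the value ${urlList[urlIndex]}, which simply picks the current URL out of the collection above.

    The fourth node, the request node (开始抓取, "start crawling"), uses that variable as its address, with the value ${url}.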
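The intent of these four nodes can be written out as ordinary code. The sketch below is plain Python, not spider-flow syntax. It also assumes, based only on the three URLs in the post, that the pages differ solely in a numeric product id; `URL_TEMPLATE`, `build_url_list`, and `iterate_by_index` are hypothetical names.

```python
# Assumed pattern, inferred from the three URLs quoted in the post.
URL_TEMPLATE = "https://price.21food.cn/product/{product_id}.html"
PRODUCT_IDS = [939, 1505, 196]


def build_url_list(product_ids):
    """Node 1: build the collection of URLs (the urlList variable)."""
    return [URL_TEMPLATE.format(product_id=pid) for pid in product_ids]


def iterate_by_index(url_list):
    """Nodes 2 to 4: loop over an index, look up the URL, hand it to the request step."""
    visited = []
    for url_index in range(len(url_list)):   # loop count = length of the list
        url = url_list[url_index]            # url = urlList[urlIndex]
        visited.append((url_index, url))     # stands in for the request node
    return visited


for url_index, url in iterate_by_index(build_url_list(PRODUCT_IDS)):
    print(url_index, url)
```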

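    Everything after that is the scraping logic itself, which works; I have tested it before. The construction looks fine to me, but when I actually run it, the flow ends right after the first define-variable node finishes.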
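One thing that may be worth checking, offered as a guess and not a confirmed diagnosis: in the flow XML further down, the urlList value is entered as plain text, while the other variables use a `${...}` expression. If spider-flow keeps such a value as a string (an assumption that nothing in this thread confirms), a later length or index operation would see characters instead of URLs. The Python sketch below only illustrates that general string-versus-list difference; it is not spider-flow code.

```python
import json

# The same text as the urlList value, held as a plain string.
raw_value = (
    '["https://price.21food.cn/product/939.html",'
    '"https://price.21food.cn/product/1505.html",'
    '"https://price.21food.cn/product/196.html"]'
)

# As a string, length counts characters and indexing yields one character.
print(len(raw_value))
print(raw_value[0])      # prints [

# Once parsed, the same text is a real three-element list.
url_list = json.loads(raw_value)
print(len(url_list))     # prints 3
print(url_list[0])
```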

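    I have searched through many tutorials online but could not find one, or any example, that covers this requirement, and for some reason I cannot open the documentation on the official site. I am out of ideas, so I am asking here. If anyone knows how to do this, I would be grateful for your advice. Thanks in advance.

    The spider-flow repository on Gitee: https://gitee.com/ssssssss-team/spider-flow

    Download the project and open it in IDEA, run the db.sql script that ships with the project against your database, and set the database address in the configuration file. It should then run correctly. The default address is localhost:8088.

    Below is the crawler I built. To run it, choose the XML edit option in spider-flow and paste this content in.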

    <mxGraphModel> <root> <mxCell id="0"> <JsonProperty as="data"> {&quot;spiderName&quot;:&quot;食品商务网爬虫(未整合多个网址)&quot;,&quot;submit-strategy&quot;:&quot;random&quot;,&quot;threadCount&quot;:&quot;&quot;} </JsonProperty> </mxCell> <mxCell id="1" parent="0"/> <mxCell id="2" value="开始" style="start" parent="1" vertex="1"> <mxGeometry x="300" y="80" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;shape&quot;:&quot;start&quot;} </JsonProperty> </mxCell> <mxCell id="3" value="开始抓取" style="request" parent="1" vertex="1"> <mxGeometry x="490" y="80" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;开始抓取&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;method&quot;:&quot;GET&quot;,&quot;sleep&quot;:&quot;&quot;,&quot;timeout&quot;:&quot;&quot;,&quot;response-charset&quot;:&quot;&quot;,&quot;retryCount&quot;:&quot;&quot;,&quot;retryInterval&quot;:&quot;&quot;,&quot;body-type&quot;:&quot;none&quot;,&quot;body-content-type&quot;:&quot;text/plain&quot;,&quot;loopCount&quot;:&quot;&quot;,&quot;url&quot;:&quot;${url}&quot;,&quot;proxy&quot;:&quot;&quot;,&quot;request-body&quot;:&quot;&quot;,&quot;follow-redirect&quot;:&quot;1&quot;,&quot;tls-validate&quot;:&quot;1&quot;,&quot;cookie-auto-set&quot;:&quot;1&quot;,&quot;repeat-enable&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;request&quot;} </JsonProperty> </mxCell> <mxCell id="4" value="定义变量" style="variable" parent="1" vertex="1"> <mxGeometry x="620" y="80" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;dataList&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${extract.xpaths(resp.html,&#39;/html/body/div[2]/div[3]/div/div[2]/div[1]/div[2]/div[2]/ul/li&#39;)}&quot;],&quot;shape&quot;:&quot;variable&quot;} </JsonProperty> </mxCell> <mxCell id="9" value="" 
style="strokeWidth=2;sharp=1;" parent="1" source="3" target="4" edge="1"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="11" value="循环" style="loop" parent="1" vertex="1"> <mxGeometry x="620" y="170" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;循环&quot;,&quot;loopItem&quot;:&quot;&quot;,&quot;loopVariableName&quot;:&quot;index&quot;,&quot;loopCount&quot;:&quot;${list.length(dataList)}&quot;,&quot;loopStart&quot;:&quot;0&quot;,&quot;loopEnd&quot;:&quot;-1&quot;,&quot;shape&quot;:&quot;loop&quot;} </JsonProperty> </mxCell> <mxCell id="12" value="" style="strokeWidth=2;sharp=1;" parent="1" source="4" target="11" edge="1"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="13" value="输出" style="output" parent="1" vertex="1"> <mxGeometry x="790" y="334" width="32" height="32" as="geometry"/> <JsonProperty as="data"> 
{&quot;value&quot;:&quot;输出&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;tableName&quot;:&quot;&quot;,&quot;csvName&quot;:&quot;&quot;,&quot;csvEncoding&quot;:&quot;GBK&quot;,&quot;output-name&quot;:[&quot;产品名&quot;,&quot;市场&quot;,&quot;规格&quot;,&quot;最高价格&quot;,&quot;平均价格&quot;,&quot;最低价格&quot;,&quot;日期&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;output-value&quot;:[&quot;${name}&quot;,&quot;${market}&quot;,&quot;${specifications}&quot;,&quot;${top}&quot;,&quot;${avg}&quot;,&quot;${low}&quot;,&quot;${dataDate}&quot;],&quot;output-all&quot;:&quot;0&quot;,&quot;output-database&quot;:&quot;0&quot;,&quot;output-csv&quot;:&quot;0&quot;,&quot;shape&quot;:&quot;output&quot;} </JsonProperty> </mxCell> <mxCell id="15" value="定义变量" style="variable" parent="1" vertex="1"> <mxGeometry x="620" y="250" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;name&quot;,&quot;market&quot;,&quot;specifications&quot;,&quot;top&quot;,&quot;avg&quot;,&quot;low&quot;,&quot;dataDate&quot;],&quot;variable-description&quot;:[&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${dataList[index].selectors(&#39;table tbody tr td a&#39;)[0].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td a&#39;)[1].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[0].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[1].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[3].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[2].text()}&quot;,&quot;${dataList[index].selectors(&#39;table tbody tr td span&#39;)[4].text()}&quot;],&quot;shape&quot;:&quot;variable&quot;} </JsonProperty> </mxCell> <mxCell id="16" value="" 
style="strokeWidth=2;sharp=1;" parent="1" source="11" target="15" edge="1"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="18" value="" style="strokeWidth=2;sharp=1;" parent="1" source="15" target="13" edge="1"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="27" value="定义变量" style="variable" parent="1" vertex="1"> <mxGeometry x="90" y="440" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;urlList&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;[\&quot;https://price.21food.cn/product/939.html\&quot;,\&quot;https://price.21food.cn/product/1505.html\&quot;,\&quot;https://price.21food.cn/product/196.html\&quot;]&quot;],&quot;shape&quot;:&quot;variable&quot;} </JsonProperty> </mxCell> <mxCell id="29" value="循环" style="loop" parent="1" vertex="1"> <mxGeometry x="180" y="440" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;循环&quot;,&quot;loopItem&quot;:&quot;&quot;,&quot;loopVariableName&quot;:&quot;urlIndex&quot;,&quot;loopCount&quot;:&quot;${list.length(urlList)}&quot;,&quot;loopStart&quot;:&quot;0&quot;,&quot;loopEnd&quot;:&quot;-1&quot;,&quot;shape&quot;:&quot;loop&quot;} </JsonProperty> 
</mxCell> <mxCell id="31" value="定义变量" style="variable" parent="1" vertex="1"> <mxGeometry x="262" y="440" width="32" height="32" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;定义变量&quot;,&quot;loopVariableName&quot;:&quot;&quot;,&quot;variable-name&quot;:[&quot;url&quot;],&quot;variable-description&quot;:[&quot;&quot;],&quot;loopCount&quot;:&quot;&quot;,&quot;variable-value&quot;:[&quot;${urlList[urlIndex]}&quot;],&quot;shape&quot;:&quot;variable&quot;} </JsonProperty> </mxCell> <mxCell id="42" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="27" target="29"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="43" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="29" target="31"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="44" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="2" target="27"> <mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> <mxCell id="45" value="" style="strokeWidth=2;sharp=1;" edge="1" parent="1" source="31" target="3"> 
<mxGeometry relative="1" as="geometry"/> <JsonProperty as="data"> {&quot;value&quot;:&quot;&quot;,&quot;exception-flow&quot;:&quot;0&quot;,&quot;lineWidth&quot;:&quot;2&quot;,&quot;line-style&quot;:&quot;sharp&quot;,&quot;lineColor&quot;:&quot;black&quot;,&quot;condition&quot;:&quot;&quot;,&quot;transmit-variable&quot;:&quot;1&quot;} </JsonProperty> </mxCell> </root> </mxGraphModel> 
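Because the export is hard to read as one block, a small script can list which node feeds which. The sketch below uses Python's standard XML parser on a cut-down, hypothetical flow with the same element shape as the export above (`mxCell` elements carrying a `JsonProperty` child); `SAMPLE_FLOW` and `summarize_flow` are illustrative names, and it has not been run against the full export.

```python
import json
import xml.etree.ElementTree as ET

# A cut-down, hypothetical flow shaped like the export above.
SAMPLE_FLOW = """
<mxGraphModel><root>
  <mxCell id="0"/>
  <mxCell id="1" parent="0"/>
  <mxCell id="2" value="start" style="start" parent="1" vertex="1">
    <JsonProperty as="data">{&quot;shape&quot;:&quot;start&quot;}</JsonProperty>
  </mxCell>
  <mxCell id="27" value="urlList" style="variable" parent="1" vertex="1">
    <JsonProperty as="data">{&quot;shape&quot;:&quot;variable&quot;}</JsonProperty>
  </mxCell>
  <mxCell id="44" value="" edge="1" parent="1" source="2" target="27">
    <JsonProperty as="data">{&quot;condition&quot;:&quot;&quot;}</JsonProperty>
  </mxCell>
</root></mxGraphModel>
"""


def summarize_flow(xml_text):
    """Return (nodes, edges): node id -> shape, and (source, target) id pairs."""
    root = ET.fromstring(xml_text.strip())
    nodes, edges = {}, []
    for cell in root.iter("mxCell"):
        if cell.get("edge") == "1":
            edges.append((cell.get("source"), cell.get("target")))
        elif cell.get("vertex") == "1":
            prop = cell.find("JsonProperty")
            # The parser has already turned &quot; back into quotes.
            data = json.loads(prop.text) if prop is not None and prop.text else {}
            nodes[cell.get("id")] = data.get("shape", cell.get("style"))
    return nodes, edges


nodes, edges = summarize_flow(SAMPLE_FLOW)
for source, target in edges:
    print(f"{source} ({nodes.get(source)}) -> {target} ({nodes.get(target)})")
```

Read by hand, the edges in the export above connect the nodes in the order 2 → 27 → 29 → 31 → 3 → 4 → 11 → 15 → 13, which matches the four steps described in the post followed by the existing scraping logic.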
    Reply 1 · tiRolin (OP) · 2023-08-07 16:39:35 +08:00
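    I also want to ask how to simulate a click in this framework. In the examples I have seen, a new page is opened by extracting a URL, concatenating it, and starting a new crawl on the result.
    But some of the pages I want to scrape have no URL present in the HTML; the browser only jumps to the new address after a click. In code I can perform that action with Selenium, but how would I do it in spider-flow?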
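For reference, the approach from the examples (take a link out of the HTML, join it with the page address, then crawl the result) looks roughly like this in Python. It covers only links that are actually present in the HTML, so it does not solve the click-driven case asked about here. The sample markup is hypothetical, and `LinkCollector` and `extract_links` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin handles both root-relative and page-relative links.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical markup; the real page structure is not shown in the thread.
sample = (
    '<ul><li><a href="/product/1505.html">A</a></li>'
    '<li><a href="196.html">B</a></li></ul>'
)
print(extract_links(sample, "https://price.21food.cn/product/939.html"))
```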