My crawler got trapped in a wildcard-DNS site farm. How can it crawl its way out? - V2EX
dsg001

My crawler got trapped in a wildcard-DNS site farm. How can it crawl its way out?

  •   dsg001 Sep 7, 2016 5223 views
    Supplement 1    Sep 8, 2016
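    For now I'm doing simple filtering on link text / all page text. I'm not worrying about false positives yet; getting the crawler out comes first.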
    15 replies    2024-07-09 16:54:45 +08:00
    hack
        1
    hack  
       Sep 7, 2016
    My site farm gets several GB crawled per day by Baidu, Google, and Shenma. I've stopped worrying about it.
    wjm2038
        2
    wjm2038  
       Sep 7, 2016 via Android
    @hack Share a domain so we can take a look.
    hack
        3
    hack  
       Sep 7, 2016
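    @wjm2038 No, relax. If a crawler can recognize a site farm, it can break out of it. In practice, though, existing search engines are quite limited in their ability to identify site farms.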
    wjm2038
        4
    wjm2038  
       Sep 7, 2016 via Android
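    @hack I looked at the site the OP gave... Does the crawler stop on its own? It feels like any crawler that doesn't learn by itself will get stuck in there.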
    hack
        5
    hack  
       Sep 7, 2016
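    @wjm2038 A crawler records its own tasks and picks up again next time. It's quite normal for crawlers to pull a few hundred GB a month. A site farm exists to funnel traffic anyway, so it doesn't matter as long as the crawlers don't crawl the server to death.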
    zhjits
        6
    zhjits  
       Sep 7, 2016
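    Either discard any domain that contains a run of four or more digits, or randomly flip one bit of the subdomain and fetch again; if more than 90% of the page is identical, classify it as a junk site.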
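
The two heuristics in this reply could be sketched roughly as follows. This is only an illustration under stated assumptions: "flipping one bit" is approximated by replacing one character of the leftmost label, page similarity is measured with `difflib.SequenceMatcher`, and `fetch` is a hypothetical callable (host in, page text or `None` out) that you would supply yourself. The function names are made up for this example.

```python
import random
import re
import string
from difflib import SequenceMatcher

DIGIT_RUN = re.compile(r"\d{4,}")  # four or more consecutive digits


def has_long_digit_run(host):
    """True if the hostname contains a run of four or more digits."""
    return DIGIT_RUN.search(host) is not None


def mutate_subdomain(host, rng):
    """Replace one character of the leftmost label with a different one."""
    label, _, rest = host.lower().partition(".")
    if not label or not rest:
        return host  # nothing to mutate
    i = rng.randrange(len(label))
    alphabet = string.ascii_lowercase + string.digits
    replacement = rng.choice([c for c in alphabet if c != label[i]])
    return label[:i] + replacement + label[i + 1:] + "." + rest


def looks_like_farm(host, fetch, rng=None, threshold=0.9):
    """Apply both heuristics; `fetch` is a caller-supplied page fetcher."""
    rng = rng or random.Random()
    if has_long_digit_run(host):
        return True
    mutated = mutate_subdomain(host, rng)
    if mutated == host.lower():
        return False
    original_page = fetch(host)
    mutated_page = fetch(mutated)
    if original_page is None or mutated_page is None:
        return False  # the neighbouring subdomain does not exist
    ratio = SequenceMatcher(None, original_page, mutated_page).ratio()
    return ratio > threshold
```

`SequenceMatcher` is slow on large pages, so a real crawler would probably compare a truncated or hashed form of the page instead.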
    dsg001
        7
    dsg001  
    OP
       Sep 8, 2016
    @zhjits Similarity doesn't help. Everything is pulled in at random, and the frames are written in by JS.
    wyntergreg
        8
    wyntergreg  
       Sep 8, 2016
    Don't you record the sites you've already crawled? Not retracing your steps should always work, right?
    dsg001
        9
    dsg001  
    OP
       Sep 8, 2016
    @wyntergreg It's a wildcard-DNS site farm with an unlimited number of subdomains, so recording doesn't help.
    bombless
        10
    bombless  
       Sep 8, 2016
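    Record the number of visits per subdomain, then cap the visits allowed for each subdomain.
    As for sites that make heavy use of third- and fourth-level subdomains, they're not worth crawling anyway, lol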
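
A minimal sketch of this suggestion, assuming an in-memory counter and made-up limits (`max_pages_per_host`, `max_labels`). The class name `HostBudget` is hypothetical. Counting dot-separated labels is a naive stand-in for "subdomain depth": it would wrongly reject hosts under multi-part suffixes such as `www.example.co.uk`, which a real crawler would need the Public Suffix List to handle.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Caps pages fetched per hostname and skips deeply nested hostnames."""

    def __init__(self, max_pages_per_host=50, max_labels=3):
        self.max_pages_per_host = max_pages_per_host
        self.max_labels = max_labels  # a.example.com has 3 labels
        self.visits = Counter()

    def allow(self, url):
        """Return True and count the visit if the URL is within budget."""
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            return False
        if host.count(".") + 1 > self.max_labels:
            return False  # e.g. x.y.example.com: skip deeper levels
        if self.visits[host] >= self.max_pages_per_host:
            return False  # this subdomain has used up its budget
        self.visits[host] += 1
        return True
```

The crawler would call `allow(url)` before queueing each link and drop the link when it returns `False`.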
    xderam
        11
    xderam  
       Sep 8, 2016
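    An ordinary domain usually won't have more than a hundred or so subdomains, right? Check the number of subdomains first, then crawl.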
    dsg001
        12
    dsg001  
    OP
       Sep 8, 2016
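    @xderam
    @bombless
    Limiting the number of subdomains causes far too many false positives. github.io, blogspot.com, and the like all have huge numbers of subdomains.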
    exch4nge
        13
    exch4nge  
       Sep 9, 2016 via iPhone
    @dsg001 You could look up the ranking of the main domain.
    haitang
        14
    haitang  
       Sep 9, 2016   1
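    If it's wildcard DNS, then once a main domain shows too many subdomains, try resolving several meaningless subdomains, such as random combinations of a few letters and digits, and repeat the check several times. If they resolve and open as something other than a 404 or similar, it's almost certainly a junk site.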
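
The probe described in this reply might look something like the sketch below. It assumes the standard-library `socket` and `urllib.request` modules for the default check; the function names, the label length of 10, and the default of 3 probes are arbitrary choices for illustration. The `probe` parameter is injectable so the logic can be exercised without network access.

```python
import http.client
import random
import socket
import string
import urllib.request


def random_label(rng, length=10):
    """A meaningless subdomain label made of lowercase letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def default_probe(host, timeout=5.0):
    """True if the host resolves and answers HTTP without an error status."""
    try:
        socket.gethostbyname(host)
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, http.client.HTTPException):
        # DNS failures, connection errors, and HTTP 4xx/5xx all land here.
        return False


def has_wildcard_site(domain, probes=3, probe=default_probe, rng=None):
    """True if every randomly named subdomain of `domain` serves a page."""
    rng = rng or random.Random()
    return all(probe(f"{random_label(rng)}.{domain}") for _ in range(probes))
```

Running the probe only after a domain's subdomain count passes some threshold, as the reply suggests, keeps the extra DNS and HTTP requests to a minimum.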
    yq70Wfm8y9vY6yh3
        15
    yq70Wfm8y9vY6yh3  
       Jul 9, 2024
    16c4a