https://github.com/intohole/xspider — yes, it's reinventing the wheel! But let's get familiar with it together.
main.py:

```python
from xspider.spider.spider import BaseSpider
from xspider.filters import urlfilter
from kuailiyu import KuaiLiYu

if __name__ == "__main__":
    spider = BaseSpider(name="kuailiyu",
                        page_processor=KuaiLiYu(),
                        allow_site=["kuailiyu.cyzone.cn"],
                        start_urls=["http://kuailiyu.cyzone.cn/"])
    spider.url_filters.append(urlfilter.UrlRegxFilter(
        [r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
         r"kuailiyu.cyzone.cn/index_[0-9]+\.html$"]))
    spider.start()
```

kuailiyu.py:

```python
from xspider import processor
from xspider.selector import xpath_selector
from xspider import model

class KuaiLiYu(processor.PageProcessor.PageProcessor):
    def __init__(self):
        super(KuaiLiYu, self).__init__()
        # Extract the page title via XPath
        self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

    def process(self, page, spider):
        items = model.fileds.Fileds()
        items["title"] = self.title_extractor.find(page)
        items["url"] = page.url
        return items
```
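To make the URL-filtering step concrete: `UrlRegxFilter` above is given a list of regex patterns and is presumably used by the spider to decide which discovered links to crawl. Here is a minimal, standalone sketch of that idea (the class `RegexUrlFilter` and its `accept` method are hypothetical names for illustration — xspider's actual internals are not shown in the post):

```python
import re

class RegexUrlFilter:
    """Sketch of regex-based URL filtering (hypothetical re-implementation)."""

    def __init__(self, patterns):
        # Pre-compile each pattern once for reuse across many URLs
        self.patterns = [re.compile(p) for p in patterns]

    def accept(self, url):
        # Keep the URL if any pattern matches somewhere in it
        return any(p.search(url) for p in self.patterns)

f = RegexUrlFilter([
    r"kuailiyu\.cyzone\.cn/article/[0-9]*\.html$",
    r"kuailiyu\.cyzone\.cn/index_[0-9]+\.html$",
])
print(f.accept("http://kuailiyu.cyzone.cn/article/12345.html"))  # True
print(f.accept("http://kuailiyu.cyzone.cn/about.html"))          # False
```

Note the raw strings (`r"..."`) so `\.` reaches the regex engine intact, and `search` rather than `match`, since the patterns are anchored at the end (`$`) rather than at the start of the URL.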
1 xiaozizayang 2017-11-23 16:25:48 +08:00
2 tamlok 2017-11-23 16:49:58 +08:00 via Android
3 intohole OP @xiaozizayang I'll take a look and learn from it.
5 j1wu 2017-11-23 20:00:21 +08:00 A JavaScript version as an assist — learning from everyone here, Orz https://github.com/j1wu/cli-scraper
6 zhangysh1995 2017-11-23 21:39:56 +08:00 I happen to be learning web scraping right now, so I'm bookmarking this. Keep it up, OP!
8 intohole OP @zhangysh1995 The APIs inside haven't been tidied up yet. This spider was developed specifically for situations where machines are scarce — trading time for hardware.
9 sparkssssssss 2017-12-01 11:12:16 +08:00 Marked — will study this.