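This topic was created 2396 days ago; the information in it may have evolved or changed since then.

This was reposted from my blog.

Before getting into scrapy's deduplication, think about how we deduplicate when crawling with requests. requests is only a downloader and provides no deduplication of its own, so we have to do it ourselves. A typical approach is to define a dedup set up front and check whether the URL to be crawled is already in it, like this: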

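    crawled_urls = set()

    def check_url(url):
        """Return True if url has not been crawled yet, and record it."""
        if url not in crawled_urls:
            # Remember the URL, otherwise the set never fills up.
            crawled_urls.add(url)
            return True
        return False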

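At this point the set lives in memory. As the crawler fetches more pages, the set keeps growing. What can we do about that?

Read on and you'll find out.

Deduplication in scrapy

Turning deduplication off for a request in scrapy is simple: just set dont_filter to True on the request object, for example: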

yield scrapy.Request(url, callback=self.get_response, dont_filter=True)

Here is how the source code does it:

    _fingerprint_cache = weakref.WeakKeyDictionary()

    def request_fingerprint(request, include_headers=None):
        if include_headers:
            include_headers = tuple(to_bytes(h.lower())
                                    for h in sorted(include_headers))
        cache = _fingerprint_cache.setdefault(request, {})
        if include_headers not in cache:
            fp = hashlib.sha1()
            fp.update(to_bytes(request.method))
            fp.update(to_bytes(canonicalize_url(request.url)))
            fp.update(request.body or b'')
            if include_headers:
                for hdr in include_headers:
                    if hdr in request.headers:
                        fp.update(hdr)
                        for v in request.headers.getlist(hdr):
                            fp.update(v)
            cache[include_headers] = fp.hexdigest()
        return cache[include_headers]

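There were too many comments, so I removed them. Here is the docstring (Google Translate plus manual edits):

Return the request fingerprint.

The request fingerprint is a hash that uniquely identifies the resource the request points to. For example, take the following two URLs:

http://www.example.com/query?id=111&cat=222
http://www.example.com/query?cat=222&id=111

Even though these are two different URLs, both point to the same resource and are equivalent (that is, they should return the same response).

Another example is cookies used to store session IDs. Suppose the following page is only accessible to authenticated users:

http://www.example.com/members/offers.html

Many sites use cookies to store the session ID, which adds a random component to the HTTP request and should therefore be ignored when calculating the fingerprint.

For this reason, request headers are ignored by default when calculating the fingerprint. If you want to include specific headers, use the include_headers argument, which is a list of request headers to include.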

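In other words: scrapy hashes each request object with SHA-1 (a hash, not encryption), producing a 40-character hexadecimal digest such as 'fad8cefa4d6198af8cb1dcf46add2941b4d32d78'.

Looking at the source, the key part is the three update calls below: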

    fp = hashlib.sha1()
    fp.update(to_bytes(request.method))
    fp.update(to_bytes(canonicalize_url(request.url)))
    fp.update(request.body or b'')

If no headers are passed via include_headers, only the method, the URL, and the binary body go into the hash. Let's compute a few fingerprints:

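    import scrapy
    from scrapy.utils.request import request_fingerprint

    # Same resource, query arguments in a different order
    print(request_fingerprint(scrapy.Request('http://www.example.com/query?id=111&cat=222')))
    print(request_fingerprint(scrapy.Request('http://www.example.com/query?cat=222&id=111')))
    # A different resource (no query string)
    print(request_fingerprint(scrapy.Request('http://www.example.com/query')))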

Output:

    fad8cefa4d6198af8cb1dcf46add2941b4d32d78
    fad8cefa4d6198af8cb1dcf46add2941b4d32d78
    b64c43a23f5e8b99e19990ce07b75c295165a923

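As you can see, the first and second fingerprints are identical. That is because canonicalize_url is called on the URL before hashing. The function behaves as follows: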

    >>> import w3lib.url
    >>>
    >>> # sorting query arguments
    >>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
    'http://www.example.com/do?a=50&b=2&b=5&c=3'
    >>>
    >>> # UTF-8 conversion + percent-encoding of non-ASCII characters
    >>> w3lib.url.canonicalize_url(u'http://www.example.com/r\u00e9sum\u00e9')
    'http://www.example.com/r%C3%A9sum%C3%A9'
    >>>
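To see the idea without installing scrapy, here is a minimal sketch that uses only the standard library. The names simple_canonicalize and simple_fingerprint are made up for this illustration, and the canonicalization only sorts query arguments, so it is far simpler than w3lib's canonicalize_url and its digests should not be assumed to match scrapy's in general.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url):
    """Sort query arguments so that equivalent URLs compare equal.

    Hypothetical, simplified stand-in for w3lib.url.canonicalize_url.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is dropped; scheme, host and path are kept as they are.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def simple_fingerprint(method, url, body=b""):
    """Hash method + canonical URL + body with SHA-1, as described above."""
    fp = hashlib.sha1()
    fp.update(method.encode("utf-8"))
    fp.update(simple_canonicalize(url).encode("utf-8"))
    fp.update(body or b"")
    return fp.hexdigest()


if __name__ == "__main__":
    print(simple_fingerprint("GET", "http://www.example.com/query?id=111&cat=222"))
    print(simple_fingerprint("GET", "http://www.example.com/query?cat=222&id=111"))
    print(simple_fingerprint("GET", "http://www.example.com/query"))
```

The first two lines printed are the same digest and the third differs, which is the same pattern as in the scrapy output above.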

By default, scrapy keeps its dedup set in memory, so if the job restarts, everything in that in-memory set is lost.
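One way to picture the restart problem, and a simple way around it, is to mirror the set to a file. The sketch below is only an illustration: FileBackedSeen and demo_restart are hypothetical names, not scrapy code. (Scrapy has its own persistence mechanism through the JOBDIR setting, which this sketch does not try to reproduce.)

```python
import os
import tempfile


class FileBackedSeen:
    """Keep fingerprints in a set and append each new one to a text file."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        # Reload fingerprints written by an earlier run, if any.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.seen.update(line.strip() for line in f if line.strip())

    def add_if_new(self, fingerprint):
        """Return True if the fingerprint is new, and record it on disk."""
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(fingerprint + "\n")
        return True


def demo_restart():
    """Simulate a restart by creating a second object on the same file."""
    fp = "fad8cefa4d6198af8cb1dcf46add2941b4d32d78"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "seen.txt")
        first_run = FileBackedSeen(path)
        was_new = first_run.add_if_new(fp)
        second_run = FileBackedSeen(path)  # the "restarted" crawler
        still_new = second_run.add_if_new(fp)
        return was_new, still_new
```

After the simulated restart the fingerprint is still recognized, because the second object reloads it from the file instead of starting from an empty set.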

Deduplication in scrapy-redis

scrapy-redis replaces scrapy's scheduler and dedup filter, so you need to change the following two settings in settings:

    # Enables scheduling storing requests queue in redis.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Ensure all spiders share same duplicates filter through redis.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

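Typically you will then see two keys in redis: one holding the dedup set and one holding the seed URLs.

Let's look at the code first. The important part: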

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

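When a scrapy.http.Request comes in, self.request_fingerprint is called first to compute its fingerprint (scrapy's SHA-1 hash described above), and the fingerprint is then added to redis.

What this function does: it computes the request's fingerprint and adds it to the dedup set in redis. If the fingerprint was already there, it returns True.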
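The whole check hinges on the return value of SADD: 1 if the member was newly added, 0 if it already existed. Below is a minimal sketch of that logic that runs without a redis server. FakeRedisSet is a hypothetical in-memory stand-in written for this example, not part of redis-py or scrapy-redis, and the key name is made up.

```python
class FakeRedisSet:
    """Hypothetical in-memory stand-in that mimics SADD's return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        # Like redis SADD with one member: 1 if added, 0 if already present.
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1


def request_seen(server, key, fingerprint):
    """Return True if the fingerprint was already in the dedup set."""
    added = server.sadd(key, fingerprint)
    return added == 0


if __name__ == "__main__":
    server = FakeRedisSet()
    key = "myspider:dupefilter"  # assumed key name, for illustration only
    fp = "fad8cefa4d6198af8cb1dcf46add2941b4d32d78"
    print(request_seen(server, key, fp))  # first time: not seen yet
    print(request_seen(server, key, fp))  # second time: already seen
```

Because the check and the insert happen in a single SADD call, several spiders sharing one redis instance do not need a separate "exists" query before adding.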

As we can see, once DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" is added to settings, a new dedup set key appears in redis. Here are the pros and cons of doing it this way:

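Pro: the dedup set moves out of process memory and into redis, so it can be reused even after the crawler restarts or shuts down. You can use SCHEDULER_PERSIST to control whether it is kept.
Con: if there are too many fingerprints to deduplicate, redis takes up too much space. 8 GB = 8,589,934,592 bytes; at an average of 40 bytes per fingerprint, that holds about 214,748,000 fingerprints (roughly 200 million), and that is before counting redis's own per-entry overhead. So for crawlers that walk relationship networks, storing the set in redis may not work well, and keeping it in memory is no better. That is where the Bloom filter comes in.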
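The arithmetic above can be checked in a few lines. This is a back-of-the-envelope sketch that counts only the bytes of the fingerprints themselves; fingerprints_that_fit is a name invented for this example.

```python
def fingerprints_that_fit(budget_bytes, fingerprint_bytes=40):
    """How many fixed-size fingerprints fit into a byte budget (payload only)."""
    return budget_bytes // fingerprint_bytes


EIGHT_GB = 8 * 1024 ** 3  # 8,589,934,592 bytes

if __name__ == "__main__":
    # 40 bytes: the SHA-1 digest stored as 40 hexadecimal characters.
    print(fingerprints_that_fit(EIGHT_GB))
    # The same digest in raw binary form is only 20 bytes long.
    raw = bytes.fromhex("fad8cefa4d6198af8cb1dcf46add2941b4d32d78")
    print(len(raw), fingerprints_that_fit(EIGHT_GB, len(raw)))
```

Storing raw 20-byte digests would double the count, but the growth is still linear in the number of URLs, which is the real issue the Bloom filter addresses.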

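Bloom filter

The idea: an element is passed through k hash functions, which map it to k bit positions, and those bits are set to 1 in a bitmap. To check membership you only need to test whether those bits are all 1. If any one of them is 0, the element is definitely not in the set; if all of them are 1, it is probably in the set (only probably, because other elements may have been mapped to the same bits).

This also means an element cannot be deleted from a Bloom filter, and you can never be certain that an element really is in the set. It also introduces false positives: as more data is added, a "probably in the set" answer becomes less and less reliable (hash collisions can cause an element outside the set to be treated as a member).

In short, the Bloom filter's drawback is misjudgment: an element that is not in the set may be wrongly reported as present, whereas an element that is in the set is always reported as present. In addition, data cannot be removed from it.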
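The description above fits in a few lines of Python. This is a minimal sketch for illustration, not how pybloomfilter or ScrapyRedisBloomFilter are implemented: the class name TinyBloomFilter, the default sizes, and the choice of deriving the k hash functions from SHA-1 with a seed byte are all assumptions made for this example.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: k seeded hashes over a fixed-size bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=6):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # must stay below 256 (one seed byte)
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        """Yield the k bit positions for an item."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set all k bits for the item to 1."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True only if all k bits are 1, i.e. 'probably in the set'."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = TinyBloomFilter()
    for fruit in ("apple", "pear", "orange"):
        seen.add(fruit)
    print("apple" in seen)  # an added element is always reported as present
    print("mike" in seen)
```

There is no remove method on purpose: clearing an element's bits could also clear bits shared with other elements, which is why deletion is not supported.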

    >>> import pybloomfilter
    >>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
    >>> fruit.update(('apple', 'pear', 'orange', 'apple'))
    >>> len(fruit)
    3
    >>> 'mike' in fruit
    False
    >>> 'apple' in fruit
    True

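That was an example of using pybloomfilter on Python 3.

So how do you use a Bloom filter in scrapy? Cui (崔大大) has already written one, ScrapyRedisBloomFilter. It is packaged and can be installed directly: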

pip install scrapy-redis-bloomfilter

Configure it in settings like this:

    # Ensure use this Scheduler
    SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

    # Ensure all spiders share same duplicates filter through redis
    DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

    # Redis URL
    REDIS_URL = 'redis://localhost:6379/0'

    # Number of Hash Functions to use, defaults to 6
    BLOOMFILTER_HASH_NUMBER = 6

    # Redis Memory Bit of Bloomfilter Usage, 30 means 2^30 = 128MB, defaults to 30
    BLOOMFILTER_BIT = 30

    # Persist
    SCHEDULER_PERSIST = True

This, too, works by replacing the scheduler and the dedup filter. Have a look at it if you are interested.
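To get a feel for what BLOOMFILTER_BIT = 30 and BLOOMFILTER_HASH_NUMBER = 6 mean in practice, the textbook approximation for a Bloom filter's false-positive rate, p ≈ (1 - e^(-kn/m))^k, can be evaluated directly. This is a rough sketch: it assumes ideal, independent hash functions, and the item counts are arbitrary examples, so the real behavior of the library may differ.

```python
import math


def bloom_false_positive_rate(num_items, bit_exponent=30, num_hashes=6):
    """Approximate false-positive rate for m = 2**bit_exponent bits.

    Uses the standard estimate p = (1 - exp(-k * n / m)) ** k.
    """
    m = 2 ** bit_exponent
    return (1 - math.exp(-num_hashes * num_items / m)) ** num_hashes


if __name__ == "__main__":
    # 2**30 bits expressed in megabytes (matches the 128MB in the comment).
    print(2 ** 30 / 8 / 1024 ** 2)
    # Example loads, chosen arbitrarily for illustration.
    for n in (10_000_000, 100_000_000, 200_000_000):
        print(n, bloom_false_positive_rate(n))
```

Under these assumptions, 100 million fingerprints in 128 MB give a false-positive rate below one percent, while the same number of 40-byte fingerprints stored in full would need about 4 GB.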

4 replies 2019-03-27 10:35:19 +08:00