
The code below is meant to collect, from the Wikipedia article on Kevin Bacon, every link that points to another article. Links to other article pages share these traits:

- they all sit inside the div whose id is bodyContent
- their URLs contain no colon
- their URLs all start with /wiki/

The code:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen(" https://en.wikipedia.org "+articleUrl)
    bsObj = BeautifulSoup(html)
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
print(links)
```

Running it fails with the traceback below. Why does this happen? Thanks for any pointers!

```
Traceback (most recent call last):
  File "c:\Users\A\AppData\Roaming\Code\User\test\2.py", line 11, in <module>
    links = getLinks("/wiki/Kevin_Bacon")
  File "c:\Users\A\AppData\Roaming\Code\User\test\2.py", line 8, in getLinks
    html = urlopen(" https://en.wikipedia.org "+articleUrl)
  File "D:\Python\Python3\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python\Python3\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "D:\Python\Python3\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "D:\Python\Python3\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "D:\Python\Python3\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "D:\Python\Python3\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>
```

1  impyf104  2017-08-29 23:35:11 +08:00
html = urlopen("https://en.wikipedia.org"+articleUrl)
Don't put spaces inside the string.
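To see why reply #1 fixes the error: the spaces inside the quoted base URL become part of the URL itself, so urlopen tries to resolve a hostname that is not "en.wikipedia.org", and the DNS lookup fails with [Errno 11004] getaddrinfo failed. A minimal stdlib-only sketch (no network access; the strings are the ones from the question) showing the difference:

```python
# The stray spaces around the base URL end up inside the final URL,
# which is why urlopen's DNS resolution (getaddrinfo) fails.
base = " https://en.wikipedia.org "   # stray spaces, as in the question
article = "/wiki/Kevin_Bacon"

broken = base + article               # " https://en.wikipedia.org /wiki/Kevin_Bacon"
fixed = base.strip() + article        # strip() removes the surrounding spaces

assert broken != "https://en.wikipedia.org/wiki/Kevin_Bacon"
assert fixed == "https://en.wikipedia.org/wiki/Kevin_Bacon"
print(fixed)  # → https://en.wikipedia.org/wiki/Kevin_Bacon
```

The simplest fix is what reply #1 shows: write the base URL without any spaces in the first place.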
2  Tompes  2017-08-30 00:19:32 +08:00 via Android
3  linhua  2017-08-30 08:25:26 +08:00
DNS resolution error; you need to go through a proxy.
4  saximi  OP
I'm running the script in VS Code. After removing the spaces from the URL, it now prints the warning below. If I change "BeautifulSoup(html)" to "BeautifulSoup(html, 'lxml')" the warning goes away, and the message says it comes from not explicitly specifying a parser. What is a parser, and what do I have to configure in VS Code so that the plain "BeautifulSoup(html)" call runs without the warning? Thanks for any pointers!

```
D:\Python\Python3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 91 of the file C:\Users\A\.vscode\extensions\donjayamanne.python-0.7.0\pythonFiles\PythonTools\visualstudio_py_launcher.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
```
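A parser, here, is the component that turns raw HTML text into a navigable tree of tags. BeautifulSoup itself is a wrapper that delegates to one of several parsers (the standard library's html.parser, lxml, html5lib), and the warning only asks you to name one explicitly in the call, e.g. BeautifulSoup(html, "lxml") or BeautifulSoup(html, "html.parser"). It is a change in the code, not a VS Code setting. To illustrate what the parser layer actually does, here is a rough stdlib-only sketch (the LinkCollector class and the sample HTML are made up for this example) that mimics the findAll filter from the question using html.parser directly:

```python
from html.parser import HTMLParser  # the stdlib parser BeautifulSoup can wrap as "html.parser"

class LinkCollector(HTMLParser):
    """Collects hrefs that start with /wiki/ and contain no colon,
    roughly what the findAll("a", href=re.compile(...)) call does."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value.startswith("/wiki/") and ":" not in value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<div id="bodyContent"><a href="/wiki/Kevin_Bacon">KB</a>'
            '<a href="/wiki/Category:Actors">cat</a></div>')
print(parser.links)  # → ['/wiki/Kevin_Bacon']  (the Category: link is filtered out)
```

Different parsers can build slightly different trees from the same malformed HTML, which is why BeautifulSoup warns you to pin one explicitly rather than rely on whatever happens to be installed.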
5  Vernsu  2017-08-31 17:27:23 +08:00
Reading "Web Scraping with Python" (《Python 网络数据采集》)? lmgtfy