requests reports the encoding as 'Windows-1254'. How do I convert it to 'utf-8' or 'gbk'? (I get an error saying \ufffd cannot be converted)

    if codeFormat == 'Windows-1254':
        # htmlStr = r.text.decode('windows-1254').encode('utf8')
        htmlStr = r.text.encode(r.encoding).decode(r.apparent_encoding)
        # htmlStr = r.encode('windows-1254').decode('utf8')
        # tmpStr = r.text.encode(r.encoding).decode('gbk')
        print(htmlStr)

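Whether I use requests or requests-html, no matter how I convert, it complains that some characters cannot be converted...
In theory, .encode(r.encoding).decode(r.apparent_encoding) should work for almost anything, right?
Is encoding handling a weak spot in Python?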

Error ''' codec can't encode character '\ufffd' in position 190: illegal multibyte sequence' happened on line 973

8 replies 2019-11-11 18:16:26 +08:00

sun1991

November 9, 2019

Maybe codeFormat itself is wrong and the content is not actually in the encoding that codeFormat names. I have run into pages whose declared charset was wrong before.

uti6770werty

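November 9, 2019

@sun1991 I'm sure codeFormat is correct (I also checked it in a debug console, and it is definitely Windows-1254).
Could this be an anti-scraping measure?
```
# Fragment from inside a function; r and session are defined earlier.
codeFormat = r.apparent_encoding
tmpStr = ' '
if codeFormat == 'ISO-8859-1':
    # Undo the decode that r.text applied, then decode the bytes as GBK.
    tmpStr = r.text.encode(r.encoding).decode('gbk')
session.close()
return tmpStr
```

This case, for example, converts just fine.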

weyou

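November 9, 2019 via Android

Windows-1254 is probably the encoding that requests guessed, and the guess is not necessarily accurate. It also suggests that the HTTP headers do not declare the body's encoding. If you know the real encoding of the data, just use r.content.decode(real_encoding).encode("utf8" or "gbk"). Note that gbk will not necessarily succeed unless you are sure every character is in the GBK character set.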
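
A minimal sketch of the approach described above: decode the raw bytes with the real encoding, then re-encode. The helper name `transcode` and the sample bytes are made up for illustration; `raw` stands in for `r.content`, and GB18030 is only an assumed "real" encoding here.

```python
def transcode(raw, source_encoding, target_encoding, errors="strict"):
    """Decode raw bytes with the known source encoding, then re-encode.

    `errors` applies only to the encode step, where the target charset
    may be too small to hold every character.
    """
    text = raw.decode(source_encoding)
    return text.encode(target_encoding, errors=errors)


# Stand-in for r.content; the emoji is outside the GBK character set.
raw = "编码测试😀".encode("gb18030")

# UTF-8 can represent any Unicode text, so this step cannot fail.
utf8_bytes = transcode(raw, "gb18030", "utf-8")

# GBK is smaller: strict encoding raises, errors="replace" substitutes "?".
try:
    transcode(raw, "gb18030", "gbk")
    gbk_strict_failed = False
except UnicodeEncodeError:
    gbk_strict_failed = True

gbk_bytes = transcode(raw, "gb18030", "gbk", errors="replace")
```

Whether to use `errors="replace"`, `errors="ignore"`, or to let the exception surface depends on how much data loss is acceptable when targeting GBK.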

ysc3839

November 10, 2019 via Android

Where does "codeFormat" in the code come from?
Also, if possible, post the URL, or the raw data downloaded with curl/wget.

ysc3839

November 10, 2019 via Android

I see that codeFormat is r.apparent_encoding. In that case, set r.encoding = r.apparent_encoding to change the current encoding, then read r.text.
See https://stackoverflow.com/a/52615216
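
A small sketch of that suggestion, written against any response-like object so it runs without a network call. The helper name `text_with_detected_encoding` is hypothetical; `encoding`, `apparent_encoding`, and `text` are the attributes that a requests `Response` exposes.

```python
def text_with_detected_encoding(response):
    """Switch the response to its detected encoding, then return the text.

    `response` is assumed to behave like requests.Response: `text` is
    decoded using whatever `encoding` is set to at the time it is read.
    """
    detected = response.apparent_encoding
    if detected:  # detection can come back empty for very short bodies
        response.encoding = detected
    return response.text


# Typical use with requests (not executed here):
#   r = requests.get(url)
#   html = text_with_detected_encoding(r)
```

This only helps when the detected encoding is right. If the detector guessed wrong, the text will still be garbled, and decoding `r.content` with a known encoding is the safer route.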

gwy15

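November 10, 2019

Your problem is that using r.text directly involves an implicit decode. If requests guesses the encoding wrong, bytes that cannot be decoded are replaced with '\ufffd' instead of being decoded correctly. When you need to decode manually, work from the raw bytes, e.g. data = r.content.decode('utf8').encode('utf8')

Also, if utf8, gbk, and Windows-1254 all fail, you can try GB18030, or copy out the raw value and paste it into a website that guesses encodings.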
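
One way to sketch the "try several encodings" idea is a strict decode over a list of candidates, stopping at the first one that succeeds. The function name and the candidate order are assumptions for illustration, not something from the thread.

```python
def decode_first_match(raw, candidates=("utf-8", "gbk", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Decoding is strict, so a wrong candidate raises instead of silently
    inserting U+FFFD. Put the strictest encodings first: UTF-8 rejects
    most non-UTF-8 input, while single-byte codecs accept almost anything.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched: %r" % (candidates,))


# Stand-in for r.content: GBK bytes that are not valid UTF-8.
text, used = decode_first_match("中文".encode("gbk"))
```

A clean decode does not prove the encoding is right, it only rules out the ones that fail, so the result is still worth a visual check.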

uti6770werty

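November 11, 2019

@sun1991
@weyou
@ysc3839
@gwy15

Thanks, everyone.
After thinking about it for two days, my conclusion, within the limits of what I know, is that this is the only way.
When converting to utf-8 from an encoding with a larger character set, there will always be some characters that cannot be converted, and it is not a problem with the conversion method.
In the end I called replace('\ufffd','') on the string and then wrote it to a file as utf-8. The content does not look incomplete or missing.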
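
For context on where '\ufffd' comes from: as far as I know, r.text decodes the body with errors="replace", so any byte the chosen codec cannot decode becomes U+FFFD before your own code sees the string. The sketch below reproduces that with plain bytes (the sample data is made up). It suggests the character marks a decode that already went wrong, and that stripping it hides the loss, not that UTF-8 lacks any characters.

```python
# Stand-in for r.content: GBK bytes, which are not valid UTF-8.
raw = "中文".encode("gbk")

# Decoding with the wrong codec and errors="replace" yields U+FFFD.
lossy = raw.decode("utf-8", errors="replace")
has_replacement_char = "\ufffd" in lossy

# U+FFFD is not in the GBK character set, so encoding back to GBK fails
# with the same kind of error as in the original question.
try:
    lossy.encode("gbk")
    gbk_encode_failed = False
except UnicodeEncodeError:
    gbk_encode_failed = True

# Stripping U+FFFD removes the markers, not the damage: the original
# characters are already gone at this point.
stripped = lossy.replace("\ufffd", "")

# Decoding the raw bytes with the right codec loses nothing.
correct = raw.decode("gbk")
```

If the stripped output looks complete, the undecodable bytes were probably few, but the only way to be sure is to decode `r.content` with the page's actual encoding.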

sohusi

November 11, 2019

Check whether the response is compressed. The requests library cannot handle brotli compression.
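
A quick way to check this is to look at the Content-Encoding response header. The sketch below takes any header mapping (for example `r.headers`); the helper name is hypothetical. If it reports "br" and the body was not decompressed, the bytes are still compressed and no charset will decode them. As far as I know, whether requests decompresses brotli depends on brotli support being installed in the environment, so treat that part as an assumption to verify.

```python
def content_encodings(headers):
    """Return the compression tokens from a Content-Encoding header.

    Header names are matched case-insensitively so a plain dict works
    as well as the case-insensitive mapping that requests provides.
    """
    value = ""
    for key, val in headers.items():
        if key.lower() == "content-encoding":
            value = val
    return [token.strip().lower() for token in value.split(",") if token.strip()]


# Example with a plain dict standing in for r.headers.
encodings = content_encodings({"content-encoding": "br"})
is_brotli = "br" in encodings
```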