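I've recently started learning web crawling, mainly to scrape comments from portal sites, and I've chosen Python. Could anyone recommend good learning materials or point me in a good direction?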

This topic was created 4226 days ago; the information mentioned may have changed or developed since.

Crawler

Scraping

Python

16 replies 2014-10-05 18:24:36 +08:00

mrytsr

Oct 4, 2014 via Android

Scrapy

mhycy

Oct 4, 2014

Hand-written....
Requests + re + threading + logging
Works great in all sorts of ways~

P.S. Actually, I just find frameworks too inflexible.
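
A minimal sketch of what the hand-written Requests + re + threading + logging combination could look like. The comment markup matched by `COMMENT_RE`, the function names `fetch_with_requests` and `crawl`, and the worker count are all assumptions made for illustration, not something taken from this thread; a real site needs its own pattern.

```python
import logging
import re
import threading
from queue import Queue, Empty

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(message)s")
log = logging.getLogger("crawler")

# Hypothetical pattern: assumes each comment sits in <p class="comment">...</p>.
COMMENT_RE = re.compile(r'<p class="comment">(.*?)</p>', re.S)


def fetch_with_requests(url):
    """Download one page with Requests (third-party: pip install requests)."""
    import requests  # imported lazily so the rest runs without it installed

    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text


def crawl(urls, fetch=fetch_with_requests, workers=4):
    """Fetch every URL on a few threads; return {url: [comment, ...]}."""
    todo = Queue()
    for url in urls:
        todo.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except Empty:
                return  # queue drained, this thread is done
            try:
                comments = COMMENT_RE.findall(fetch(url))
            except Exception:
                # Log the failure and move on to the next URL.
                log.exception("failed: %s", url)
                continue
            with lock:
                results[url] = comments
            log.info("%s: %d comments", url, len(comments))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `crawl(list_of_urls)` uses Requests for the downloads; passing a different `fetch` callable swaps the download step without touching the threading or parsing code, which is roughly the flexibility being praised here.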

paulw54jrn

Oct 4, 2014

If it's not very complex, go with what the poster above said:
requests + re + threading/greenlets

Or what the poster two above said:
Scrapy..

ShiehShieh

Oct 4, 2014

Are there any good materials to learn from? 0.0

binux

Oct 4, 2014

https://github.com/binux/pyspider
You deserve it.

no13bus

Oct 4, 2014

@binux It seems Tornado is often used for monitoring; Flower, Celery's monitoring tool, is built with it.

XadillaX

Oct 4, 2014

-。- Why don't more people learn Node for writing crawlers?

chemzqm

Oct 4, 2014

Node's async callbacks are too ugly and its memory usage is too high; a low-spec machine can't run more than a few processes.

R4rvZ6agNVWr56V0

Oct 4, 2014

I once wrote one myself with Twisted, and only later found out there was a crawler framework called Scrapy. I recommend Scrapy.

Codist

Oct 4, 2014

Scrapy is simple and convenient, and its selectors are pleasant to use, so you don't have to write regexes anymore.

kenis

Oct 5, 2014

I recommend Scrapy. It's a fairly mature crawler framework, and there are plenty of resources for it.

cha1

Oct 5, 2014

http://jecvay.com/category/smtech/python3-webbug/

https://github.com/Yixiaohan/codeparkshare#%E5%85%AB%E7%88%AC%E8%99%AB%E4%BB%A5%E5%8F%8A%E6%A8%A1%E6%8B%9F%E7%99%BB%E9%99%86%E6%96%B0%E6%B5%AA%E5%BE%AE%E5%8D%9A

As for frameworks, see what everyone above suggested.

briefcopy

Oct 5, 2014

WebCollector:
http://www.brieftools.info/document/webcollector/

imn1

Oct 5, 2014

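I scrape a large volume, so I split the process: wget does the fetching and Python does the parsing, 95% of it with regexes and a small part with lxml + XPath.
Whatever you use, there's no getting around reading through the HTTP protocol and learning a packet capture tool.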
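
A rough sketch of the parse half of such a fetch/parse split, assuming the pages were already downloaded as `.html` files into one directory (for example by wget). The directory layout, the function name `parse_saved_pages`, and the comment markup matched by `COMMENT_RE` are hypothetical and only stand in for whatever the real site uses.

```python
import re
from pathlib import Path

# Hypothetical pattern: assumes each comment sits in <div class="comment">...</div>.
COMMENT_RE = re.compile(r'<div class="comment">\s*(.*?)\s*</div>', re.S)


def parse_saved_pages(directory):
    """Return {filename: [comment, ...]} for every .html file in directory."""
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        # errors="replace" keeps one badly encoded page from stopping the run.
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = COMMENT_RE.findall(html)
    return results
```

Because this step never touches the network, it can be rerun as often as needed while the pattern is being tuned, without fetching anything again.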

ericls

Oct 5, 2014 via Android

requests pyquery

helloworld00

Oct 5, 2014

Quickly building a real-time scraping cluster

http://blog.nosqlfan.com/html/2604.html
