Problem

When using requests to hit the same IP repeatedly, especially at a high request rate, it is easy to run into the error Max retries exceeded with url.
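The wording of the error comes from urllib3, which requests uses under the hood: once its retry budget for a connection is exhausted, the ConnectionError bubbles up. As background only (this is not from the original post), a minimal sketch of a complementary mitigation is to mount an HTTPAdapter with an explicit retry/backoff policy; the URL and retry numbers below are placeholders:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times with exponential backoff on common transient server errors
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

res = session.get("https://www.example.com/", timeout=10)  # placeholder URL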

Analysis

There are two parts to the fix: first, close the response promptly with close(); second, wrap the request in try/except so it can be retried in a loop, using sleep() to keep the request rate low.

import logging
import requests

logger = logging.getLogger(__name__)

def get(url):
    try:
        res = requests.get(url)
        # Raise an exception if the status code is not 200
        res.raise_for_status()
        # Close the connection !!! -- very important
        res.close()
    except Exception as e:
        logger.error(e)
    else:
        return res.json()
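A minimal usage sketch (the URL is a placeholder): because get() returns None on failure, the caller can decide whether to retry or move on.

data = get("https://api.example.com/items")  # placeholder URL
if data is None:
    print("request failed and was logged; retry later or skip")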

Someone else's code for scraping Zhihu articles uses the same approach (wrapped in a function here so the snippet with its return statement is self-contained):

import requests
from time import sleep

def fetch_html(url, headers, querystring, ipList):
    html = ""
    # The request may be rejected by Zhihu, so resend it in a loop with sleep(),
    # keeping the frequency reasonably low
    while html == "":
        try:
            proxies = get_random_ip(ipList)  # pick a proxy from the pool (defined elsewhere)
            print("Using proxy this time: {}".format(proxies))
            r = requests.request("GET", url, headers=headers, params=querystring, proxies=proxies)
            r.encoding = 'utf-8'
            html = r.text
            return html
        except:
            print("Connection refused by the server..")
            print("Let me sleep for 5 seconds")
            print("ZZzzzz...")
            sleep(5)
            print("Was a nice sleep, now let me continue...")
            continue
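get_random_ip() is not shown in the original post. A plausible sketch, assuming ipList is a list of "host:port" proxy strings, is to pick one at random and return it in the dict format that requests expects for the proxies argument:

import random

def get_random_ip(ipList):
    # Hypothetical helper: ipList is assumed to be a list like ["1.2.3.4:8080", ...].
    # Pick one entry at random and build the proxies dict requests expects.
    ip = random.choice(ipList)
    return {"http": "http://" + ip, "https": "http://" + ip}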

Putting it together

How do we keep the whole scraping run going, so that when Python 3 hits an error it only skips the current iteration and continues? Wrap each iteration's work in try/except:

import time

if __name__ == '__main__':
    for i in range(80):
        url = 'https://www.xxxx.com/qiye/{}.htm'.format(1000 - i)
        try:
            getData(url)  # call the scraping function here; an error is not a problem
        except:
            print(url + " Let me sleep for 5 seconds")
            print("ZZzzzz...")
            time.sleep(5)
            print("Was a nice sleep, now let me continue...")
            continue
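getData() is where the actual scraping logic goes; it is not defined in the original post. A minimal sketch, assuming it only needs to fetch the page and following the raise_for_status()/close() advice above (the parsing step is a placeholder):

import requests

def getData(url):
    # Hypothetical implementation following the advice above:
    # raise on non-200 responses and close the connection promptly.
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'
    html = res.text
    res.close()
    # ... parse html here (placeholder) ...
    print("fetched {} characters from {}".format(len(html), url))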