When you use requests to hit the same IP repeatedly, especially at a high request rate, it is easy to run into the error "Max retries exceeded with url".
Two things help: first, call close() promptly to release the connection; second, wrap the request in try/except and retry in a loop, using sleep() to keep the request rate low.
import logging
import requests

logger = logging.getLogger(__name__)

def get(url):
    try:
        res = requests.get(url)
        # Raise an exception if the status code is not 200
        res.raise_for_status()
        # Close the connection -- very important!
        res.close()
    except Exception as e:
        logger.error(e)
    else:
        return res.json()
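Putting the two ideas together (closing the response and retrying after a pause), a minimal sketch might look like the following; the function name get_with_retry and the retry count and delay are just illustrative assumptions:

import time
import requests

def get_with_retry(url, max_tries=3, delay=5):
    # max_tries and delay are arbitrary example values
    for attempt in range(max_tries):
        try:
            res = requests.get(url, timeout=10)
            res.raise_for_status()
            data = res.json()
            res.close()  # release the connection promptly
            return data
        except Exception as e:
            print("attempt {} failed: {}".format(attempt + 1, e))
            time.sleep(delay)  # keep the request rate low before retrying
    return None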
Someone else's Zhihu scraper uses the same approach:
html="" while html == "":#因为请求可能被知乎拒绝,采用循环+sleep的方式重复发送,但保持频率不太高 try: proxies = get_random_ip(ipList) print("这次试用ip:{}".format(proxies)) r = requests.request("GET", url, headers=headers, params=querystring, proxies=proxies) r.encoding = 'utf-8' html = r.text return html except: print("Connection refused by the server..") print("Let me sleep for 5 seconds") print("ZZzzzz...") sleep(5) print("Was a nice sleep, now let me continue...") continue
How do you keep the whole scraping run going? In other words, when Python 3 raises an error, how do you skip only the current iteration and let the loop continue?
import time

if __name__ == '__main__':
    for i in range(80):
        url = 'https://www.xxxx.com/qiye/{}.htm'.format(1000 - i)
        try:
            getData(url)  # put your scraping function here; errors are caught below
        except:
            print(url + " Let me sleep for 5 seconds")
            print("ZZzzzz...")
            time.sleep(5)
            print("Was a nice sleep, now let me continue...")
            continue
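As an alternative to hand-rolled retry loops, requests can also retry failed connections at the transport level by mounting urllib3's Retry on a Session; a sketch with example parameter values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                                     # at most 3 retries per request
    backoff_factor=1,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

resp = session.get('https://www.xxxx.com/qiye/1000.htm', timeout=10)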