I have the code:
from bs4 import BeautifulSoup
import requests
head = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1)'}
proxi = {
    'http': 'http://195.9.149.198:8081',
}
query = input('What are you searching for ?:')
number = input('How many pages:')
url = 'http://www.google.com/search?q='
page = requests.get(url + query, headers=head, proxies=proxi)
for index in range(int(number)):
    soup = BeautifulSoup(page.text, "html.parser")
    next_page = soup.find("a", class_="fl")
    next_link = "https://www.google.com" + next_page["href"]
    h3 = soup.find_all("h3", class_="r")
    for elem in h3:
        elem = elem.contents[0]
        link = "https://www.google.com" + elem["href"]
        print(link)
    page = requests.get(next_link)
I worked with this code for half a day, sending many requests to collect URLs, and everything went fine except when I added inurl: to the query.
Then, after a certain number of requests, I started getting this error constantly, even without inurl: in the query:
TypeError: 'NoneType' object is not subscriptable
I understand that a captcha appears, with a message saying that suspicious traffic is coming from my network, and that Google blocks me because of this. I tried adding headers and sending the requests through a proxy server, but the error still shows up. What should I do?
Answer 1, authority 100%
A possible solution is to wrap all the code inside the for loop in try ... except and sleep for a while at the point where the error is handled. I think it is a good idea to increase this delay a little after each captcha: for example, start at five seconds and add one second each time. A possible implementation:
from time import sleep
from bs4 import BeautifulSoup
import requests
head = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1)'}
proxi = {
    'http': 'http://195.9.149.198:8081',
}
# Initial pause in seconds; it grows by one after every captcha
time_to_sleep_when_captcha = 5
query = input('What are you searching for ?:')
number = input('How many pages:')
url = 'http://www.google.com/search?q='
page = requests.get(url + query, headers=head, proxies=proxi)
for index in range(int(number)):
    try:
        soup = BeautifulSoup(page.text, "html.parser")
        # soup.find returns None when the page has no such link (e.g. a captcha page)
        next_page = soup.find("a", class_="fl")
        next_link = "https://www.google.com" + next_page["href"]
        h3 = soup.find_all("h3", class_="r")
        for elem in h3:
            elem = elem.contents[0]
            link = "https://www.google.com" + elem["href"]
            print(link)
        page = requests.get(next_link)
    except TypeError:
        # Indexing None raised the error from the question: wait before continuing
        sleep(time_to_sleep_when_captcha)
        time_to_sleep_when_captcha += 1
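The "sleep a bit longer after each captcha" idea can also be separated from the parsing code. Below is a minimal sketch of that, not a definitive implementation: the helper name retry_with_growing_delay and its parameters (first_delay, step, max_attempts, sleep) are made up for illustration, and it assumes the captcha shows up as the TypeError described in the question. Unlike a plain loop, it calls the action again after sleeping and gives up after a fixed number of attempts.

```python
import time


def retry_with_growing_delay(action, first_delay=5, step=1, max_attempts=5, sleep=time.sleep):
    """Call action() until it succeeds, sleeping a little longer after each TypeError.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    delay = first_delay
    for attempt in range(max_attempts):
        try:
            return action()
        except TypeError:
            # Out of attempts: re-raise so the caller sees the original error
            if attempt == max_attempts - 1:
                raise
            sleep(delay)
            delay += step  # 5 s, then 6 s, then 7 s, ...
```

Here action would be a function that requests one results page and parses it, so every retry sends a fresh request instead of re-parsing the same captcha page. Passing sleep as a parameter makes it easy to try the helper without actually waiting.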
Answer 2, authority 94%
TypeError: 'NoneType' object is not subscriptable
This error occurs when you try to access a None object by index or key.
>>> t = None
>>> t[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable
Check the places where you access a value by index or key (the traceback shows the line number where the error occurred).
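In the question's code the likely culprit is an expression such as next_page["href"]: BeautifulSoup's find returns None when no matching element exists, which is what happens on a captcha page. A small sketch of checking for None before indexing follows. The function name extract_href and the base parameter are illustrative assumptions, not part of the original code.

```python
def extract_href(tag, base="https://www.google.com"):
    """Return base + tag['href'], or None when the element was not found.

    Hypothetical helper: `tag` is whatever soup.find(...) returned.
    """
    if tag is None:
        # Nothing matched, e.g. a captcha page was served instead of results
        return None
    return base + tag["href"]
```

With this, next_link = extract_href(soup.find("a", class_="fl")) yields None instead of raising, and the loop can test for None to decide whether to pause or stop.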