Indexation of pages is something SEOs are constantly interested in. Going through Search Console, clicking URLs one by one and waiting for a response is a time-waster.
And don't we all hate time-wasting?
Recently Google also limited the number of manual requests to only 10 per 24 hours. This is a real blocker for larger sites and teams. With the Indexing API we're gonna raise that to 200 a day, a whopping twentyfold increase, while cutting the execution time to almost nothing.
We'll build a Python program that submits URLs to Google's Indexing API. It will have two modes: automatic and manual.
Code will be available in this repository
Fire up your favorite Python IDE and create a new project in it.
pip install google-api-python-client google-auth-oauthlib requests
As with APIs in general, I like to start with simple boilerplate code that tests the authentication flow and confirms we're authenticated and getting a good response from the API.
Replace local/client_secret.json with the path to the OAuth client secret file you downloaded from GCP.
from __future__ import print_function
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ['https://www.googleapis.com/auth/indexing']


def authenticate():
    creds = None
    if os.path.exists('local/token.json'):
        creds = Credentials.from_authorized_user_file('local/token.json', SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'local/client_secret.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('local/token.json', 'w') as token:
            token.write(creds.to_json())
    return creds


if __name__ == '__main__':
    credentials = authenticate()
    print(credentials)
When you run this, it will open a browser window and ask you to complete the authentication flow. Finally, it should print a line that looks like this:
<google.oauth2.credentials.Credentials object at 0x7fe75c157b05>
Given that this is working as expected, we can move on to the next step: passing a URL to the Indexing API and checking the response.
Now let's write another function whose job is to submit a single URL for indexing.
First we import some additional libraries:
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import time
Next we add another function, update_single_url(url, creds). It takes the URL and the credentials as arguments.
def update_single_url(url, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": "URL_UPDATED"
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        print(update_single.get("urlNotificationMetadata"))
        # request = service.new_batch_http_request(callback=callback)
    except HttpError as err:
        print(err)
    time.sleep(0.5)
Let's also modify our main block a bit:
if __name__ == '__main__':
    credentials = authenticate()
    update_single_url("https://sentienttechnology.pro/", credentials)
Right, so now you might think we could just write a for-loop, pass multiple URLs to this function and let it index them. However, there is a faster and more resource-efficient option I want to showcase: a built-in feature of the client library called BatchHttpRequest. It lets us bundle several requests and send them all at once, combining up to 100 notifications into a single network request. The service object we build exposes this through a method called new_batch_http_request().
Let's see how this would look like in code.
def update_batch_url(batch, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        batch_request = service.new_batch_http_request()
        for url in batch:
            callback = {
                "url": url,
                "type": "URL_UPDATED"
            }
            batch_request.add(service.urlNotifications().publish(body=callback))
        batch_request.execute()
    except HttpError as err:
        print(err)
    time.sleep(0.5)
Then update the main block to the following:
if __name__ == '__main__':
    credentials = authenticate()
    # update_single_url("https://sentienttechnology.pro/", credentials)
    # You can also parse a csv or txt document into a list as well.
    url_batch = [
        'https://sentienttechnology.pro/',
        'https://sentienttechnology.pro/about/',
        'https://sentienttechnology.pro/articles/rethinking-lighthouse-performance/'
    ]
    update_batch_url(url_batch, credentials)
Now, if we run this, you will notice that nothing gets printed out. We can of course print the batch_request object itself, but it holds no response data. However, as long as it runs without errors, we can consider the request successful. If you hit a rate limit, you will receive a 429 error response.
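If you do want per-URL feedback, the client library's batch object accepts a callback that gets invoked once for each sub-request. Here is a minimal sketch of how update_batch_url could be adapted; batch_callback is just a name I'm introducing for illustration:

def batch_callback(request_id, response, exception):
    # Called once per request in the batch; exception is None on success.
    if exception is not None:
        print(f"Request {request_id} failed: {exception}")
    else:
        print(response.get("urlNotificationMetadata"))


def update_batch_url(batch, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        # Registering the callback gives us one response (or error) per URL.
        batch_request = service.new_batch_http_request(callback=batch_callback)
        for url in batch:
            callback = {
                "url": url,
                "type": "URL_UPDATED"
            }
            batch_request.add(service.urlNotifications().publish(body=callback))
        batch_request.execute()
    except HttpError as err:
        print(err)
    time.sleep(0.5)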
Alright, now we move to the fun stuff and start making this script actually usable.
As mentioned previously in this article, we can also notify Google that a URL has been deleted, which can be useful in some cases. To do that, all we need to do is change the callback's "type" from URL_UPDATED to URL_DELETED.
It's an easy update:
def update_single_url(url, req_type, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": req_type
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        print(update_single.get("urlNotificationMetadata"))
    except HttpError as err:
        print(err)
    time.sleep(0.5)
All we did here was add another argument, req_type, and use it as the notification type.
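So, for example, asking Google to drop a removed page would now look like this (the path here is just an illustration):

update_single_url("https://sentienttechnology.pro/old-page/", "URL_DELETED", credentials)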
Next, we're gonna use the argparse library so we can run the program without hardcoding values. All of that will take place in the main block. Let's also add some error handling in case the arguments are off, so the program stops before sending a malformed API request.
from argparse import ArgumentParser
Then we update the main block, adding all the parsing logic.
if __name__ == '__main__':
    credentials = authenticate()

    parser = ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['single', 'batch'], type=str,
                        help='single or batch')
    parser.add_argument('--url', type=str, help='URL to be updated')
    parser.add_argument('--type', choices=['update', 'delete'], type=str, help='Type of update')
    parser.add_argument('--batch_file', type=str, help='Path to txt file with batch of URLs')
    args = parser.parse_args()

    if args.mode == 'single':
        try:
            if args.type is None:
                print('Please provide a type of update')
            elif args.url is None:
                print('Please provide a URL to update')
            else:
                if args.type == 'delete':
                    req_type = 'URL_DELETED'
                else:
                    req_type = 'URL_UPDATED'
                update_single_url(args.url, req_type, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")

    if args.mode == 'batch':
        try:
            if args.batch_file is None:
                print('Please provide a path to the batch file')
            else:
                with open(args.batch_file, 'r') as file:
                    # Strip newlines so we don't send malformed URLs.
                    batch = [line.strip() for line in file if line.strip()]
                update_batch_url(batch, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")
python main.py --mode single --url https://sentienttechnology.pro/ --type update
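Batch mode works the same way; urls.txt is just an example filename for a plain text file with one URL per line:

python main.py --mode batch --batch_file urls.txt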
Now that we have the basic API functions built, we can move on to automation. It will be a polling app that checks the sitemap periodically. If changes are detected, the new URLs are sent to the API using the update_single_url() function.
This is a simple approach that needs no backend integration, which is a plus. The downsides are a small delay (depending on the polling period you set) and the fact that it depends on the sitemap being functional (whether you list sitemaps directly or discover them via robots.txt). We'll write a bit of code to handle errors, but it's important to keep an eye on the logs the program creates.
Additionally, while sitemaps follow the same general format, implementations differ, so you may need to adjust the parsing.
So let's add a function that checks sitemaps and, if an update is detected, pushes an indexing request through. At this point, we're also gonna adhere to better coding practices and use logging instead of print() while the program is running.
Let's begin by importing logging and adding configuration for logs:
import logging

if not os.path.exists('logs/'):
    os.makedirs('logs')

formatter = logging.Formatter('%(asctime)s | %(levelname)s | %(message)s')

info_logger = logging.getLogger('info_logger')
info_logger.setLevel(logging.INFO)
info_log_file_handler = logging.FileHandler('logs/info.log')
info_log_file_handler.setFormatter(formatter)
info_logger.addHandler(info_log_file_handler)

error_logger = logging.getLogger('error_logger')
error_logger.setLevel(logging.ERROR)
error_log_file_handler = logging.FileHandler('logs/error.log')
error_log_file_handler.setFormatter(formatter)
error_logger.addHandler(error_log_file_handler)

error_log_console_handler = logging.StreamHandler()
error_log_console_handler.setLevel(logging.ERROR)
error_log_console_handler.setFormatter(formatter)
error_logger.addHandler(error_log_console_handler)
Now you can change all the print() calls in the code to info and error logs. You're free to do it however you want, but here is an example:
def update_single_url(url, req_type, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": req_type
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        info_logger.info(update_single.get("urlNotificationMetadata"))
    except HttpError as err:
        error_logger.error(err)
    time.sleep(0.5)
Feel free to play around with the formatting, whatever works best for your own process. In general, though, I recommend keeping log messages concise and informative so they stay useful (and don't fill up your hard drive).
We're gonna need to install some extra libraries here first to fetch and process the sitemap:
pip install requests beautifulsoup4 lxml pandas
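The function below assumes these are imported at the top of the script (urlparse comes from the standard library):

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse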
Now we create a slightly more complex function that does several things. On the first run, it fetches the sitemap and saves it to a CSV file. On subsequent runs, it compares the fresh sitemap against the saved one and requests indexing for URLs whose lastmod date has changed.
def check_sitemap_for_changes(sitemap_url, creds):
    def parse_sitemap_to_dataframe(sitemap_content):
        # The 'loc' column stores each URL's lastmod date.
        soup = BeautifulSoup(sitemap_content, 'xml')
        url_elements = soup.find_all('url')
        urls = []
        last_updated = []
        for url_element in url_elements:
            loc = url_element.find('loc').text.strip()
            urls.append(loc)
            lastmod = url_element.find('lastmod')
            if lastmod:
                last_updated.append(lastmod.text.strip())
            else:
                last_updated.append(None)
        data = {'URL': urls, 'loc': last_updated}
        return pd.DataFrame(data)

    def fetch_sitemap(url):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.text
            else:
                error_logger.error(f"Failed to fetch sitemap. Status code: {response.status_code}")
                return None
        except Exception as e:
            error_logger.error(e)
            return None

    def save_dataframe_to_csv(dataframe, file_name):
        dataframe.to_csv(file_name, index=False)

    domain = urlparse(sitemap_url).netloc
    csv_file_name = f"sitemaps/{domain}_sitemap.csv"
    sitemap_content = fetch_sitemap(sitemap_url)

    if sitemap_content:
        main_dataframe = parse_sitemap_to_dataframe(sitemap_content)

        # If this is a sitemap index, fetch and parse every sub-sitemap as well.
        sitemapindex_elements = BeautifulSoup(sitemap_content, 'xml').find_all('sitemapindex')
        if sitemapindex_elements:
            for sitemapindex_element in sitemapindex_elements:
                sub_sitemap_elements = sitemapindex_element.find_all('loc')
                for sub_sitemap_element in sub_sitemap_elements:
                    sub_sitemap_url = sub_sitemap_element.text.strip()
                    sub_sitemap_content = fetch_sitemap(sub_sitemap_url)
                    if sub_sitemap_content:
                        sub_dataframe = parse_sitemap_to_dataframe(sub_sitemap_content)
                        main_dataframe = pd.concat([main_dataframe, sub_dataframe], ignore_index=True)

        if os.path.exists(csv_file_name):
            # Compare the fresh sitemap against the previously saved snapshot.
            existing_df = pd.read_csv(csv_file_name)
            main_dataframe.drop_duplicates(subset='URL', keep='first', inplace=True)
            existing_df.drop_duplicates(subset='URL', keep='first', inplace=True)
            merged_df = pd.merge(
                main_dataframe[['URL', 'loc']],
                existing_df[['URL', 'loc']],
                on='URL',
                suffixes=('_new', '_old')
            )
            # Only submit URLs whose lastmod actually changed; skip rows where both dates are missing.
            changed_rows = merged_df[
                merged_df['loc_new'].notna() & (merged_df['loc_old'] != merged_df['loc_new'])
            ]
            if not changed_rows.empty:
                for index, row in changed_rows.iterrows():
                    update_single_url(row['URL'], 'URL_UPDATED', creds)
            else:
                info_logger.info(f"No changes detected in {sitemap_url}")

        # Save the latest snapshot (this also creates the CSV on the first run).
        if not os.path.exists('sitemaps'):
            os.makedirs('sitemaps')
        save_dataframe_to_csv(main_dataframe, csv_file_name)
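Before wiring this into the main block, you can sanity-check it with a temporary call like this (the sitemap URL is just an example, point it at your own):

if __name__ == '__main__':
    credentials = authenticate()
    check_sitemap_for_changes('https://sentienttechnology.pro/sitemap.xml', credentials)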
Now we update the main block with a new mode to run the autopilot. We still use print() for the manual actions since it's simpler and doesn't inflate the logs, but we also write the important events to the log. Again, feel free to modify it to your preferences.
if __name__ == '__main__':
    credentials = authenticate()
    polling_period = 10  # seconds between sitemap checks

    parser = ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['single', 'batch', 'autopilot'], type=str,
                        help='single, batch or autopilot')
    parser.add_argument('--followed-sitemaps', type=str, help='Path to txt file with followed sitemaps')
    parser.add_argument('--url', type=str, help='URL to be updated')
    parser.add_argument('--type', choices=['update', 'delete'], type=str, help='Type of update')
    parser.add_argument('--batch-file', type=str, help='Path to txt file with batch of URLs')
    args = parser.parse_args()

    if args.mode == 'single':
        info_logger.info("Single mode started")
        try:
            if args.type is None:
                print('Please provide a type of update')
            elif args.url is None:
                print('Please provide a URL to update')
            else:
                if args.type == 'delete':
                    req_type = 'URL_DELETED'
                else:
                    req_type = 'URL_UPDATED'
                update_single_url(args.url, req_type, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)

    if args.mode == 'batch':
        info_logger.info("Batch mode started")
        try:
            if args.batch_file is None:
                print('Please provide a path to the batch file')
            else:
                with open(args.batch_file, 'r') as file:
                    # Strip newlines so we don't send malformed URLs.
                    batch = [line.strip() for line in file if line.strip()]
                update_batch_url(batch, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)

    if args.mode == 'autopilot':
        info_logger.info("Autopilot mode started")
        try:
            if args.followed_sitemaps is None:
                print('Please provide a path to the followed sitemaps file')
            else:
                while True:
                    with open(args.followed_sitemaps, 'r') as file:
                        followed_sitemaps = [line.strip() for line in file if line.strip()]
                    for sitemap in followed_sitemaps:
                        check_sitemap_for_changes(sitemap, credentials)
                    time.sleep(polling_period)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)
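Running the autopilot then looks like this, where sitemaps.txt is just an example filename for a text file containing one sitemap URL per line:

python main.py --mode autopilot --followed-sitemaps sitemaps.txt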
And that's it!
This program has a lot of room for extension: you could scope it to only consider certain URLs (see the small example below), or add a feature that automatically requests deletion for removed URLs. I would suggest keeping deletions manual, though; if there is a problem with a sub-sitemap, you don't want the program requesting deletion of URLs you want to keep.
One thing to note is that the default daily quota is only 200 requests. So if you run a set of sites that together publish close to or more than that many URLs per day, you will run into the limit.
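On the scoping idea: one way to limit the autopilot to a section of the site is to filter the changed rows inside check_sitemap_for_changes() before submitting them. A minimal sketch, where the '/articles/' pattern is just an illustration:

import pandas as pd

def scope_changed_rows(changed_rows: pd.DataFrame, pattern: str) -> pd.DataFrame:
    # Keep only changed URLs whose path matches the pattern, e.g. '/articles/'.
    return changed_rows[changed_rows['URL'].str.contains(pattern)]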
It takes time and effort to create these articles and share my work, so I've uploaded this code to a private GitHub repository.
To get access to it, leave a like and share the post about this article that I've published on LinkedIn. Drop a comment saying "Shared" and I will message you within a couple of days to get you access.
You can contact me on LinkedIn or through the contact form