Automate Indexation with Indexing API

Arto Kylmanen

Sept. 29, 2023


Indexation of pages is something SEOs are constantly interested in. Going through Search Console, clicking URLs one by one and waiting for a response is a time-waster.

And don't we all hate time-wasting?

Recently, Google also limited the number of manual requests to only 10 per 24 hours. This poses an actual blocker for larger sites and teams. We're gonna increase that limit twentyfold - to 200 URLs a day - while cutting the execution time to almost nothing.

Outcome

A Python program that submits URLs to Google's Indexing API. It will have two modes: automatic and manual.

Code will be available in this repository

Prerequisites

You'll need a Google Cloud project with the Indexing API enabled, an OAuth client whose client secret JSON you've downloaded from GCP, ownership of your site verified in Search Console, and Python 3 with pip.

Setup

Fire up your favorite Python IDE and create a new project in it.

Install Required Libraries
pip install google-api-python-client google-auth-oauthlib requests

Boilerplate

With APIs in general, I like writing simple boilerplate code first to check that the authentication flow works and that we end up with valid credentials.

Replace local/client_secret.json with the path to the OAuth client secret file you downloaded from GCP.

Indexation API boilerplate
from __future__ import print_function

import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ['https://www.googleapis.com/auth/indexing']


def authenticate():
    creds = None
    if os.path.exists('local/token.json'):
        creds = Credentials.from_authorized_user_file('local/token.json', SCOPES)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'local/client_secret.json',
                SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the credentials for the next run
        with open('local/token.json', 'w') as token:
            token.write(creds.to_json())
    return creds


if __name__ == '__main__':
    credentials = authenticate()
    print(credentials)

When you run this, it will open a browser window and ask you to complete the authentication flow. Finally, it should print a line that looks like this:

<google.oauth2.credentials.Credentials object at 0x7fe75c157b05>

Given that this works as expected, we can move on to the next step: passing a URL to the Indexing API and checking the response.

Indexing a URL

Now let's write another function whose purpose is to index a single URL.

First we import some additional libraries:

Additional imports
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import time

Next we add another function, update_single_url(url, creds). It takes the URL and credentials as arguments.

Function to index a single URL
def update_single_url(url, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": "URL_UPDATED"
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        print(update_single.get("urlNotificationMetadata"))
    except HttpError as err:
        print(err)
    time.sleep(0.5)

Now let's also modify our main loop a bit:

Main loop update
if __name__ == '__main__':
    credentials = authenticate()
    update_single_url("https://sentienttechnology.pro/", credentials)

Bulking it up

Right, so now you might think we can just write a for-loop, pass multiple URLs to this function and let it index them. However, there is another, faster and more resource-effective option I want to showcase. It uses a built-in feature of the client library called BatchHttpRequest, which lets us bundle several requests and send them all at once - up to 100 notifications collapse into a single network request. The service object we build has a method for this called new_batch_http_request().

Let's see how this would look in code.

update_batch_url()
def update_batch_url(batch, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        batch_request = service.new_batch_http_request()
        for url in batch:
            callback = {
                "url": url,
                "type": "URL_UPDATED"
            }
            batch_request.add(service.urlNotifications().publish(body=callback))
        batch_request.execute()
    except HttpError as err:
        print(err)
    time.sleep(0.5)

Then update the main loop to the following:

updated main loop
if __name__ == '__main__':
    credentials = authenticate()
    # update_single_url("https://sentienttechnology.pro/", credentials)
    # You could also parse a csv or txt file into a list here.
    url_batch = [
        'https://sentienttechnology.pro/',
        'https://sentienttechnology.pro/about/',
        'https://sentienttechnology.pro/articles/rethinking-lighthouse-performance/'
    ]
    update_batch_url(url_batch, credentials)

Now, if you run this, you will notice that nothing gets printed out. We can of course print the batch_request object itself, but it carries no response. However, as long as it runs without errors, we can consider the request successful. If you hit a rate limit, you will receive a 429 error response.
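
If you do want per-request feedback, new_batch_http_request() accepts a callback that the client library invokes once per request in the batch, with that request's individual response or exception. Here is a minimal sketch of that variant; print_batch_response and update_batch_url_verbose are just illustrative names:

Batch with a response callback (optional)
def print_batch_response(request_id, response, exception):
    # Called once for each request in the batch
    if exception is not None:
        print(f"Request {request_id} failed: {exception}")
    else:
        print(response.get("urlNotificationMetadata"))


def update_batch_url_verbose(batch, creds):
    service = build('indexing', 'v3', credentials=creds)
    batch_request = service.new_batch_http_request(callback=print_batch_response)
    for url in batch:
        batch_request.add(service.urlNotifications().publish(body={"url": url, "type": "URL_UPDATED"}))
    batch_request.execute()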

Adding features

Alright, now we move on to the fun stuff and start making this script actually usable.

The Indexing API can also notify Google that a URL has been deleted, which is useful in some cases. To do that, all we need to do is change the callback's "type" to URL_DELETED instead of URL_UPDATED.

It's an easy update:

Updated function
def update_single_url(url, req_type, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": req_type
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        print(update_single.get("urlNotificationMetadata"))
    except HttpError as err:
        print(err)
    time.sleep(0.5)

All we did here was add a req_type argument and set the notification type from that variable.

Second, we're gonna use the argparse library so we can run the program without hardcoding URLs. All of that will take place in the main function. Let's also add some error handling in case something is off with the arguments, so the program terminates before sending a malformed API request.

Import ArgumentParser
from argparse import ArgumentParser

Then we update the main loop, adding all the parsing logic.

Updated main loop
if __name__ == '__main__':
    credentials = authenticate()
    parser = ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['single', 'batch'], type=str, help='single or batch')
    parser.add_argument('--url', type=str, help='URL to be updated')
    parser.add_argument('--type', choices=['update', 'delete'], type=str, help='Type of update')
    parser.add_argument('--batch_file', type=str, help='Path to txt file with batch of URLs')
    args = parser.parse_args()
    if args.mode == 'single':
        try:
            if args.type is None:
                print('Please provide a type of update')
            elif args.url is None:
                print('Please provide a URL to update')
            else:
                if args.type == 'delete':
                    req_type = 'URL_DELETED'
                else:
                    req_type = 'URL_UPDATED'
                update_single_url(args.url, req_type, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")

    if args.mode == 'batch':
        try:
            if args.batch_file is None:
                print('Please provide a path to the batch file')
            else:
                with open(args.batch_file, 'r') as file:
                    # strip newlines so we send clean URLs to the API
                    batch = [line.strip() for line in file if line.strip()]
                    update_batch_url(batch, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")

Try it like this (with your URL):
python main.py --mode single --url https://sentienttechnology.pro/ --type update
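
Batch mode works the same way, assuming a file named urls.txt with one URL per line:
python main.py --mode batch --batch_file urls.txt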

Automation

Now that we have built the basic API functions, we can move on to automation. It will be a polling app that checks the sitemap periodically. If changes are detected, the new URLs will be sent to the API using the update_single_url() function.

This is a simple approach that does not need backend integration, which is a plus. The downsides are a small delay (depending on the polling interval you set) and the fact that it depends on the sitemap being reachable and well-formed. We'll write a bit of code to adjust for errors, but it's important to follow up on the logs the program creates.

Additionally, while sitemaps generally follow the same format, they can differ in the details, so you may need to adjust the parsing.

So let's add a function that checks sitemaps and, if an update is detected, pushes an indexing request through. At this point, we're also gonna adhere to better coding practices and use logging instead of print() while the program is running.

Let's begin by importing logging and adding configuration for logs:

Add logging
import logging

if not os.path.exists('logs/'):
    os.makedirs('logs')
formatter = logging.Formatter('%(asctime)s | %(levelname)s | %(message)s')

info_logger = logging.getLogger('info_logger')
info_logger.setLevel(logging.INFO)
info_log_file_handler = logging.FileHandler('logs/info.log')
info_log_file_handler.setFormatter(formatter)
info_logger.addHandler(info_log_file_handler)

error_logger = logging.getLogger('error_logger')
error_logger.setLevel(logging.ERROR)
error_log_file_handler = logging.FileHandler('logs/error.log')
error_log_file_handler.setFormatter(formatter)
error_logger.addHandler(error_log_file_handler)

error_log_console_handler = logging.StreamHandler()
error_log_console_handler.setLevel(logging.ERROR)
error_log_console_handler.setFormatter(formatter)
error_logger.addHandler(error_log_console_handler)

Now you can change all the print() calls in the code to info and error log calls. You're free to do it as you like, but here is an example:

updated update_single_url() with logging
def update_single_url(url, req_type, creds):
    try:
        service = build('indexing', 'v3', credentials=creds)
        callback = {
            "url": url,
            "type": req_type
        }
        update_single = service.urlNotifications().publish(body=callback).execute()
        info_logger.info(update_single.get("urlNotificationMetadata"))
    except HttpError as err:
        error_logger.error(err)
    time.sleep(0.5)

Feel free to play around with the formatting, whatever works best for your own process. In general though, I recommend keeping log messages concise and informative so they stay useful (and don't fill up your hard drive).

We're gonna need to install some extra libraries to fetch and process the sitemap:

Extra dependencies
pip install requests bs4 lxml pandas
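
The sitemap function below also relies on a few additional imports at the top of the script; these match the names the code uses:

Additional imports for sitemap handling
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse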

Now we create a slightly more complex function that does several things. On the first run, it fetches the sitemap (and any sub-sitemaps) and saves it to a CSV file. On subsequent runs, it compares the fresh sitemap against the saved one and requests indexing for URLs whose lastmod date has changed.

Fetch and parse sitemap, check for diff and ping Google
def check_sitemap_for_changes(sitemap_url, creds):
    def parse_sitemap_to_dataframe(sitemap_content):
        soup = BeautifulSoup(sitemap_content, 'xml')
        url_elements = soup.find_all('url')
        urls = []
        last_updated = []
        for url_element in url_elements:
            loc = url_element.find('loc').text.strip()
            urls.append(loc)
            lastmod = url_element.find('lastmod')
            if lastmod is not None:
                last_updated.append(lastmod.text.strip())
            else:
                last_updated.append(None)
        data = {'URL': urls, 'lastmod': last_updated}
        return pd.DataFrame(data)

    def fetch_sitemap(url):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.text
            else:
                error_logger.error(f"Failed to fetch sitemap. Status code: {response.status_code}")
                return None
        except Exception as e:
            error_logger.error(e)
            return None

    def save_dataframe_to_csv(dataframe, file_name):
        dataframe.to_csv(file_name, index=False)

    domain = urlparse(sitemap_url).netloc
    if not os.path.exists('sitemaps'):
        os.makedirs('sitemaps')
    csv_file_name = f"sitemaps/{domain}_sitemap.csv"
    sitemap_content = fetch_sitemap(sitemap_url)
    if sitemap_content:
        main_dataframe = parse_sitemap_to_dataframe(sitemap_content)
        # If this is a sitemap index, fetch and parse every sub-sitemap as well
        sitemapindex_elements = BeautifulSoup(sitemap_content, 'xml').find_all('sitemapindex')
        if sitemapindex_elements:
            for sitemapindex_element in sitemapindex_elements:
                sub_sitemap_elements = sitemapindex_element.find_all('loc')
                for sub_sitemap_element in sub_sitemap_elements:
                    sub_sitemap_url = sub_sitemap_element.text.strip()
                    sub_sitemap_content = fetch_sitemap(sub_sitemap_url)
                    if sub_sitemap_content:
                        sub_dataframe = parse_sitemap_to_dataframe(sub_sitemap_content)
                        main_dataframe = pd.concat([main_dataframe, sub_dataframe], ignore_index=True)

        main_dataframe.drop_duplicates(subset='URL', keep='first', inplace=True)
        if os.path.exists(csv_file_name):
            existing_df = pd.read_csv(csv_file_name)
            existing_df.drop_duplicates(subset='URL', keep='first', inplace=True)
            # Left join so URLs that are new in the current sitemap are included too
            merged_df = pd.merge(
                main_dataframe[['URL', 'lastmod']],
                existing_df[['URL', 'lastmod']],
                on='URL', how='left', suffixes=('_new', '_old')
            )
            changed_rows = merged_df[merged_df['lastmod_old'] != merged_df['lastmod_new']]
            if not changed_rows.empty:
                for index, row in changed_rows.iterrows():
                    update_single_url(row['URL'], 'URL_UPDATED', creds)
                # Persist the new snapshot once all notifications have been sent
                save_dataframe_to_csv(main_dataframe, csv_file_name)
            else:
                info_logger.info(f"No changes detected in {sitemap_url}")
        else:
            # First run for this sitemap: save a baseline to compare against next time
            save_dataframe_to_csv(main_dataframe, csv_file_name)
            info_logger.info(f"Saved initial snapshot of {sitemap_url}")

Now we update the main loop with a new command to run the autopilot. We still use the print function here for manual actions, as it's easier and doesn't inflate the logs, but we also write some things into the log. Again, feel free to modify it to your preferences.

Updated main loop
if __name__ == '__main__':
    credentials = authenticate()
    polling_period = 10
    parser = ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['single', 'batch', 'autopilot'],
                        type=str, help='single or batch')
    parser.add_argument('--followed-sitemaps', type=str, help='Path to txt file with sitemap URLs to follow')
    parser.add_argument('--url', type=str, help='URL to be updated')
    parser.add_argument('--type', choices=['update', 'delete'], type=str, help='Type of update')
    parser.add_argument('--batch-file', type=str, help='Path to txt file with batch of URLs')
    args = parser.parse_args()
    if args.mode == 'single':
        info_logger.info("Single mode started")
        try:
            if args.type is None:
                print('Please provide a type of update')
            elif args.url is None:
                print('Please provide a URL to update')
            else:
                if args.type == 'delete':
                    req_type = 'URL_DELETED'
                else:
                    req_type = 'URL_UPDATED'
                update_single_url(args.url, req_type, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)
    if args.mode == 'batch':
        info_logger.info("Batch mode started")
        try:
            if args.batch_file is None:
                print('Please provide a path to the batch file')
            else:
                with open(args.batch_file, 'r') as file:
                    # strip newlines so we send clean URLs to the API
                    batch = [line.strip() for line in file if line.strip()]
                    update_batch_url(batch, credentials)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)
    if args.mode == 'autopilot':
        info_logger.info("Autopilot mode started")
        try:
            if args.followed_sitemaps is None:
                print('Please provide a path to the followed domains file')
            else:
                while True:
                    with open(args.followed_sitemaps, 'r') as file:
                        # strip newlines so each line is a clean sitemap URL
                        followed_sitemaps = [line.strip() for line in file if line.strip()]
                        for sitemap in followed_sitemaps:
                            check_sitemap_for_changes(sitemap, credentials)
                    time.sleep(polling_period)
        except Exception as e:
            print(f"An error occurred: {e}")
            error_logger.error(e)
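
To run it on autopilot, assuming a file named sitemaps.txt with one sitemap URL per line:
python main.py --mode autopilot --followed-sitemaps sitemaps.txt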

And that's it!

Afterword

This program has a lot of potential for extension - you can scope it to only consider certain URLs, or add a feature that automatically requests deletion for removed URLs. However, I would suggest keeping deletions manual - that way, if there is a problem with some sub-sitemap, you won't end up with the program requesting deletion of URLs you want to keep.
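
As a minimal sketch of the scoping idea, you could filter the changed rows before notifying; filter_changed_rows and the prefix are hypothetical, and you would call it on changed_rows inside check_sitemap_for_changes():

Hypothetical scoping helper
def filter_changed_rows(changed_rows, prefix):
    # Keep only rows whose URL starts with the given prefix,
    # e.g. 'https://sentienttechnology.pro/articles/'
    return changed_rows[changed_rows['URL'].str.startswith(prefix)]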

One thing to note is that the default daily quota is only 200 publish requests per project. So if you run a set of sites that together come close to or exceed that limit, you will run into issues.

Get access to the code

It takes time and effort to create these articles and share my work. I've uploaded this code to a private GitHub repository.

To gain access to it, leave a like and share the post about this article that I've published on LinkedIn. Drop a comment "Shared" and I will message you within a couple of days to get you access.

You can contact me on LinkedIn or through the contact form.
