Using Python3 to scrape xpath values for thousands of URLs


Posted by BigEyeGuy on 2019-04-01 19:47:06

This post talks through a python3 script that does three things:

  1. Access a Google Sheet containing a column with thousands of URLs.

  2. For each of these URLs, return a value from a defined xpath.

  3. Write the value to the cell next to each corresponding URL in the same Google Sheet.


import requests
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from lxml import html
import time

start_time = time.time()

scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)

client = gspread.authorize(creds)

As shown above, the first thing we do is import the libraries we need and set up OAuth2 for authentication. If you need help with that, see this link:

https://gspread.readthedocs.io/en/latest/oauth2.html

If you run this and get an error about gspread.authorize(creds) needing to use bytes and not str, upgrade your version of httplib2 using the 'pip install --upgrade httplib2' command.
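
If you want a quick sanity check that the credentials are working before going any further, you can list the spreadsheets the service account has access to. This snippet is just an optional check on my part, not part of the script itself:

# Optional: list every spreadsheet the service account can see,
# to confirm the OAuth2 credentials are working
for spreadsheet in client.openall():
    print(spreadsheet.title)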


sh = client.open('ExampleSheetName')
worksheet = sh.get_worksheet(0)
url_list = worksheet.col_values(4)
cell_list = worksheet.range('C1:C8325')

Then we define the name of the Google Sheet we're working with and which worksheet (tab) to use, in this case the first tab, index zero.

We also define which column to pull our URLs from (the fourth) and the range of cells we'll write our values to (cell_list).
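
Before running the full job, it can be worth a quick optional sanity check (my addition, not part of the original script) that the number of URLs doesn't exceed the number of cells we plan to write into:

# Optional sanity check: the URL column should not be longer than the target range
print(len(url_list), "URLs,", len(cell_list), "target cells")
if len(url_list) > len(cell_list):
    print("Warning: more URLs than target cells - widen the C1:C8325 range")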


def grab_xpathtext(urls, cell_range):
    statuses = []
    for url in urls:
        try:
            r = requests.get(url)
            tree = html.fromstring(r.content)
            adGroupStatus = tree.xpath('//*[@id="m_count"]/text()')
            if len(adGroupStatus) == 0:
                statuses.append('Not Found')
            else:
                statuses.append(adGroupStatus[0])
        except Exception:
            # If the request or parsing fails, keep the list aligned with the URLs
            statuses.append('Not Found')
    # Pair each status with the cell next to its URL, then push one batch update
    for cell, val in zip(cell_range, statuses):
        cell.value = val
    worksheet.update_cells(cell_range)

grab_xpathtext(url_list, cell_list)
print("Program took", time.time() - start_time, "to run")

Now for our function. We start with an empty list, and the first parameter takes the 'url_list' variable. For each URL in that list, we use the requests and lxml libraries to pull the value at our xpath. We've done a manual check, so we know the xpath is the same for every URL.

Each value gets appended to our 'statuses' list, but some URLs return 404 HTTP status codes, so the xpath won't exist on those pages. We get around this by appending 'Not Found' in place of the xpath value for those URLs (and for any request that fails outright).
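
If you want to test the xpath against a single page before looping over thousands of URLs, a minimal version of the same logic looks like this (the URL here is just a placeholder, swap in one of your own):

# Quick one-off test of the xpath logic against a single URL
test_url = 'https://www.example.com/some-page'  # placeholder URL
r = requests.get(test_url)
tree = html.fromstring(r.content)
result = tree.xpath('//*[@id="m_count"]/text()')
print(result[0] if result else 'Not Found')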

We then loop over the cells, writing each value from our 'statuses' list into the cell next to its URL, push the whole batch to the sheet with update_cells, and call the function. Easy, and I've included a little print statement to show how long the script took to run.
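
The reason for building the whole list first and calling update_cells once is that it pushes all the values to the sheet in a single API call. For comparison only (don't do this for thousands of rows), the cell-by-cell alternative would look something like the snippet below and make one API call per cell:

# Slower alternative, shown for comparison only: one API call per cell
for i, val in enumerate(statuses):
    worksheet.update_cell(i + 1, 3, val)  # row i+1, column C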

To run this, use your terminal or command line to jump into the folder where the script lives, make sure you have Python 3 installed, and type 'python3' followed by the name of your script.

It's worth noting that in its current form this is very inefficient: every URL is fetched one at a time, and every value is held in memory until the single write at the end. The script will deal with a thousand or so URLs, but any more than that and we'd need to batch the requests somehow. I don't know how to do that properly yet (or even if that's the correct terminology), but maybe I can revisit this in a future post as I learn more.
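
As a rough sketch only (I haven't tested this against the sheet above, and the chunk size is an arbitrary guess), one direction might be to process the URLs in chunks and write each chunk's results back to the sheet before fetching the next one, so the 'statuses' list never grows to thousands of entries:

# Hypothetical sketch: process URLs in chunks of 100 and write each
# chunk back to the sheet before fetching the next one
CHUNK_SIZE = 100

for start in range(0, len(url_list), CHUNK_SIZE):
    chunk_urls = url_list[start:start + CHUNK_SIZE]
    chunk_cells = cell_list[start:start + CHUNK_SIZE]
    grab_xpathtext(chunk_urls, chunk_cells)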

