How to alter XML shopping feeds with Python


Posted by BigEyeGuy on 2018-10-03 20:27:49

Most Google XML shopping feeds include out of stock products that must be removed manually via a third party management tool. However, each time a feed is manually updated, the third party tool charges a fee.

We therefore needed a way of altering the live feed to remove all out of stock items. The benefit of doing so would be quicker feed processing times, less clouding of actual product counts and finally, easier access to large feeds.

In order to do this I wrote the below script, which I’ll take you through step by step. This is hosted via PythonAnywhere and does the following:

  • It scrapes a live XML shopping feed every day via a scheduled task and makes a copy.
  • It then serves this copy on the BigEyeGuy blog.
  • It then removes all items from this XML file that are out of stock.
  • Finally, it gives the person in charge of the feed the option to remove further items by specifying the product ID in a GoogleSheet.

I decided to use python’s ElementTree library because, having never manipulated XML data in the past, this was the first result that I found when Googling, and the documentation was immediately very helpful.

The requests library is used to scrape the live feed, and the gspread and oauth2client libraries are used to speak to the Google Drive API.

I firstly define the URL we want to scrape and then make a copy of that within the static folder of the BEG blog. To ensure the copy is complete before the rest of the script runs, I delay the script by 3 seconds.

  
import xml.etree.ElementTree as ET
import gspread
import requests
import time
from oauth2client.service_account import ServiceAccountCredentials

URL = "https://www.example.co.uk/Google-Products-Xml.php"
response = requests.get(URL)
with open('/home/bigeyeguy/mysite/static/example-feed.xml', 'wb') as file:
    file.write(response.content)

time.sleep(3)
  

We then use ElementTree to parse the copied XML file before finding the root of the tree. The parent map variable is a way of easily referring to parent nodes, which is something that ET cannot do as standard.

Since this particular XML file uses namespaces, we also need a variable, ns, to deal with this.

  
tree = ET.parse('/home/bigeyeguy/mysite/static/example-feed.xml')
root = tree.getroot()
parent_map = dict((c, p) for p in tree.getiterator() for c in p)
ns = {'BEG': 'http://base.google.com/ns/1.0'}
  

For info on how to interact with Google APIs using oauth2, and create a client_secret.json file, see my blog on using the Google Analytics API. Once you have this, I use the scope variable to house the APIs I want the script to interact with, and creds is used to refer to the client_secret json file. As with any Google API that uses oauth2, you’ll need to allow the script permission to access your specified Google account the first time you run it.

  
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
  

Now we can start to use gspread to manipulate a Google sheet. Define the client variable as below, and then either create a Google sheet manually within your Google drive, or do it as I have below in the commented lines. You need to open your Gsheet and then refer to the worksheet (tab) you’d like to use. The idea with this Gsheet is that the user will enter the item IDs into the first column and these will be removed from the feed every time the script runs. Therefore, we define this first column using the remove_ids variable.

  
client = gspread.authorize(creds)
#sh = client.create('REMOVE_IDS')
#sh.share('you@email.co.uk', perm_type='user', role='writer')
sh = client.open('REMOVE_IDS')
worksheet = sh.get_worksheet(0)
remove_ids = worksheet.col_values(1)
  

Now for the fun part. We need a way of referring first to the object of each item within the xml feed. Note that with this, we are not yet referencing the contents of ; only the object itself.

We can the loop through these objects and for each within the , we ask whether the id matches any defined within the first column of our Gsheet. If this is the case, we remove not only the , but the entire parent object (and all child nodes) using the parent_map variable we defined earlier. Note that this is where we use our namespace variable.

  
items = tree.findall('.//item')
for item in items:
    id = item.findall('BEG:id', ns)
    # again above is the object of id, not its contents. id is a list object
    for i in id:
        for cell in remove_ids:
            if i.text == cell:
                parent_map[item].remove(item)
  

Now that we’ve checked against and removed items with ids that match the Gsheet, we simply need to check the node (a child node of , and wherever this is equal to ‘Out of Stock’, we again remove the entire .

  
for item in items:
    avail = item.findall(BEG:availability', ns)
    for i in avail:
        if i.text == "Out of stock":
            parent_map[item].remove(item)
  

The final action is to rewrite our new XML tree, with any unwanted items removed, to the XML file, as below. This then sits as a clean XML file for Google to reference instead of the messy version on your client’s site.

  
tree.write('/home/bigeyeguy/mysite/static/example-feed.xml')
  

Posted by BigEyeGuy on 2018-10-03 20:27:49