How To Kinda Automate A Technical SEO Review Using DeepCrawl & Python


Posted by BigEyeGuy on 2020-08-06 19:37:13

I'm a big fan of DeepCrawl, but it has one issue that I hate: for each issue area it lists, you have to click into the report and download a separate CSV.

If you're an agency, these are the issues that make up your technical audit, and if you're in-house, this is the info you need to hand to web developers so they can fix them. Ideally, you want a fast way to get all of these issues out of DeepCrawl and into a single spreadsheet, split across tabs.

I've therefore written a small script (below) using the wonderful Pandas library, which works as follows: rather than clicking through to each issue page in DeepCrawl, you only need to open the 'All Pages' section and download that one report (as an .xls file; convert it first if it downloads in another format).

This report should be kept in the same folder as your script, and is referenced in the script under the 'Location' variable.
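If your 'All Pages' export ends up as a CSV rather than an .xls file, you can skip the conversion step entirely and point Pandas at the CSV directly. A minimal sketch, assuming the file is called INPUT.csv and sits in the same folder as the script:

import pandas as pd

# hypothetical CSV export -- adjust the filename to whatever DeepCrawl gives you
Location = r"/Users/name/auto-techaudit/INPUT.csv"
df = pd.read_csv(Location)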

For now the script will output the below issue areas into separate tabs in an xlsx file:

  • all meta data for pages with a 200 HTTP code
  • all broken links
  • all pages sorted by shortest content length (I usually delete everything with more than 300 words)
  • all non-indexed pages
  • all non-301 redirected pages
  • pages in the xml sitemap that shouldn't be
  • pages that should be in the xml sitemap but aren't
  • number of backlinks and referring domains by page
  • all redirect chains (URLs with three or more redirects)

You could expand this to cover other issue areas as well, but these examples should give you enough to get started; a sketch of one extra tab follows below. ScreamingFrog also has a full export feature, so the script could easily be adapted to run from that instead (there's a rough sketch of this after the script).
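For example, a tab for 200 pages with a missing page title would only take a few extra lines inside the autoTechAudit() function defined in the script below. This is just a sketch; the 'page_title' column name matches what the script already uses, but check it against your own export:

    # missing page titles (example extra issue area -- paste inside autoTechAudit)
    missingtitlesdf = dataframe.filter(['url', 'http_status_code', 'page_title'], axis=1)
    is_200_code = missingtitlesdf['http_status_code']==200
    missingtitlesdf2 = missingtitlesdf[is_200_code]
    has_no_title = missingtitlesdf2['page_title'].isna()
    missingtitlesdf3 = missingtitlesdf2[has_no_title]
    missingtitlesdf3.to_excel(writer, sheet_name='Missing Page Titles')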



import pandas as pd

#DOWNLOAD MUST BE ALL PAGES FILE FROM DEEPCRAWL

Location = r"/Users/name/auto-techaudit/INPUT.xls"
df = pd.read_excel(Location)
writer = pd.ExcelWriter(r"/Users/name/auto-techaudit/OUTPUT.xlsx", engine='xlsxwriter')


def autoTechAudit(dataframe, writer):

    # all meta data
    metadatadf = dataframe.filter(['url', 'page_title', 'description', 'http_status_code', 'h1_tag'], axis=1)
    is_200_code = metadatadf['http_status_code']==200
    metadatadf2 = metadatadf[is_200_code]
    metadatadf2.to_excel(writer, sheet_name = 'Meta Data')
    # end all meta data

    # broken links & 404 errors
    brokenlinksdf = dataframe.filter(['url', 'http_status_code', 'found_at_url', 'found_in_sitemap', 'found_in_webcrawl', 'found_in_google_search_console', 'found_in_backlinks'], axis=1)
    is_url_a_404 = brokenlinksdf['http_status_code']==404
    brokenlinksdf2 = brokenlinksdf[is_url_a_404]
    brokenlinksdf2.to_excel(writer, sheet_name = "Broken Links & 404s")
    # end broken links & 404 errors

    # thin pages
    thinpagesdf = dataframe.filter(['url', 'http_status_code', 'canonical_url', 'word_count'], axis=1)
    is_200_code = thinpagesdf['http_status_code']==200
    thinpagesdf2 = thinpagesdf[is_200_code]
    thinpagesdf3 = thinpagesdf2.sort_values(by=['word_count'])
    thinpagesdf3.to_excel(writer, sheet_name = 'Thin Pages')
    # end thin pages

    # Not indexed pages
    notindexeddf = dataframe.filter(['url', 'http_status_code', 'indexable', 'canonical_url', 'redirected_to_url', 'robots_noindex'], axis=1)
    is_indexable_false = notindexeddf['indexable']==0
    notindexeddf2 = notindexeddf[is_indexable_false]
    is_200_code = notindexeddf2['http_status_code']==200
    notindexeddf3 = notindexeddf2[is_200_code]
    notindexeddf3.to_excel(writer, sheet_name='Not Indexed Pages')
    # End not indexed pages

    # non 301 redirects
    non301redirectsdf = dataframe.filter(['url', 'http_status_code', 'found_in_sitemap', 'redirected_to_url'], axis=1)
    nice_http_codes = [200, 301]
    non301redirectsdf2 = non301redirectsdf[~non301redirectsdf.http_status_code.isin(nice_http_codes)]
    non301redirectsdf2.to_excel(writer, sheet_name = 'Non 301 Redirects')
    # end non 301 redirects

    # url is in sitemap but does not return a 200 http status code
    sitemap_non200statusdf = dataframe.filter(['url', 'http_status_code', 'found_in_sitemap', 'redirected_to_url'], axis=1)
    is_not_200 = sitemap_non200statusdf['http_status_code']!=200
    sitemap_non200statusdf2 = sitemap_non200statusdf[is_not_200]
    is_found_in_sitemap = sitemap_non200statusdf2['found_in_sitemap']==1
    sitemap_non200statusdf3 = sitemap_non200statusdf2[is_found_in_sitemap]
    sitemap_non200statusdf3.to_excel(writer, sheet_name = 'Sitemap Non 200 Status')
    # end url is in sitemap but does not return a 200 http status code

    # url has 200 http code but is not in sitemap
    urls_not_in_sitemap_df = dataframe.filter(['url', 'http_status_code', 'found_in_sitemap', 'canonical_url'], axis=1)
    is_200_code = urls_not_in_sitemap_df['http_status_code']==200
    urls_not_in_sitemap_df2 = urls_not_in_sitemap_df[is_200_code]
    not_found_in_sitemap = urls_not_in_sitemap_df2['found_in_sitemap']==0
    urls_not_in_sitemap_df3 = urls_not_in_sitemap_df2[not_found_in_sitemap]
    urls_not_in_sitemap_df3.to_excel(writer, sheet_name='200 page not in sitemap')
    # end url has 200 http code but is not in sitemap

    # backlinks and ref domains
    linksdf = dataframe.filter(['url', 'http_status_code', 'backlink_count', 'backlink_domain_count'])
    linksdf2 = linksdf.sort_values(by=['backlink_domain_count'], ascending = False)
    linksdf2.to_excel(writer, sheet_name="Link Profile")
    # end backlinks and ref domains

    # redirect chains
    chainsdf = dataframe.filter(['url', 'redirect_count', 'redirect_chain'])
    morethan2redirects = chainsdf['redirect_count'] >=3
    chainsdf2 = chainsdf[morethan2redirects]
    chainsdf2.to_excel(writer, sheet_name="Redirect Chains")
    # end redirect chains

    # write all the sheets out to disk (close() also works on older pandas, where save() still exists)
    writer.close()

autoTechAudit(df, writer)
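To adapt this for a ScreamingFrog export, the main job is mapping ScreamingFrog's column names onto the ones the script expects. A rough sketch, where the file name and column names are assumptions you should check against your own export:

import pandas as pd

# read a ScreamingFrog internal export instead of the DeepCrawl 'All Pages' file
# (file name and column names below are assumptions -- check them against your export)
sf = pd.read_csv(r"/Users/name/auto-techaudit/internal_all.csv")

# rename ScreamingFrog's columns to the names used throughout the script
sf = sf.rename(columns={
    'Address': 'url',
    'Status Code': 'http_status_code',
    'Title 1': 'page_title',
    'Meta Description 1': 'description',
    'Word Count': 'word_count',
})

Columns the DeepCrawl export has but ScreamingFrog doesn't (found_in_sitemap, indexable, backlink counts and so on) would either need to be pulled in from other reports or have their sections trimmed out of autoTechAudit() before the whole thing runs cleanly.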


