#EDIT: now has URLs since mobile has trouble with redirect links
omnificent-orion · 2 years
The Quest For a New Computer
The 2010 Dell Inspiron desktop I migrated to a few years back and have been using since is still chugging along, but it is well past showing its age. It’s clear that my art career has hit an unmovable wall because of technological limitations, so now I have no real options besides saving enough money to build a new machine. Here’s a post about how you can help me out.
I’m currently available for several different types of commissions:
At ko-fi.com/loxocosm/ you can get sketch style commissions
At inprnt.com/gallery/loxocosm/ you can get prints on-demand
At patreon.com/loxocosm/ my $1 tier has several years of backlog of full resolution images, process shots, and commentary
And finally, reblogging this post is the easiest way to help me: it costs nothing, and it fills me with feelings of hope! Thank you so much for sharing.
digital-strategy · 7 years
http://ift.tt/2nIeskH
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The variety of combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag header response (and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console), the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
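As a rough illustration of the query-string case above, here is a minimal Python sketch. The function name and example URLs are hypothetical, and the comparison is deliberately simple (it does not reconcile trailing slashes or http vs. https), so treat it as a starting point rather than a complete indexability check.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalizes_elsewhere(url: str, canonical: str) -> bool:
    """Return True when the rel="canonical" target differs from the URL itself."""
    def normalize(u: str) -> str:
        parts = urlsplit(u.strip())
        # Lowercase the scheme and host and drop the fragment; keep path and query.
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, parts.query, ""))
    return normalize(url) != normalize(canonical)


# Hypothetical example: a sorted category view that points at its base page.
print(canonicalizes_elsewhere(
    "https://example.com/widgets/?sort=color",
    "https://example.com/widgets/",
))
```

A URL for which this returns True can be left out of the content audit inventory.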
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (GSC), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
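If you would rather script the combine, deduplicate, and filter steps than do them by hand in Excel, a minimal sketch might look like the following. The row keys (`status`, `noindex`, `blocked_by_robots`) are assumptions for illustration and will not match the column names in a real crawler export.

```python
def merge_url_lists(*sources):
    """Combine URL lists from several sources, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def keep_indexable(crawl_rows):
    """Drop rows that are noindexed, blocked by robots.txt, or not 200 OK."""
    return [
        row["url"] for row in crawl_rows
        if row["status"] == 200
        and not row["noindex"]
        and not row["blocked_by_robots"]
    ]


crawler = ["https://example.com/", "https://example.com/a"]
analytics = ["https://example.com/a", "https://example.com/b "]
sitemap = ["https://example.com/c"]
candidates = merge_url_lists(crawler, analytics, sitemap)

# Stand-in for the results of crawling `candidates` in list mode.
crawl_rows = [
    {"url": "https://example.com/", "status": 200, "noindex": False, "blocked_by_robots": False},
    {"url": "https://example.com/a", "status": 200, "noindex": True, "blocked_by_robots": False},
    {"url": "https://example.com/b", "status": 404, "noindex": False, "blocked_by_robots": False},
    {"url": "https://example.com/c", "status": 200, "noindex": False, "blocked_by_robots": False},
]
indexable = keep_indexable(crawl_rows)
```

Deduplicating on the exact string is the simplest choice; if your sources mix trailing slashes or letter case, you would want to normalize the URLs first.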
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
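Most crawlers let you exclude URLs by pattern before the crawl starts. The sketch below shows the same idea in Python; the extension list and the function name are my own assumptions, so adjust them to the file types that actually appear on the site.

```python
import re
from typing import Optional

# Hypothetical pattern covering common image, CSS, JavaScript, and SWF extensions.
ASSET_PATTERN = re.compile(r"\.(?:png|jpe?g|gif|css|js|swf)(?:[?#].*)?$", re.IGNORECASE)


def should_skip(url: str, status: Optional[int] = None) -> bool:
    """Return True for URLs that rarely belong in a content audit."""
    if ASSET_PATTERN.search(url):
        return True
    # 3XX redirects plus 4XX and 5XX errors are not indexable content.
    if status is not None and status >= 300:
        return True
    return False
```

Noindexed, robots.txt-blocked, and canonicalized URLs are not covered here because they can only be identified after the page has been fetched.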
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
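To decide where to split, it can help to see how many known URLs sit under each top-level directory. Here is a minimal sketch, assuming you already have a URL list from a sitemap or database export; the function name is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_directory(urls):
    """Group URLs by their first path segment (e.g. /blog/, /products/)."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # Pages directly under the root (e.g. /about) go into the "/" bucket.
        in_directory = len(segments) > 1 or (bool(segments) and path.endswith("/"))
        key = f"/{segments[0]}/" if in_directory else "/"
        groups[key].append(url)
    return dict(groups)


urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/",
    "https://example.com/blog/post-1",
    "https://example.com/products/widget?color=red",
]
chunks = group_by_directory(urls)
```

Each key in `chunks` is then a candidate for its own crawl.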
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop versions of each page have parity when it comes to these essentials.
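Once both inventories exist, the parity check itself is easy to script. The following is a minimal sketch that assumes each crawl has been loaded into a dictionary keyed by URL; the field names and example values are hypothetical.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_parity_issues(desktop, mobile, fields=PARITY_FIELDS):
    """Compare two crawls keyed by URL and list the elements that differ."""
    issues = []
    # Only URLs present in both inventories can be compared.
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in fields:
            d, m = desktop[url].get(field), mobile[url].get(field)
            if d != m:
                issues.append((url, field, d, m))
    return issues


desktop = {"https://example.com/widgets/": {
    "title": "Blue Widgets | Brand", "canonical": "https://example.com/widgets/"}}
mobile = {"https://example.com/widgets/": {
    "title": "BRAND NAME - MOBILE SITE", "canonical": "https://example.com/widgets/"}}
issues = find_parity_issues(desktop, mobile)
```

URLs that appear in only one inventory are worth a separate look, since they are skipped by this comparison.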
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when JavaScript builds the page content itself, so the code needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
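The warning signs listed above can be roughly checked in code before you commit to a rendering mode. This is only a heuristic sketch: the 200-character threshold and the `{{ ... }}` placeholder pattern are assumptions of mine, and a pre-rendered site would pass these checks even though it is built with JavaScript.

```python
import re
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts text characters found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def javascript_warning_signs(url, html_source, min_text_chars=200):
    """Return a list of hints that a page depends on JavaScript rendering."""
    signs = []
    if "#!" in url:
        signs.append("hashbang URL")
    counter = _TextCounter()
    counter.feed(html_source)
    if counter.text_chars < min_text_chars and "<script" in html_source.lower():
        signs.append("little text in the raw source")
    # Unrendered template placeholders such as {{ title }} left in the markup.
    if re.search(r"\{\{.*?\}\}", html_source):
        signs.append("template placeholders in the source")
    return signs
```

Feed it the raw source (what "view source" shows), not the rendered DOM, since the point is to see what a non-rendering crawler would receive.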
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLOOKUPs are used in the spreadsheet to combine the data, with URL being the unique identifier.
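For anyone working outside of Excel, the same lookup can be sketched in a few lines of Python. The column names (`organic_sessions`, `revenue`) are hypothetical, and the sketch assumes both exports use the same URL format; analytics exports often list paths rather than full URLs, in which case one side has to be converted first.

```python
def join_on_url(crawl_rows, analytics_rows, default=0):
    """Attach analytics metrics to crawl rows, matching on the URL column."""
    lookup = {row["url"]: row for row in analytics_rows}
    combined = []
    for row in crawl_rows:
        metrics = lookup.get(row["url"], {})
        combined.append({
            **row,
            # URLs with no analytics row get the default instead of an error.
            "organic_sessions": metrics.get("organic_sessions", default),
            "revenue": metrics.get("revenue", default),
        })
    return combined


crawl_rows = [
    {"url": "https://example.com/a", "title": "Page A"},
    {"url": "https://example.com/b", "title": "Page B"},
]
analytics_rows = [
    {"url": "https://example.com/a", "organic_sessions": 120, "revenue": 45.0},
]
combined = join_on_url(crawl_rows, analytics_rows)
```

Filling missing pages with zero mirrors what most people do after a failed lookup in the spreadsheet, but it does hide the difference between "no traffic" and "no data."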
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they wanted because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in low pages-per-visit
Just like with time-on-site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel="canonical" tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
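The blank-column and single-value checks are tedious by hand on a wide export, so here is a minimal sketch of automating them. It assumes the sheet has been loaded as a list of dictionaries (for example with Python's csv.DictReader); the column names in the example are made up.

```python
def prune_columns(rows):
    """Remove columns that are entirely blank or hold one repeated value."""
    if not rows:
        return rows
    keep = []
    for column in rows[0]:
        # Compare values as stripped strings so "" and " " both count as blank.
        values = {str(row.get(column, "")).strip() for row in rows}
        if len(values) > 1:
            keep.append(column)
    return [{column: row.get(column, "") for column in keep} for row in rows]


rows = [
    {"url": "https://example.com/a", "status_code": 200, "robots_http_header": "", "word_count": 500},
    {"url": "https://example.com/b", "status_code": 200, "robots_http_header": "", "word_count": 120},
]
pruned = prune_columns(rows)
```

Since a single-value column can "usually" rather than always be removed, it is worth glancing at the list of dropped columns before discarding them, and running this only on a copy of the full data set.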
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on, the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the things described below and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
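The questions above can be turned into a rough first pass before the manual review. Below is a minimal sketch, assuming one record per URL. The field names and both thresholds are hypothetical and would need tuning for each site; anything the rules can't settle is sent back for a manual look.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; none of these numbers come from the audit process itself.
HIGH_VISITS = 500
LOW_CONVERSION_RATE = 0.005


@dataclass
class Page:
    url: str
    visits: int                 # organic entrances over the audit window
    conversion_rate: float      # conversions divided by visits
    external_links: int
    overlaps_with: str = ""     # URL of a stronger page on the same topic, if any
    good_content: bool = False  # outcome of a manual review


def suggest_action(page: Page) -> str:
    """Return a first-pass Action for one row of the dashboard."""
    if page.overlaps_with:
        return "Consolidate"
    if page.visits >= HIGH_VISITS and page.conversion_rate < LOW_CONVERSION_RATE:
        return "Improve"
    if page.visits == 0 and page.external_links == 0 and not page.good_content:
        return "Remove"
    if page.good_content:
        return "Leave As-Is"
    return "Review manually"
```

A suggestion like this only pre-fills the Action column. The Details still have to be written by someone who has looked at the page.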
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
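Hatchet approaches like these are essentially pattern rules, so they can be written down once and applied to every URL in the inventory. Here is a small sketch, assuming the audit works on URL paths. The patterns mirror three of the examples above; the function name and the rule table are illustrative only.

```python
import re

# (pattern, Action, Details), checked in order; the first match wins.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag; disallow /search/ in robots.txt once deindexed"),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag"),
]


def apply_hatchet(path):
    """Return (Action, Details) for a URL path, or None if no bulk rule applies."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None
```

URLs that return None are the ones left over for the scalpel.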
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
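The order in that last instruction matters: a crawler that is blocked by robots.txt can no longer see the noindex tag, so the Disallow line goes in only after the pages have dropped out of the index. Before shipping the rule, it is worth checking that it blocks what you expect and nothing else. This is a quick sketch using Python's standard-library robots.txt parser; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The rule as it would appear in robots.txt after deindexing is complete.
robots_txt = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Internal search results should be blocked; regular content should not be.
blocked = not parser.can_fetch("*", "https://www.example.com/search/green-widgets/")
allowed = parser.can_fetch("*", "https://www.example.com/blog/green-widgets/")
```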
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
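In Excel this is a pivot table or a handful of COUNTIF and SUMIF formulas. The same roll-up can be sketched in a few lines of Python. The rows and column names below are made up for illustration and stand in for the URL, Action, and metric columns of the dashboard.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic visits and revenue for each Action."""
    summary = defaultdict(lambda: {"urls": 0, "organic_visits": 0, "revenue": 0.0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["organic_visits"] += row["organic_visits"]
        bucket["revenue"] += row["revenue"]
    return dict(summary)


# Made-up rows standing in for the dashboard export.
dashboard = [
    {"url": "/blog/old-post/", "action": "Remove", "organic_visits": 0, "revenue": 0.0},
    {"url": "/blog/tag/widgets/", "action": "Remove", "organic_visits": 12, "revenue": 0.0},
    {"url": "/product/green-widget/", "action": "Improve", "organic_visits": 340, "revenue": 1250.0},
]
totals = summarize_actions(dashboard)
```

If the "Remove" bucket carries a meaningful share of organic traffic or revenue, that is a signal to revisit those rows before anything gets pruned.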
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content). These are prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
via SEOmoz Daily SEO Blog
robertmcraft · 7 years
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe Sitecatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The possible combinations of the “Action” (WHAT to do) and the “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, it tells Google not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
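The combine, deduplicate, crawl, and filter routine described above can be sketched as two small functions. This assumes the list-mode crawl has been exported to rows with a status code, a noindex flag, and a robots.txt-blocked flag. Those field names are hypothetical, so map them to whatever your crawler's export actually calls them.

```python
def merge_url_lists(*url_lists):
    """Combine several URL lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def keep_indexable(crawl_rows):
    """Filter list-mode crawl results down to URLs that look indexable."""
    return [
        row["url"]
        for row in crawl_rows
        if row["status_code"] == 200
        and not row["noindex"]
        and not row["blocked_by_robots"]
    ]
```

For example, merge_url_lists(ga_urls, sitemap_urls, database_urls) produces the list to crawl in list mode, and keep_indexable() trims the crawl results before they are added to the main inventory.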
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
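If you already have a URL list from analytics or sitemaps, grouping it by top-level directory is a quick way to see how big each chunk would be before deciding what to crawl separately. This is a rough sketch; the function name is my own, and it treats root-level files such as /about.html as part of the "/" segment.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def segment_by_directory(urls):
    """Group URLs by their first path segment, e.g. /blog/ or /product/."""
    segments = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        parts = [part for part in path.split("/") if part]
        # The homepage and root-level files (e.g. /about.html) go in the "/" segment.
        is_root_level = not parts or (len(parts) == 1 and not path.endswith("/"))
        key = "/" if is_root_level else f"/{parts[0]}/"
        segments[key].append(url)
    return dict(segments)
```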
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop versions of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
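Once both inventories exist, the parity check is a straightforward comparison keyed on URL. Below is a minimal sketch, assuming each crawl has been loaded into a dictionary of URL to on-page elements. The field names are illustrative and would need to match your crawl export.

```python
# On-page elements that should match between the desktop and mobile versions.
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_parity_issues(desktop, mobile):
    """Compare two inventories (URL -> dict of elements) and list mismatches.

    Returns tuples of (url, field, desktop_value, mobile_value).
    """
    issues = []
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in PARITY_FIELDS:
            desktop_value = desktop[url].get(field)
            mobile_value = mobile[url].get(field)
            if desktop_value != mobile_value:
                issues.append((url, field, desktop_value, mobile_value))
    return issues
```

URLs that appear in only one of the two inventories are skipped here. They are worth listing separately, since a page that exists for one device type but not the other is its own kind of problem.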
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
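For anyone working outside of Excel, the VLookup step is a join on URL. Here is a small sketch of the same idea, assuming the crawl data and the analytics export have both been loaded as lists of rows. The column names are hypothetical.

```python
def join_on_url(crawl_rows, analytics_rows):
    """Attach analytics metrics to crawl rows, using the URL as the lookup key.

    URLs with no analytics row get zeros, much like wrapping a VLookup in IFERROR.
    """
    analytics_by_url = {row["url"]: row for row in analytics_rows}
    joined = []
    for row in crawl_rows:
        match = analytics_by_url.get(row["url"], {})
        joined.append({
            **row,
            "organic_visits": match.get("organic_visits", 0),
            "revenue": match.get("revenue", 0.0),
        })
    return joined
```

The join only works when both sources write URLs the same way. Analytics exports often list paths while crawlers list full URLs, so the keys may need to be normalized to one format first.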
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
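The response code and canonical tag rules of thumb above can be expressed as a simple row filter. This is only a sketch under stated assumptions: the column names ("URL", "Status Code", "Canonical", "Meta Robots") are hypothetical and will differ by tool and export settings, and treating a missing canonical tag as acceptable is my assumption, not a rule from this guide.

```python
def is_indexable(row):
    """Rough indexability check for one crawl row (a dict).

    Rules of thumb: the URL returns 200, any canonical tag points
    to the URL itself, and there is no noindex directive.
    """
    if str(row.get("Status Code", "")) != "200":
        return False
    canonical = row.get("Canonical", "").strip()
    # Assumption: an absent canonical tag does not disqualify the URL.
    if canonical and canonical != row["URL"]:
        return False
    if "noindex" in row.get("Meta Robots", "").lower():
        return False
    return True


# Hypothetical rows for illustration.
self_canonical = {"URL": "https://example.com/a", "Status Code": "200",
                  "Canonical": "https://example.com/a", "Meta Robots": ""}
parameter_url = {"URL": "https://example.com/a?sort=color", "Status Code": "200",
                 "Canonical": "https://example.com/a", "Meta Robots": ""}
```

A filter like this is a first pass only. Edge cases such as canonical tags that differ only by a trailing slash or protocol still deserve a manual look.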
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
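If the data lives in a CSV export rather than an open spreadsheet, the blank-column and single-value-column checks can be sketched in a few lines. The function name and sample column names below are hypothetical. The function only reports candidates instead of deleting anything, since single-value columns can "usually" (not always) be removed.

```python
def find_uninformative_columns(rows):
    """Report columns that are blank in every row or hold one value.

    `rows` is a list of dicts (e.g. from csv.DictReader). Returns a
    dict mapping column name to "blank" or "single value". Only
    meaningful when there is more than one row.
    """
    candidates = {}
    if not rows:
        return candidates
    for column in rows[0]:
        values = {str(row.get(column, "")).strip() for row in rows}
        if values == {""}:
            candidates[column] = "blank"
        elif len(values) == 1:
            candidates[column] = "single value"
    return candidates


# Hypothetical rows: one blank column, one single-value column.
sample_rows = [
    {"URL": "/a", "Robots HTTP Header": "", "Protocol": "http"},
    {"URL": "/b", "Robots HTTP Header": "", "Protocol": "http"},
]
report = find_uninformative_columns(sample_rows)
```

Reviewing the report before deleting keeps you from dropping a column whose single value is itself a finding, like the missing HTTPS URLs mentioned above.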
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or doesn't deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return a 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
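Several of the hatchet examples above depend only on the URL pattern, so they can be applied in bulk. Here is a rough sketch, assuming you work with URL paths. The rule list, regular expressions, and function name are illustrative assumptions, not a fixed recipe. Product pages are left out because that decision also depends on the duplicate content Risk Score, and the category rule assumes none of the category pages have content.

```python
import re

# (pattern, Action, Details) rules, checked in order; first match wins.
# Patterns are hypothetical and should be adapted to the site's URLs.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag; disallow in robots.txt once deindexed"),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag; disallow in robots.txt once deindexed"),
    (re.compile(r"^/category/"), "Improve",
     "Write 2-3 sentences of unique, useful content"),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None
```

Anything that returns None is left for the scalpel: those URLs still need a look at their metrics before an Action is chosen. Defer to the technical audit before applying any of these rules.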
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
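Before shipping a robots.txt change like the Disallow: /search/ line above, it can help to confirm which URLs the rule actually blocks. The sketch below uses Python's standard urllib.robotparser for that. The robots.txt content and example.com URLs are made up for illustration, and Python's parser may not match Google's handling of every directive. The order of operations matters because a crawler that is blocked from a page cannot fetch it to see the noindex tag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, applied only after the noindexed
# pages have dropped out of the index.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(url, user_agent="*"):
    """True if the parsed robots.txt rules disallow crawling this URL."""
    return not parser.can_fetch(user_agent, url)


blocked = is_blocked("https://example.com/search/keyword-phrase/")
allowed = not is_blocked("https://example.com/blog/some-post/")
```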
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
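The same tally can be scripted if the dashboard is exported to CSV. This is a minimal sketch: the column names ("Action", "Organic Sessions", "Revenue") are assumptions about how the export is labeled, and blank cells are treated as zero.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions and revenue per Action."""
    summary = defaultdict(
        lambda: {"urls": 0, "organic_sessions": 0, "revenue": 0.0}
    )
    for row in rows:
        bucket = summary[row["Action"]]
        bucket["urls"] += 1
        # Blank or missing cells count as zero.
        bucket["organic_sessions"] += int(row.get("Organic Sessions") or 0)
        bucket["revenue"] += float(row.get("Revenue") or 0)
    return dict(summary)


# Hypothetical dashboard rows.
dashboard = [
    {"URL": "/a", "Action": "Remove", "Organic Sessions": "3", "Revenue": "0"},
    {"URL": "/b", "Action": "Remove", "Organic Sessions": "7", "Revenue": "12.5"},
    {"URL": "/c", "Action": "Improve", "Organic Sessions": "100", "Revenue": "250"},
]
totals = summarize_actions(dashboard)
```

A large traffic or revenue total under "Remove" is a signal to go back and re-check those rows before anything is pruned.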
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
identityshine · 7 years
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, content audits are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or an X-Robots-Tag response header –– even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console –– the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with a robots meta noindex tag or an X-Robots-Tag noindex header, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
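The combine-and-deduplicate step above can be sketched as follows. The function names and example.com URLs are hypothetical. The sketch only drops exact duplicates after trimming whitespace, so variants such as trailing slashes or http vs. https are treated as different URLs unless you normalize them first.

```python
def combine_url_lists(*url_lists):
    """Merge URL lists from several sources, dropping exact duplicates
    while keeping first-seen order."""
    seen = set()
    combined = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                combined.append(url)
    return combined


def missing_from_crawl(crawled, *other_sources):
    """URLs known to other sources that the crawler never found.
    These are the candidates to crawl separately in list mode."""
    crawled_set = {url.strip() for url in crawled}
    return [u for u in combine_url_lists(*other_sources) if u not in crawled_set]


# Hypothetical URL lists from the crawl, an XML sitemap, and GA.
crawl = ["https://example.com/", "https://example.com/a"]
sitemap = ["https://example.com/a", "https://example.com/orphan"]
ga = ["https://example.com/orphan ", "https://example.com/b"]
to_list_crawl = missing_from_crawl(crawl, sitemap, ga)
```

The URLs returned by missing_from_crawl are also a quick way to spot orphaned pages, since the crawler had no internal link path to reach them.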
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without allocating more RAM, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases (one for mobile and one for desktop) but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
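The comparison of the desktop and mobile inventories described above can be sketched in a few lines. This is a minimal example, assuming each inventory has already been loaded into a dictionary keyed by URL. The field names and the find_parity_issues helper are hypothetical, not part of any tool mentioned here.

```python
# Hypothetical field names; each inventory maps URL -> on-page elements
# captured with either a desktop or a mobile User-Agent.
PARITY_FIELDS = ("title", "description", "canonical", "meta_robots")

def find_parity_issues(desktop, mobile):
    """Yield (url, field, desktop_value, mobile_value) for every mismatch."""
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in PARITY_FIELDS:
            d_val = desktop[url].get(field, "")
            m_val = mobile[url].get(field, "")
            if d_val != m_val:
                yield (url, field, d_val, m_val)

desktop = {"/widgets/": {"title": "Green Widgets | Brand", "meta_robots": "index,follow"}}
mobile = {"/widgets/": {"title": "BRAND NAME - MOBILE SITE", "meta_robots": "index,follow"}}

issues = list(find_parity_issues(desktop, mobile))
for url, field, d_val, m_val in issues:
    print(f"{url}: {field} differs (desktop={d_val!r}, mobile={m_val!r})")
```

URLs that appear in only one of the two inventories are ignored by this sketch. They are worth listing separately, since a page missing from one version is a parity problem of its own.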
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
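Of the warning signs listed above, the hashbang is the easiest one to check programmatically. The sketch below scans a list of URLs for #! fragments. The has_hashbang helper and the sample URLs are made up for illustration, and the check says nothing about the other signs, which still need a manual look at the page source.

```python
from urllib.parse import urlsplit

def has_hashbang(url):
    """True when the URL contains '#!' (its fragment starts with '!')."""
    return urlsplit(url).fragment.startswith("!")

sample_urls = [
    "http://example.com/#!/products/1",
    "http://example.com/about/",
    "http://example.com/page#section",
]
hashbang_urls = [url for url in sample_urls if has_hashbang(url)]
print(hashbang_urls)
```

If a list like this comes back non-empty, that is a hint to try the “Old AJAX Crawling Scheme” setting mentioned earlier for those URLs.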
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
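For anyone who prefers a script to VLookups, the same join on URL can be sketched as follows. The field names (url, organic_sessions, revenue) and the vlookup_join helper are assumptions for illustration. Real exports would typically be read in with something like csv.DictReader first.

```python
def vlookup_join(crawl_rows, ga_rows, key="url"):
    """Left-join analytics metrics onto crawl rows, matching on the URL."""
    ga_by_url = {row[key]: row for row in ga_rows}
    combined = []
    for row in crawl_rows:
        merged = dict(row)
        # URLs missing from the analytics export get zeroed metrics,
        # much like wrapping a VLookup in IFERROR(..., 0).
        match = ga_by_url.get(row[key], {})
        merged["organic_sessions"] = match.get("organic_sessions", 0)
        merged["revenue"] = match.get("revenue", 0.0)
        combined.append(merged)
    return combined

crawl_rows = [{"url": "/a/", "title": "A"}, {"url": "/b/", "title": "B"}]
ga_rows = [{"url": "/a/", "organic_sessions": 120, "revenue": 45.5}]

combined = vlookup_join(crawl_rows, ga_rows)
for row in combined:
    print(row)
```

One thing to watch for: analytics exports often report page paths rather than full URLs, so the key may need to be normalized on one side before the join will match.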
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
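The blank-column and single-value-column checks described above are tedious to do by hand on a wide export, so here is one way they might be scripted. It is a sketch under stated assumptions: rows are dictionaries loaded from the combined export, and the column names and the prune_columns helper are hypothetical.

```python
def prune_columns(rows):
    """Drop columns that are blank everywhere or hold one repeated value."""
    if not rows:
        return []
    keep = []
    for col in rows[0].keys():
        values = {row.get(col, "") for row in rows}
        # Blank-only and single-value columns both have one distinct value.
        if len(values) > 1:
            keep.append(col)
    return [{col: row.get(col, "") for col in keep} for row in rows]

rows = [
    {"url": "/a/", "word_count": 350, "robots_http_header": "", "https": "No"},
    {"url": "/b/", "word_count": 1200, "robots_http_header": "", "https": "No"},
]
pruned = prune_columns(rows)
print(list(pruned[0].keys()))
```

A single-value column can still be worth a glance before it disappears (the HTTPS column above is a good example), so it may be safer to print the dropped column names and review them rather than discard them silently.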
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
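Splitting the URLs by directory, as suggested above, could look something like this. The split_by_directory helper and the “(root)” label are made up for the example. Each resulting group could then be written out to its own smaller spreadsheet.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def split_by_directory(urls):
    """Group URLs by their first path segment."""
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # URLs with no path segment (e.g., the home page) go into "(root)".
        top = segments[0] if segments else "(root)"
        groups[top].append(url)
    return dict(groups)

urls = [
    "https://example.com/",
    "https://example.com/blog/post-1/",
    "https://example.com/blog/post-2/",
    "https://example.com/product/widget/",
]
chunks = split_by_directory(urls)
for directory, members in chunks.items():
    print(directory, len(members))
```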
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or doesn't deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
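Some of the sorting questions above can be turned into a first-pass suggestion that a human then reviews. The sketch below is only that: the thresholds, the field names, and the suggest_action helper are all illustrative assumptions, and the “Review manually” result is a placeholder rather than one of the dashboard Actions.

```python
def suggest_action(page, min_visits=10, good_visits=500, good_conv_rate=0.02):
    """First-pass Action suggestion; all thresholds are illustrative."""
    visits = page["organic_visits"]
    conv_rate = page["conversions"] / visits if visits else 0.0
    # No links, no shares, and almost no traffic: a pruning candidate.
    if page["external_links"] == 0 and page["social_shares"] == 0 and visits < min_visits:
        return "Remove"
    # Plenty of entrances but weak conversion: worth improving.
    if visits >= good_visits and conv_rate < good_conv_rate:
        return "Improve"
    # Good traffic and good conversion: leave it alone.
    if visits >= good_visits:
        return "Leave As-Is"
    return "Review manually"

dead_page = {"organic_visits": 0, "conversions": 0, "external_links": 0, "social_shares": 0}
busy_page = {"organic_visits": 1000, "conversions": 5, "external_links": 3, "social_shares": 10}
star_page = {"organic_visits": 1000, "conversions": 40, "external_links": 3, "social_shares": 10}
odd_page = {"organic_visits": 100, "conversions": 1, "external_links": 2, "social_shares": 0}

for page in (dead_page, busy_page, star_page, odd_page):
    print(suggest_action(page))
```

Anything that falls through to the placeholder, along with every key page, still calls for the manual review described above. Metrics alone can't tell you whether content was stolen from you or duplicated by you.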
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
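A hatchet approach is essentially a list of URL patterns, each with an Action and Details applied in bulk. Here is a minimal sketch of that idea using the example paths above. The regular expressions, the shortened Details strings, and the apply_hatchet helper are assumptions to adapt, not a prescribed rule set.

```python
import re

# (pattern, Action, Details) tuples, checked in order.
# Paths mirror the examples above; adjust them to the site being audited.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/category/"), "Improve",
     "Write 2-3 sentences of unique, useful content"),
]

def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None

for path in ("/widgets/?sort=color", "/blog/tag/green-widgets/", "/about/"):
    print(path, apply_hatchet(path))
```

Paths that return None are the ones left over for the scalpel. Because rules are checked in order, a parameterized category URL would be caught by the first rule rather than the last, which seems like the safer default.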
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
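Before handing a robots.txt change like the Disallow: /search/ line above to a developer, it can be sanity-checked locally. The sketch below uses Python's standard urllib.robotparser module. The robots.txt content and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content containing the rule from the Details column.
robots_txt = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

search_allowed = parser.can_fetch("*", "https://example.com/search/green-widgets/")
product_allowed = parser.can_fetch("*", "https://example.com/product/green-widget/")
print("search crawlable:", search_allowed)
print("product crawlable:", product_allowed)
```

This only confirms what the rule blocks. The ordering in the instruction still matters: the noindex tag has to be crawled and honored first, because a disallowed page can no longer be fetched to see the tag.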
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
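The tally described above can be produced with a pivot table, or with a short script along these lines. The field names (action, organic_visits) and the summarize_actions helper are hypothetical stand-ins for whatever your dashboard columns are called.

```python
from collections import Counter, defaultdict

def summarize_actions(rows):
    """Count pages and total organic visits for each Action."""
    counts = Counter(row["action"] for row in rows)
    traffic = defaultdict(int)
    for row in rows:
        traffic[row["action"]] += row["organic_visits"]
    return {
        action: {"pages": counts[action], "organic_visits": traffic[action]}
        for action in counts
    }

dashboard = [
    {"url": "/old-post/", "action": "Remove", "organic_visits": 5},
    {"url": "/older-post/", "action": "Remove", "organic_visits": 0},
    {"url": "/product/widget/", "action": "Improve", "organic_visits": 300},
]
summary = summarize_actions(dashboard)
for action, totals in summary.items():
    print(action, totals)
```

The organic visits total for “Remove” is the number to check before pruning: if it is larger than expected, some of those pages probably deserve a second look.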
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. It summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
How to Do a Content Audit [Updated for 2017] posted first on http://ift.tt/2maTWEr
0 notes
ubizheroes · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It’s been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe Sitecatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really…
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The variety of combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag header response, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I’m not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the “supplemental index” of yesteryear. But that’s another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
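As a rough illustration of the query-string case above, the check below flags a URL whose canonical tag points somewhere else. It is only a sketch: the function name and the normalization rules (lowercasing scheme and host, treating an empty path as "/") are my own assumptions, not how any particular crawler decides this.

```python
from typing import Optional
from urllib.parse import urlsplit


def canonicalizes_elsewhere(url: str, canonical: Optional[str]) -> bool:
    """Return True when the rel="canonical" value points at a different URL."""
    if not canonical:
        return False  # no canonical tag found, so nothing to compare against

    def key(value: str):
        parts = urlsplit(value.strip())
        # Scheme and host are case-insensitive; path and query are not.
        return (parts.scheme.lower(), parts.netloc.lower(),
                parts.path or "/", parts.query)

    return key(url) != key(canonical)


# A sorted listing that canonicalizes to the version without the query string.
with_query = canonicalizes_elsewhere(
    "https://example.com/shoes?sort=price", "https://example.com/shoes")
# A self-referencing canonical that differs only in host capitalization.
self_referencing = canonicalizes_elsewhere(
    "https://Example.com/shoes", "https://example.com/shoes")
```

URLs for which this returns True could be left out of the content audit inventory and handled in the technical audit instead.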
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you’re having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
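If you would rather script the combine, deduplicate, and filter routine than do it by hand in a spreadsheet, it might look something like the sketch below. The field names (`status`, `noindex`, `blocked_by_robots`) are hypothetical stand-ins for whatever columns your list-mode crawl export actually contains.

```python
def merge_url_lists(*url_lists):
    """Combine URL lists, dropping blanks and exact duplicates (first seen wins)."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def keep_indexable(crawl_rows):
    """Keep URLs that returned 200, are not noindexed, and are not robots.txt-blocked."""
    return [
        row["url"]
        for row in crawl_rows
        if row["status"] == 200
        and not row["noindex"]
        and not row["blocked_by_robots"]
    ]


# Hypothetical URL lists from analytics and an XML sitemap.
from_analytics = ["https://example.com/a/", "https://example.com/b/"]
from_sitemap = ["https://example.com/b/", "https://example.com/c/", " "]
to_crawl = merge_url_lists(from_analytics, from_sitemap)

# Hypothetical results of crawling `to_crawl` in list mode.
crawl_rows = [
    {"url": "https://example.com/a/", "status": 200, "noindex": False, "blocked_by_robots": False},
    {"url": "https://example.com/b/", "status": 404, "noindex": False, "blocked_by_robots": False},
    {"url": "https://example.com/c/", "status": 200, "noindex": True, "blocked_by_robots": False},
]
indexable = keep_indexable(crawl_rows)
```

The surviving URLs can then be appended to the indexable URLs from the main crawl and run through `merge_url_lists` once more.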
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there’s a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure —> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you’re only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It’s vital that the mobile and desktop version of each page have parity when it comes to these essentials.
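Once you have both inventories, the parity check can be as simple as comparing a handful of fields URL by URL. The sketch below assumes each inventory has been loaded into a dictionary keyed by URL; the field names are hypothetical and should be swapped for the column names in your own exports.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_mismatches(desktop, mobile, fields=PARITY_FIELDS):
    """Return {url: [fields that differ]} for URLs present in both inventories."""
    mismatches = {}
    for url in desktop.keys() & mobile.keys():
        differing = [
            field for field in fields
            if desktop[url].get(field) != mobile[url].get(field)
        ]
        if differing:
            mismatches[url] = differing
    return mismatches


# Hypothetical inventories from a desktop crawl and a mobile User-Agent crawl.
desktop = {
    "/widgets/": {"title": "Green Widgets", "description": "Buy widgets",
                  "canonical": "/widgets/", "robots": "index,follow"},
}
mobile = {
    "/widgets/": {"title": "BRAND NAME - MOBILE SITE", "description": "Buy widgets",
                  "canonical": "/widgets/", "robots": "index,follow"},
}
problems = find_mismatches(desktop, mobile)
```

URLs that appear in only one inventory are skipped here on purpose; those are worth listing separately, since they point to a different kind of problem.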
It’s easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it’s not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME – MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: example.com/page#!key=value (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
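Two of the warning signs listed above can be checked roughly in code against the raw (unrendered) HTML. This is a heuristic sketch only. In particular, the double-curly-brace pattern is my assumption about what unrendered template code in a title or meta tag tends to look like; other frameworks use different syntax, so treat a clean result as inconclusive.

```python
import re

# Assumption: unrendered placeholders look like {{ something }}.
TEMPLATE_PLACEHOLDER = re.compile(r"\{\{.*?\}\}")
HEAD_TAGS = re.compile(r"<title>.*?</title>|<meta\b[^>]*>", re.IGNORECASE | re.DOTALL)


def javascript_warning_signs(url, raw_html):
    """Return a list of reasons to suspect the page needs JavaScript rendering."""
    signs = []
    if "#!" in url:
        signs.append("hashbang URL")
    for tag in HEAD_TAGS.findall(raw_html):
        if TEMPLATE_PLACEHOLDER.search(tag):
            signs.append("unrendered template placeholder in head tags")
            break
    return signs


suspicious = javascript_warning_signs(
    "example.com/page#!key=value", "<title>{{page.title}}</title>")
clean = javascript_warning_signs(
    "https://example.com/page",
    "<title>Widgets</title><meta name='description' content='All about widgets'>")
```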
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We’ve noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
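The VLookup step has a straightforward scripted equivalent if the spreadsheet is struggling: index the analytics export by URL, then look each crawled URL up in it. The sketch below is one way to do that under my own assumptions; the column names are hypothetical, and URLs missing from the analytics export get the default values you pass in.

```python
def join_on_url(crawl_rows, analytics_rows, defaults):
    """Add analytics columns to each crawl row, matching on the 'url' key."""
    analytics_by_url = {row["url"]: row for row in analytics_rows}
    combined = []
    for row in crawl_rows:
        match = analytics_by_url.get(row["url"], {})
        merged = dict(row)  # copy so the original crawl rows stay untouched
        for column, default in defaults.items():
            merged[column] = match.get(column, default)
        combined.append(merged)
    return combined


# Hypothetical crawl and Google Analytics exports.
crawl_rows = [
    {"url": "/widgets/", "title": "Green Widgets"},
    {"url": "/gadgets/", "title": "Blue Gadgets"},
]
analytics_rows = [{"url": "/widgets/", "organic_visits": 120, "revenue": 340.0}]

combined = join_on_url(
    crawl_rows, analytics_rows, defaults={"organic_visits": 0, "revenue": 0.0})
```

As with VLookups, this only works if the URL is formatted identically in both sources, so check for differences such as trailing slashes or full URLs versus paths before joining.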
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors leave quickly because the content was good and they found what they wanted, a low time-on-site isn’t a bad sign. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that’s the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that’s the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs
Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type
Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I’ve done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status
You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels
You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords
Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD
You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns
You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns
Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns
If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
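The blank-column and single-value-column checks are tedious to do by eye on a wide spreadsheet, so here is a small sketch of how they could be automated on an exported copy of the data. It assumes the rows have been loaded as a list of dictionaries (for example with Python’s csv.DictReader), and it drops any column that holds the same value, blank or otherwise, in every row. Since single-value columns can only “usually” be removed, review the list of dropped columns before discarding anything.

```python
def prune_columns(rows):
    """Drop columns whose value is identical (or blank) in every row.

    Returns the slimmed-down rows and the list of dropped column names.
    """
    if not rows:
        return rows, []
    keep, dropped = [], []
    for column in rows[0].keys():
        distinct = {str(row.get(column, "")).strip() for row in rows}
        # One distinct value means the column is all blank or single-valued.
        (keep if len(distinct) > 1 else dropped).append(column)
    slim = [{column: row[column] for column in keep} for row in rows]
    return slim, dropped


# Hypothetical rows from the combined data tab.
rows = [
    {"url": "/a/", "https": "No", "robots_http_header": "", "word_count": 250},
    {"url": "/b/", "https": "No", "robots_http_header": "", "word_count": 900},
]
slim_rows, dropped_columns = prune_columns(rows)
```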
Hopefully by now you’ve made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here’s where the fun really begins. In a large organization, it’s tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it’s ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the things described below and not others. You may do them a little differently. That’s all fine, as long as you’re working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate…
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a “linkbait” piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries for which content doesn’t answer or deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., “No content is here yet, but if you sign in and leave some user-generated-content, then we’ll have content here for the next guy.” By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth…
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting “Improve” in the Action column, elaborate in the Details column:
“Improve these pages by writing unique, useful content to improve the Copyscape risk score.”
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn’t worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting “Remove” from the Action column, elaborate in the Details column:
“Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap.”
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as “Improve”?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as “Consolidate”?
When you have overlapping topics that don’t provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as “Improve” and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as “Consolidate” and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single “Evergreen” landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 —> Best Sellers).
Which of these pages should be marked as “Remove”?
Pages with poor link, traffic, and social metrics related to low-quality content that isn’t worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn’t worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as “Leave As-Is”?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
Taking the hatchet to bloated websites
For big sites, it’s best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you’ll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn’t be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn’t suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn’t suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn’t suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the “Page Type” is known (i.e., it’s in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn’t suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the “Page Type” is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or…
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to “Remove” and customize the Details based on links and traffic (i.e., 301 redirect or 404).
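Several of the hatchet approaches above boil down to “if the URL matches this pattern, apply this Action and these Details.” For a very large site, that can be scripted instead of filtered by hand. The sketch below covers only the purely pattern-based examples; the regular expressions and the abbreviated Details text are illustrative assumptions, and anything that depends on metrics (such as the duplicate content Risk Score) still needs its own logic or a manual pass.

```python
import re

# (pattern, Action, Details) in priority order; the first match wins.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag, then disallow /search/ in robots.txt once removed from the index"),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag"),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, or None if no rule applies."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None


sorted_listing = apply_hatchet("/shoes/?sort=color")
tag_page = apply_hatchet("/blog/tag/green-widgets/")
untouched = apply_hatchet("/about/")
```

URLs that come back as None are the ones left for the scalpel.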
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”… Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
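Before handing off a robots.txt recommendation like the one above, it can be worth confirming that the rule blocks what you expect and nothing else. One way to do that is with the robots.txt parser in Python’s standard library, as sketched below. The example URLs are hypothetical, and keep in mind that this parser does simple prefix matching, so it may not behave exactly like Googlebot on more complex rules (wildcards, for instance).

```python
from urllib.robotparser import RobotFileParser

# The proposed rule, as it would appear in the robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows crawling of the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


search_blocked = is_blocked(ROBOTS_TXT, "https://example.com/search/green-widgets/")
product_blocked = is_blocked(ROBOTS_TXT, "https://example.com/products/green-widget/")
```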
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
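If the dashboard has been exported from the spreadsheet, the counting can be sketched in a few lines of Python. This is only an illustration: the column names (action, organic_sessions, organic_revenue) and the "Keep" label are assumptions, not part of any tool's export format.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and sum organic sessions and revenue for each Action."""
    summary = defaultdict(lambda: {"urls": 0, "sessions": 0, "revenue": 0.0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["sessions"] += row.get("organic_sessions", 0)
        bucket["revenue"] += row.get("organic_revenue", 0.0)
    return dict(summary)


# Made-up rows standing in for the dashboard export.
rows = [
    {"url": "/a", "action": "Remove", "organic_sessions": 3, "organic_revenue": 0.0},
    {"url": "/b", "action": "Remove", "organic_sessions": 0, "organic_revenue": 0.0},
    {"url": "/c", "action": "Improve", "organic_sessions": 120, "organic_revenue": 45.5},
    {"url": "/d", "action": "Keep", "organic_sessions": 900, "organic_revenue": 310.0},
]

for action, totals in summarize_actions(rows).items():
    print(action, totals)
```

A pivot table in Excel does the same job; the point is simply to see how much traffic and revenue sits behind each Action before anything is pruned.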
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow’s content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 “Other” pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to “improve or remove” content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the “Details” column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the “Details” column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum: This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow: An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic: Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
from Moz Blog https://moz.com/blog/content-audit via IFTTT
from Blogger http://imlocalseo.blogspot.com/2017/03/how-to-do-content-audit-updated-for-2017.html via IFTTT
from IM Local SEO https://imlocalseo.wordpress.com/2017/03/22/how-to-do-a-content-audit-updated-for-2017/ via IFTTT
from Gana Dinero Colaborando | Wecon Project https://weconprojectspain.wordpress.com/2017/03/22/how-to-do-a-content-audit-updated-for-2017/ via IFTTT
from WordPress https://mrliberta.wordpress.com/2017/03/22/how-to-do-a-content-audit-updated-for-2017/ via IFTTT
nereomata · 7 years
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe Sitecatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, content audits are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of the “Action” (WHAT to do) and the “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag HTTP header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
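The “combine and deduplicate” step can be done in a spreadsheet, but for long lists a small script may be easier. Below is a minimal sketch, assuming each source has already been exported as a plain list of absolute URLs. The normalization rules (lowercasing the scheme and host, dropping the #fragment) are my own assumption about what counts as “the same URL” and should be adjusted per site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment so that
    trivial variants of the same URL compare as equal."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def merge_url_lists(*url_lists):
    """Combine several URL lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Made-up exports from the crawler, analytics, and an XML sitemap.
crawl = ["https://example.com/", "https://example.com/shoes"]
analytics = ["https://EXAMPLE.com/shoes", "https://example.com/orphaned-page"]
sitemap = ["https://example.com/shoes#reviews", "https://example.com/boots"]

for url in merge_url_lists(crawl, analytics, sitemap):
    print(url)
```

The merged list is what would then be fed to the crawler in list mode.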
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
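The list above can be expressed as a simple filter over crawl results. This is a sketch under assumptions: the CrawlRow fields are hypothetical names, not the column headers of any particular crawler's export, and the rules are the “typical” ones described in this guide rather than a complete definition of indexability.

```python
from dataclasses import dataclass

# Content types this sketch leaves out of a content audit inventory.
SKIPPED_TYPES = (
    "image/",
    "text/css",
    "text/javascript",
    "application/javascript",
    "application/x-shockwave-flash",
)


@dataclass
class CrawlRow:
    # Hypothetical field names for one row of crawl output.
    url: str
    status_code: int
    content_type: str = "text/html"
    meta_robots: str = ""
    canonical: str = ""
    blocked_by_robots_txt: bool = False


def is_indexable(row: CrawlRow) -> bool:
    """Apply the exclusions listed above to a single crawled URL."""
    if row.blocked_by_robots_txt:
        return False
    if row.status_code != 200:  # redirects, 4XX and 5XX errors
        return False
    if row.content_type.lower().startswith(SKIPPED_TYPES):
        return False
    if "noindex" in row.meta_robots.lower():
        return False
    if row.canonical and row.canonical != row.url:  # canonicalizes elsewhere
        return False
    return True


rows = [
    CrawlRow("https://example.com/", 200, canonical="https://example.com/"),
    CrawlRow("https://example.com/shoes?sort=price", 200, canonical="https://example.com/shoes"),
    CrawlRow("https://example.com/old", 301),
    CrawlRow("https://example.com/private", 200, meta_robots="NOINDEX, follow"),
    CrawlRow("https://example.com/logo.png", 200, content_type="image/png"),
]

indexable = [row.url for row in rows if is_indexable(row)]
print(indexable)
```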
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
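To decide where to split, it can help to know roughly how many URLs live in each top-level directory. The sketch below counts them from any seed list (for example, a sitemap or analytics export). The grouping rule, which uses the first path segment, is an assumption that suits sites with a conventional directory structure.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(url: str) -> str:
    """Return the first directory of the URL path, or "/" for root-level pages."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return "/"
    # A path like /about is a page at the root, not a directory.
    if len(segments) == 1 and not path.endswith("/"):
        return "/"
    return f"/{segments[0]}/"


def directory_sizes(urls):
    """Count how many URLs fall under each top-level directory."""
    return Counter(top_level_directory(url) for url in urls)


urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/",
    "https://example.com/blog/post-1",
    "https://example.com/products/shoes/red",
]

for directory, count in directory_sizes(urls).most_common():
    print(directory, count)
```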
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases (one for mobile and one for desktop) but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop versions of each page have parity when it comes to these essentials.
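Once both inventories exist, the comparison itself is mechanical. Here is a minimal sketch, assuming each version of a page has been reduced to a dictionary of its key elements. The dictionary keys are hypothetical names chosen for this example.

```python
# Elements that should match between the desktop and mobile versions of a URL.
ELEMENTS = ("title", "meta_description", "canonical", "meta_robots", "rel_next", "rel_prev")


def parity_issues(desktop: dict, mobile: dict, elements=ELEMENTS) -> dict:
    """Return {element: (desktop_value, mobile_value)} for each element that differs."""
    return {
        name: (desktop.get(name), mobile.get(name))
        for name in elements
        if desktop.get(name) != mobile.get(name)
    }


# Made-up data for one URL crawled with a desktop and a mobile User-Agent.
desktop = {
    "title": "Red Shoes | Brand",
    "canonical": "https://example.com/red-shoes",
    "meta_robots": "index,follow",
}
mobile = {
    "title": "BRAND NAME - MOBILE SITE",
    "canonical": "https://example.com/red-shoes",
    "meta_robots": "index,follow",
}

issues = parity_issues(desktop, mobile)
print(issues)
```

Running this over every URL and filtering for non-empty results gives a quick list of pages where the two versions disagree.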
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) it only enhances a page whose content is already present in the HTML sent by the server. An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when JavaScript needs to be executed in order to render the page’s content at all, so a crawler that doesn’t run JavaScript sees little or none of it.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: example.com/page#!key=value (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
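The warning signs listed earlier in this section can be turned into a rough automated check on the raw (unrendered) HTML of a page. This is a heuristic sketch only. The 50-word threshold is arbitrary, and the double-curly-brace pattern is just one common style of template placeholder that I'm assuming for illustration; other frameworks leave different traces.

```python
import re

HASHBANG = re.compile(r"#!")
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
# Assumed placeholder style, e.g. an unrendered {{ page.title }} variable.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def javascript_signals(url: str, raw_html: str, min_words: int = 50) -> list:
    """Return the warning signs found in a URL and its raw HTML source."""
    signals = []
    if HASHBANG.search(url):
        signals.append("hashbang URL")
    # Strip scripts, styles, and tags to approximate the text in the source.
    visible = TAG.sub(" ", SCRIPT_OR_STYLE.sub(" ", raw_html))
    if len(visible.split()) < min_words:
        signals.append("very little text in raw source")
    head = raw_html.split("</head>")[0]
    if PLACEHOLDER.search(head):
        signals.append("unrendered template placeholder in head")
    return signals


app_shell = (
    '<html><head><title>{{ page.title }}</title>'
    '<meta name="description" content="{{ page.description }}"></head>'
    '<body><div id="app"></div><script src="/bundle.js"></script></body></html>'
)
static_page = (
    "<html><head><title>Red Shoes</title></head><body><p>"
    + "word " * 60
    + "</p></body></html>"
)

print(javascript_signals("https://example.com/page#!key=value", app_shell))
print(javascript_signals("https://example.com/red-shoes", static_page))
```

Any URL that trips one or more of these checks is worth loading with JavaScript rendering enabled in the crawler and comparing against the raw source.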
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit of a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
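For anyone who prefers a script to VLookups, the same join can be sketched in Python. It assumes both exports identify pages with the same form of URL (analytics tools often report paths rather than full URLs, so some normalization may be needed first). The column names are made up for this example.

```python
def join_by_url(crawl_rows, analytics_rows, defaults=None):
    """Left-join analytics metrics onto crawl rows, using the URL as the key.

    Crawled URLs with no analytics row get the default values, much like
    wrapping a VLOOKUP in IFERROR.
    """
    defaults = defaults or {"organic_sessions": 0, "organic_revenue": 0.0}
    lookup = {row["url"]: row for row in analytics_rows}
    joined = []
    for row in crawl_rows:
        metrics = lookup.get(row["url"], {})
        merged = dict(row)
        for column, fallback in defaults.items():
            merged[column] = metrics.get(column, fallback)
        joined.append(merged)
    return joined


# Made-up exports from the crawler and from analytics.
crawl = [
    {"url": "https://example.com/", "title": "Home"},
    {"url": "https://example.com/old-post", "title": "Old post"},
]
analytics = [
    {"url": "https://example.com/", "organic_sessions": 1200, "organic_revenue": 310.0},
]

for row in join_by_url(crawl, analytics):
    print(row)
```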
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they wanted because the content was good, a short visit isn’t a bad sign. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel="canonical" tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
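Finding blank and single-value columns by hand gets tedious on wide exports. The sketch below does it for rows loaded as dictionaries of strings (for example, with Python's csv.DictReader). It drops every such column automatically, so treat the result as a list of candidates to review, since a single-value column can occasionally be worth keeping.

```python
def prune_columns(rows):
    """Drop columns whose cells are all blank or all share a single value."""
    if not rows:
        return rows
    keep = []
    for column in rows[0]:
        values = {(row.get(column) or "").strip() for row in rows}
        if len(values) > 1:  # more than one distinct value: worth keeping
            keep.append(column)
    return [{column: row.get(column) for column in keep} for row in rows]


# Made-up export with one blank column and one single-value column.
rows = [
    {"Address": "https://example.com/", "Status Code": "200", "Robots HTTP Header": "", "Word Count": "350"},
    {"Address": "https://example.com/a", "Status Code": "200", "Robots HTTP Header": "", "Word Count": "120"},
]

pruned = prune_columns(rows)
print(list(pruned[0].keys()))
```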
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
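If the dashboard passes through several hands, or is assembled outside Excel, the same restriction can be checked with a few lines of code. This is a sketch: “Improve” and “Remove” are the labels used in this guide, while “Keep” is my assumed label for content left as-is.

```python
# "Keep" is an assumed label for the third option (content kept as-is).
ALLOWED_ACTIONS = ("Keep", "Improve", "Remove")


def invalid_actions(rows, allowed=ALLOWED_ACTIONS):
    """Return (spreadsheet_row_number, value) pairs for disallowed Action values."""
    return [
        (number, row.get("Action", ""))
        for number, row in enumerate(rows, start=2)  # row 1 is the header row
        if row.get("Action", "") not in allowed
    ]


rows = [
    {"URL": "/a", "Action": "Remove"},
    {"URL": "/b", "Action": "remove"},  # wrong case
    {"URL": "/c", "Action": ""},        # not yet decided
]

print(invalid_actions(rows))
```

Keeping the values exact is what makes filtering, sorting, and counting by Action reliable later on.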
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on, the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the things described below and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or doesn't deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
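The questions above can be roughed out as a first-pass rule set. This is only a sketch under loud assumptions: the field names are hypothetical, the cutoffs are arbitrary placeholders, and the result is a suggestion to review by hand, not a decision.

```python
LOW_VISITS = 10         # assumed cutoff for "very few or no entrances/visits"
LOW_CONVERSION = 0.01   # assumed cutoff for "low conversion"


def suggest_action(page):
    """Suggest an Action for one dashboard row.

    `page` is a dict with hypothetical keys: visits, external_links,
    social_shares, conversion_rate, and content_ok (manual review result).
    """
    no_signals = (
        page["external_links"] == 0
        and page["social_shares"] == 0
        and page["visits"] <= LOW_VISITS
    )
    # No links, no shares, almost no traffic, and weak content: prune it.
    if no_signals and not page["content_ok"]:
        return "Remove"
    # Plenty of traffic but poor engagement: worth improving.
    if page["visits"] > LOW_VISITS and page["conversion_rate"] < LOW_CONVERSION:
        return "Improve"
    # Weak content that still has links, shares, or traffic: improve it.
    if not page["content_ok"]:
        return "Improve"
    return "Leave As-Is"


dead_page = {"visits": 0, "external_links": 0, "social_shares": 0,
             "conversion_rate": 0.0, "content_ok": False}
busy_page = {"visits": 5000, "external_links": 3, "social_shares": 10,
             "conversion_rate": 0.001, "content_ok": True}
```

Consolidation is left out on purpose, since spotting overlapping topics takes a human looking at the pages together.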
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
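For the hatchet cases that depend only on the URL pattern, the bulk assignment can be sketched in a few lines of Python. The rule list below mirrors three of the examples above; the function name and the exact regular expressions are my own assumptions, and the conditional cases (product, category, and dated pages) are left out because they need more than a URL to decide.

```python
import re

# (pattern, action, details). First match wins. Patterns are illustrative.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
]


def apply_hatchet(path_and_query):
    """Return (action, details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path_and_query):
            return action, details
    return None  # no rule matched: review this URL with the scalpel


result = apply_hatchet("/shoes/?sort=color")
```

URLs that return None are the ones left over for manual review.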
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
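The order matters in that last instruction: a crawler that is blocked by robots.txt can't fetch the page to see the noindex tag, which is why the Disallow line comes second. Once the line is added, a quick sanity check with Python's standard urllib.robotparser module might look like the sketch below. The robots.txt contents and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt contents for illustration.
ROBOTS_TXT = """\
User-agent: *
#BlockedDirectories
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(url, user_agent="*"):
    """True if the robots.txt rules above disallow crawling this URL."""
    return not parser.can_fetch(user_agent, url)


search_blocked = is_blocked("https://example.com/search/green-widgets/")
blog_blocked = is_blocked("https://example.com/blog/green-widgets/")
```

This only confirms what the file says; it tells you nothing about whether the URLs have actually dropped out of the index yet.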
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
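In Excel this is a pivot table or a few COUNTIF/SUMIF formulas. For anyone working from an export instead, the same roll-up can be sketched in Python. The row keys (action, sessions, revenue) and the sample numbers are assumptions for illustration.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions and revenue for each Action."""
    summary = defaultdict(lambda: {"urls": 0, "sessions": 0, "revenue": 0.0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["sessions"] += row["sessions"]
        bucket["revenue"] += row["revenue"]
    return dict(summary)


# Hypothetical dashboard rows.
rows = [
    {"url": "/old-a/", "action": "Remove", "sessions": 5, "revenue": 0.0},
    {"url": "/old-b/", "action": "Remove", "sessions": 10, "revenue": 0.0},
    {"url": "/widgets/", "action": "Improve", "sessions": 900, "revenue": 150.0},
]
summary = summarize_actions(rows)
```

The totals for "Remove" are the ones to look at before pruning: they show how much organic traffic and revenue is at stake.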
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum: This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow: An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic: Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
from Moz Blog https://moz.com/blog/content-audit via IFTTT from IM Local SEO Blog http://imlocalseo.blogspot.com/2017/03/how-to-do-content-audit-updated-for-2017.html via IFTTT from Blogger http://nereomata.blogspot.com/2017/03/how-to-do-content-audit-updated-for-2017.html via IFTTT
0 notes
neilmberry · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
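As a rough illustration, here is how that check might look in Python once you have each URL and the href from its canonical tag (for example, from a crawl export). The function name is hypothetical, and the comparison is deliberately strict: it treats any difference, even a trailing slash, as pointing elsewhere.

```python
from urllib.parse import urljoin


def canonicalizes_elsewhere(url, canonical_href):
    """True if the page's rel="canonical" points at a different URL."""
    if not canonical_href:
        return False  # no canonical tag: treat the URL as its own canonical
    # Resolve relative canonical hrefs against the page URL before comparing.
    target = urljoin(url, canonical_href.strip())
    return target != url


with_query = canonicalizes_elsewhere(
    "https://example.com/widgets/?sort=color",
    "https://example.com/widgets/",
)
self_referencing = canonicalizes_elsewhere(
    "https://example.com/widgets/", "/widgets/"
)
```

URLs for which this returns True can be dropped from the content audit inventory.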
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
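The combine, deduplicate, and filter routine described above can be sketched in Python like this. The data shapes are assumptions: `url_lists` is a list of URL lists from your various sources, and `crawl_results` maps each crawled URL to a dict with hypothetical keys for status code, noindex, and robots.txt blocking.

```python
def build_inventory(url_lists, crawl_results):
    """Merge URL lists, deduplicate, and keep only indexable URLs."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)

    indexable = []
    for url in merged:
        result = crawl_results.get(url)
        if result is None:
            continue  # not crawled yet, so indexability is unknown
        # Keeping only 200s drops error codes and redirects alike.
        if result["status"] != 200:
            continue
        if result["noindex"] or result["blocked_by_robots"]:
            continue
        indexable.append(url)
    return indexable


# Hypothetical inputs.
from_ga = ["https://example.com/a/", "https://example.com/b/"]
from_sitemap = ["https://example.com/b/", "https://example.com/c/"]
crawl = {
    "https://example.com/a/": {"status": 200, "noindex": False, "blocked_by_robots": False},
    "https://example.com/b/": {"status": 404, "noindex": False, "blocked_by_robots": False},
    "https://example.com/c/": {"status": 200, "noindex": True, "blocked_by_robots": False},
}
inventory = build_inventory([from_ga, from_sitemap], crawl)
```

Exact-string deduplication is the simplest option; whether to also normalize case, protocol, or trailing slashes depends on the site.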
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
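Once both inventories exist, the parity comparison can be sketched in Python. This assumes each crawl has been loaded into a dict keyed by URL, with hypothetical field names for the elements being compared; the sample data is invented.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_parity_issues(desktop, mobile):
    """Compare two crawls keyed by URL and list the fields that differ."""
    issues = []
    # Only URLs present in both crawls can be compared field by field.
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in PARITY_FIELDS:
            d_value = desktop[url].get(field, "")
            m_value = mobile[url].get(field, "")
            if d_value != m_value:
                issues.append((url, field, d_value, m_value))
    return issues


# Hypothetical crawl data for one URL.
desktop = {"/widgets/": {"title": "Green Widgets | Brand", "robots": "index,follow"}}
mobile = {"/widgets/": {"title": "BRAND NAME - MOBILE SITE", "robots": "index,follow"}}
issues = find_parity_issues(desktop, mobile)
```

URLs that appear in only one of the two crawls are worth a separate look, since this sketch skips them.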
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) it only adds behavior to content that is already present in the HTML. An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript builds the page content itself, and so needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
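Two of the warning signs listed above (hashbang URLs and unrendered template code in the title or meta tags) are easy to check for in bulk. The sketch below is a rough heuristic, not a detector. The `javascript_warning_signs` helper is hypothetical, and the double-curly-brace placeholder syntax is an assumption that only matches some frameworks.

```python
import re

# Unrendered template placeholders such as {{ page.title }}.
# The double-curly-brace syntax is an assumption; other frameworks differ.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")
# Match whole <title> elements and <meta> tags in the raw page source.
META_OR_TITLE = re.compile(r"<title>.*?</title>|<meta\b[^>]*>", re.I | re.S)


def javascript_warning_signs(url, html_source):
    """Return a list of reasons a page may need JavaScript rendering."""
    signs = []
    if "#!" in url:
        signs.append("hashbang URL")
    if any(PLACEHOLDER.search(tag) for tag in META_OR_TITLE.findall(html_source)):
        signs.append("template placeholder in title or meta tag")
    return signs


source = (
    "<html><head><title>{{ page.title }}</title>"
    '<meta name="description" content="{{ page.description }}">'
    '</head><body><div id="app"></div></body></html>'
)
print(javascript_warning_signs("http://example.com/#!/products", source))
```

An empty list does not prove the site is safe to crawl without rendering. It only means neither of these two signs was found.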
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content), export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
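For anyone who would rather script the lookup than build it in a spreadsheet, the same URL-keyed join can be sketched in Python. This is a minimal example under assumptions: the “Landing Page” and “Sessions” column names are placeholders for whatever your analytics export uses, and `index_by_url` and `add_columns` are hypothetical helpers.

```python
import csv
import io


def index_by_url(csv_text, url_column):
    """Read CSV text into a dict keyed by URL, like a lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[url_column]: row for row in reader}


def add_columns(crawl_rows, lookup, columns, default=""):
    """Copy the named columns from lookup into each crawl row, matched on URL."""
    merged = []
    for url, row in crawl_rows.items():
        extra = lookup.get(url, {})
        merged.append({**row, **{c: extra.get(c, default) for c in columns}})
    return merged


# Tiny inline samples stand in for the crawl export and the analytics export.
crawl_csv = "Address,Title\n/a/,Page A\n/b/,Page B\n"
ga_csv = "Landing Page,Sessions\n/a/,120\n"

crawl = index_by_url(crawl_csv, "Address")
ga = index_by_url(ga_csv, "Landing Page")
combined = add_columns(crawl, ga, ["Sessions"], default="0")
print(combined)
```

As with a lookup formula, the join only works if both sources write the URL the same way, so normalize protocol, host, and trailing slashes first.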
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they wanted because the content was good, a short visit is not a bad sign. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in low pages-per-visit
Just like with time-on-site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
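The response code and canonical tag rules of thumb from the list above can be applied to the combined data in one pass. The sketch below assumes each row is a dictionary with “Address” and “Status Code” columns plus a “Canonical” column. The last name is a placeholder, and `is_indexable` is a hypothetical helper.

```python
def is_indexable(row):
    """Apply two rules of thumb: 200 response and self-referencing canonical."""
    if str(row.get("Status Code", "")).strip() != "200":
        return False
    canonical = row.get("Canonical", "").strip()
    # An empty canonical is treated as self-referencing here. That is a
    # simplifying assumption and may not suit every site.
    return canonical in ("", row["Address"].strip())


rows = [
    {"Address": "/widgets/", "Status Code": "200", "Canonical": "/widgets/"},
    {"Address": "/widgets/?sort=color", "Status Code": "200", "Canonical": "/widgets/"},
    {"Address": "/old-page/", "Status Code": "404", "Canonical": ""},
]

indexable = [row["Address"] for row in rows if is_indexable(row)]
print(indexable)
```

Robots meta tags and robots.txt rules are left out on purpose to keep the example short. A real filter would need to account for them as well.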
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
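The blank-column and single-value-column checks above are tedious by hand on a wide spreadsheet, so here is a rough Python sketch of the same idea. It assumes the rows have been loaded as dictionaries. `removable_columns` and `drop_columns` are hypothetical helpers, and the “HTTPS” column name is only an example.

```python
def removable_columns(rows, keep=("Address",)):
    """Return names of columns that are blank or hold one value in every row."""
    if not rows:
        return []
    removable = []
    for column in rows[0]:
        if column in keep:
            continue
        # A blank column and a single-value column both collapse to one value.
        values = {str(row.get(column, "")).strip() for row in rows}
        if len(values) <= 1:
            removable.append(column)
    return removable


def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


rows = [
    {"Address": "/a/", "Robots HTTP Header": "", "HTTPS": "No", "Word Count": "500"},
    {"Address": "/b/", "Robots HTTP Header": "", "HTTPS": "No", "Word Count": "120"},
]

to_remove = removable_columns(rows)
cleaned = drop_columns(rows, to_remove)
print(to_remove)
print(cleaned)
```

Review the list before dropping anything. A single-value column can still be informative, as the HTTPS example above shows.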
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
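If the dashboard is ever edited outside of the spreadsheet, the same restriction can be checked with a short script. This is a sketch under assumptions: the three option labels below are my guess at a typical set, and both the “URL” column name and the `invalid_actions` helper are hypothetical. Pass your own list if your process also uses an option such as “Consolidate.”

```python
# Assumed option labels; replace with the ones your drop-down actually uses.
ALLOWED_ACTIONS = ("Leave As-Is", "Improve", "Remove")


def invalid_actions(rows, allowed=ALLOWED_ACTIONS):
    """Return (url, action) pairs whose Action is not one of the allowed options."""
    return [
        (row["URL"], row.get("Action", ""))
        for row in rows
        if row.get("Action", "") not in allowed
    ]


dashboard = [
    {"URL": "/a/", "Action": "Improve", "Details": "Rewrite thin content."},
    {"URL": "/b/", "Action": "improve?", "Details": ""},
    {"URL": "/c/", "Action": "", "Details": ""},
]

print(invalid_actions(dashboard))
```

Rows with a blank Action are reported too, which doubles as a check that every URL has been reviewed.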
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries for which content doesn't answer or deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
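Some of the sorting logic above can be turned into a first-pass suggestion that a person then reviews. The sketch below is only one way to encode it. The field names, the `suggest_action` helper, and both thresholds are placeholder assumptions rather than recommended values.

```python
def suggest_action(page, min_visits=10, min_conversion_rate=0.01):
    """Suggest a first-pass Action for one page; a human still reviews it.

    The thresholds are placeholder assumptions, not recommended values.
    """
    has_value = page["external_links"] > 0 or page["visits"] >= min_visits
    if not has_value:
        return "Remove", "No external links and very little traffic."
    if page["duplicate_risk"]:
        return "Improve", "Rewrite with unique, useful content."
    conversion_rate = page["conversions"] / page["visits"] if page["visits"] else 0.0
    if conversion_rate < min_conversion_rate:
        return "Improve", "Has links or traffic but low conversion; review manually."
    return "Leave As-Is", ""


pages = [
    {"url": "/guest-post/", "external_links": 0, "visits": 2,
     "conversions": 0, "duplicate_risk": True},
    {"url": "/top-product/", "external_links": 5, "visits": 300,
     "conversions": 1, "duplicate_risk": True},
    {"url": "/best-sellers/", "external_links": 0, "visits": 1000,
     "conversions": 50, "duplicate_risk": False},
]

for page in pages:
    print(page["url"], suggest_action(page))
```

Consolidation is deliberately left out, because deciding which overlapping pages belong together takes a manual review of the content.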
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Details based on links and traffic (i.e., 301 or 404).
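Hatchet rules like the ones above boil down to matching URL patterns and stamping the same Action and Details on every match. Here is a minimal sketch, assuming URL paths as input. The patterns mirror the examples in the text, the `apply_hatchet` helper is hypothetical, and real sites will need their own rules.

```python
import re

# (pattern, Action, Details) rules, checked in order. The URL patterns mirror
# the examples in the text; they are illustrations, not a complete rule set.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None


for path in ("/shirts/?sort=color&size=small", "/search/keyword-phrase/",
             "/blog/tag/green-widgets/", "/product/product-name/"):
    print(path, apply_hatchet(path))
```

Anything that returns `None` falls through to the scalpel, meaning it still needs an individual decision.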
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
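Before handing a robots.txt change like the one above to a developer, it can help to confirm the rule blocks what you expect and nothing more. The sketch below uses Python's standard `urllib.robotparser` module on an in-memory file. The example.com URLs and the file contents are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt with the Disallow rule under a comment heading.
robots_txt = """\
User-agent: *
#BlockedDirectories
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

search_allowed = parser.can_fetch("*", "https://www.example.com/search/green-widgets/")
blog_allowed = parser.can_fetch("*", "https://www.example.com/blog/green-widgets/")
print("search results crawlable:", search_allowed)
print("blog post crawlable:", blog_allowed)
```

The search URL should come back as not crawlable and the blog URL as crawlable. Keep in mind that Python's parser is not Google's, so treat this as a sanity check rather than proof of how any search engine will read the file.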
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
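The same tally can be produced outside the spreadsheet. This is a minimal sketch, assuming rows with an “Action” column and an “Organic Sessions” column. The second column name and the `summarize_actions` helper are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_actions(rows):
    """Return {Action: (page_count, total_organic_sessions)}."""
    counts = Counter()
    traffic = defaultdict(int)
    for row in rows:
        counts[row["Action"]] += 1
        traffic[row["Action"]] += int(row["Organic Sessions"])
    return {action: (counts[action], traffic[action]) for action in counts}


dashboard = [
    {"URL": "/a/", "Action": "Remove", "Organic Sessions": "10"},
    {"URL": "/b/", "Action": "Remove", "Organic Sessions": "5"},
    {"URL": "/c/", "Action": "Improve", "Organic Sessions": "900"},
]

summary = summarize_actions(dashboard)
for action, (pages, sessions) in summary.items():
    print(f"{action}: {pages} pages, {sessions} organic sessions")
```

A large traffic total next to “Remove” is a signal to double-check those rows before anything is pruned.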
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
How to Do a Content Audit [Updated for 2017] published first on http://elitelimobog.blogspot.com
0 notes
holmescorya · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, content audits are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, it tells Google not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit, though an efficient idea in theory, often turns into a giant mess. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way, so do not rely on a crawler as the only source of indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML sitemaps, and, if possible, an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
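The combine, deduplicate, and filter routine described above can be sketched in a few lines of Python. This is only an illustration, not part of our documented process: the function names and the row fields (status_code, meta_robots, x_robots_tag, blocked_by_robots) are hypothetical stand-ins for whatever your list-mode crawl export actually calls them, and treating only 200 responses as keepers is a simplifying assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Trim whitespace and lowercase the scheme and host so trivial variants collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, parts.fragment))


def combine_url_lists(*url_lists):
    """Merge URL lists from several sources, keeping first-seen order without duplicates."""
    seen = {}
    for urls in url_lists:
        for url in urls:
            if url.strip():
                seen.setdefault(normalize(url), None)
    return list(seen)


def keep_indexable(crawl_rows):
    """Drop rows with a noindex directive, a non-200 response, or a robots.txt block."""
    kept = []
    for row in crawl_rows:
        directives = f"{row.get('meta_robots', '')} {row.get('x_robots_tag', '')}".lower()
        if (row["status_code"] == 200
                and "noindex" not in directives
                and not row.get("blocked_by_robots", False)):
            kept.append(row["url"])
    return kept
```

You would pass the crawler export, the GA export, and the sitemap URLs to combine_url_lists, crawl the result in list mode, and then run keep_indexable over the crawl rows before merging with the main list.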
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
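If you already have a rough URL list (from a sitemap or a CMS export, for example), a quick count by top-level directory can help you decide where to split the crawl. The sketch below is a hypothetical helper, not a Screaming Frog feature, and it assumes the first path segment is a meaningful way to segment your site.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(url):
    """Return the first directory of a URL path, or "/" for root-level pages."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return "/"
    if len(segments) == 1 and not path.endswith("/"):
        return "/"  # a root-level page or file such as /about
    return f"/{segments[0]}/"


def count_by_directory(urls):
    """Count how many URLs fall under each top-level directory."""
    return Counter(top_level_directory(u) for u in urls)
```

Calling count_by_directory(urls).most_common() lists the largest directories first, which are usually the ones worth crawling on their own.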
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure > HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
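Once both inventories exist, the parity check can be as simple as comparing the same fields for the same URL. Here is a minimal sketch, assuming each inventory has been loaded into a dictionary keyed by URL; the field names are hypothetical and should be swapped for your crawler's column names.

```python
PARITY_FIELDS = ("title", "meta_description", "canonical", "meta_robots")


def find_parity_issues(desktop, mobile, fields=PARITY_FIELDS):
    """Return (url, field, desktop_value, mobile_value) for every mismatch."""
    issues = []
    # Only URLs present in both inventories can be compared field by field.
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in fields:
            desktop_value = desktop[url].get(field, "")
            mobile_value = mobile[url].get(field, "")
            if desktop_value != mobile_value:
                issues.append((url, field, desktop_value, mobile_value))
    return issues


def missing_from_one_version(desktop, mobile):
    """Return URLs that appear in only one of the two inventories."""
    return sorted(desktop.keys() ^ mobile.keys())
```

Sorting the output by field makes sitewide template problems, like one mobile title repeated on every page, stand out immediately.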
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #!, use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript needs to be executed in order to render the main content of the page, so that little or none of that content appears in the raw HTML source.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
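Two of the warning signs listed above (hashbang URLs and content-rich pages with almost nothing in the source) are easy to check programmatically. The function below is a rough heuristic of my own, not a feature of any tool mentioned here, and the 50-word threshold is an arbitrary assumption you should tune for your site.

```python
import re


def javascript_rendering_signals(url, html_source, min_visible_words=50):
    """Return a list of hints that a page may need JavaScript rendering to crawl."""
    signals = []
    if "#!" in url:
        signals.append("hashbang in URL")
    # Remove script and style blocks, then all remaining tags, to estimate
    # how much readable text exists in the unrendered source.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html_source)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    if len(text.split()) < min_visible_words:
        signals.append("very little text in raw source")
    return signals
```

An empty list does not prove the page is safe to crawl without rendering. It only means neither of these two signals fired.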
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content), export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
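For anyone who prefers a script to VLookups, the same join can be sketched in Python with nothing but the standard library. The column names below (url, organic_sessions) are hypothetical, and the sketch assumes both exports already use the same URL format. GA often reports page paths rather than full URLs, so you may need to prepend the hostname first.

```python
import csv
import io


def join_on_url(crawl_rows, ga_rows, ga_fields):
    """Add the chosen GA fields to each crawl row, matching on the url column."""
    lookup = {row["url"]: row for row in ga_rows}
    joined = []
    for row in crawl_rows:
        match = lookup.get(row["url"], {})
        merged = dict(row)
        for field in ga_fields:
            # Like a VLOOKUP with no match, a missing URL yields an empty cell.
            merged[field] = match.get(field, "")
        joined.append(merged)
    return joined


def read_csv_text(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))
```

In practice you would read both exports from files with csv.DictReader and write the joined rows back out with csv.DictWriter.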
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors quickly found what they wanted because the content was good, a low time-on-site is not a bad sign. A better proxy for engagement would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in low pages-per-visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
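The canonical tag check in the list above is one that is easy to get wrong by eye, especially when canonical values are relative paths. Here is a small sketch that classifies each URL; the function name and the three labels are my own, and it deliberately reports a missing tag separately rather than deciding for you whether such a page counts as indexable.

```python
from urllib.parse import urljoin


def canonical_status(url, canonical):
    """Classify a page's canonical tag as "self", "other", or "missing"."""
    canonical = (canonical or "").strip()
    if not canonical:
        return "missing"
    # Resolve relative canonical values such as /widgets/ against the page URL.
    resolved = urljoin(url, canonical)
    return "self" if resolved == url.strip() else "other"
```

Rows labeled "other" are the ones that canonicalize elsewhere and can usually be dropped from the content audit.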
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
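Checking for blank and single-value columns by hand gets tedious on a wide export. The sketch below shows one way to automate it, assuming the combined data has been loaded as a list of dictionaries with string values (for example via csv.DictReader). The function name is hypothetical, and since single-value columns can only "usually" be removed, review what it drops before deleting anything for good.

```python
def prune_columns(rows):
    """Drop columns that are blank in every row or hold one repeated value."""
    if len(rows) < 2:
        # With fewer than two rows, every column would look single-valued.
        return [dict(row) for row in rows]
    keep = []
    for column in rows[0].keys():
        values = {(row.get(column) or "").strip() for row in rows}
        if len(values) > 1:
            keep.append(column)
    return [{column: row.get(column, "") for column in keep} for row in rows]
```

Run it on a copy of the full data set so the original columns are still available if you need them later.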
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the steps below and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or for which it doesn't deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Details based on links and traffic (i.e., 301 or 404).
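The pattern-based examples above lend themselves to a small rule table that pre-fills Action and Details for whole sections of the site at once. This is a sketch under stated assumptions: the rule list, the regular expressions, and the function name are hypothetical, and only the parameter and internal search examples from above are encoded. Anything that matches no rule is left for manual review.

```python
import re

# Each rule is (pattern, Action, Details), checked in order against the
# URL's path and query string.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag. Once they are removed from the index, "
     "disallow /search/ in the robots.txt file."),
]


def apply_hatchet(path_and_query, rules=HATCHET_RULES):
    """Return (Action, Details) for the first matching rule, or None."""
    for pattern, action, details in rules:
        if pattern.search(path_and_query):
            return action, details
    return None
```

Keeping the rules in one list makes it easy to show the client exactly which URL patterns were handled in bulk and which were reviewed individually.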
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
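Before shipping a robots.txt change like the Disallow: /search/ line above, it can be worth confirming that the rule blocks what you expect and nothing more. Python's standard urllib.robotparser module can evaluate a draft file locally. The robots.txt content and the example.com URLs below are illustrative assumptions, and note that this only tests crawl blocking, not whether a URL has already dropped out of the index.

```python
from urllib.robotparser import RobotFileParser

# A draft robots.txt, held as text so nothing is fetched over the network.
DRAFT_ROBOTS_TXT = """\
User-agent: *
#BlockedDirectories
Disallow: /search/
"""


def is_crawlable(url, robots_txt=DRAFT_ROBOTS_TXT, user_agent="*"):
    """Return True if the draft robots.txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Running a sample of URLs from each section of the site through is_crawlable is a cheap way to catch an overly broad Disallow rule before it goes live.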
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
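If the dashboard has been exported to CSV, the counts and totals can be tallied in a few lines instead of with a pivot table. The sketch below assumes hypothetical column names (action, organic_sessions) and that blank traffic cells should count as zero.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and sum organic sessions for each Action value."""
    summary = defaultdict(lambda: {"urls": 0, "organic_sessions": 0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        # Blank or missing traffic cells are treated as zero.
        bucket["organic_sessions"] += int(row.get("organic_sessions") or 0)
    return dict(summary)
```

The resulting totals drop straight into the executive summary, for example the number of pages marked "Remove" and how much organic traffic they account for.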
Step 5: Writing up the report
Your analysis and recommendations should be delivered in a written report at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum: This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow: An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic: Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The variety of combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the word “content” and URL has been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Webmaster Tools), XML sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
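The combine-and-deduplicate step can be sketched in a few lines of Python. This is only an illustration: the three input lists and the normalization rules (lowercasing the scheme and host, treating an empty path as "/") are assumptions, and your own site may need stricter or looser rules before the list is fed back into the crawler in list mode.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase the scheme and host and treat an empty path as "/" (assumed rules)."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        parts.fragment,
    ))


def combine_url_lists(*url_lists):
    """Merge several URL lists, keeping the first occurrence of each URL."""
    seen = set()
    combined = []
    for urls in url_lists:
        for url in urls:
            if not url.strip():
                continue  # skip blank lines from exports
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                combined.append(key)
    return combined


# Hypothetical exports from the crawler, analytics, and an XML sitemap.
crawl = ["https://www.example.com/", "https://www.example.com/widgets/"]
analytics = ["https://WWW.example.com/widgets/", "https://www.example.com/old-page/"]
sitemap = ["https://www.example.com", "https://www.example.com/blog/"]

for url in combine_url_lists(crawl, analytics, sitemap):
    print(url)
```

The resulting list still contains URLs that may not be indexable; it is only the seed for the list-mode crawl described above.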
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
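If you post-process a crawl export yourself, the exclusions listed above could be expressed as a simple filter. The sketch below is an assumption-heavy illustration: the dictionary keys (status_code, meta_robots, x_robots_tag, blocked_by_robots_txt, canonical) are hypothetical names, not the column names of any particular crawler's export, and the extension list is only a starting point.

```python
from urllib.parse import urlsplit

# File types the list above suggests leaving out of a content audit.
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".swf")


def is_indexable(row):
    """Return True if a crawled row passes every exclusion rule listed above."""
    if row.get("status_code") != 200:
        return False  # 4XX/5XX errors and redirecting URLs
    if "noindex" in row.get("meta_robots", "").lower():
        return False  # robots meta noindex
    if "noindex" in row.get("x_robots_tag", "").lower():
        return False  # X-Robots-Tag noindex
    if row.get("blocked_by_robots_txt", False):
        return False  # blocked in robots.txt
    canonical = row.get("canonical", "")
    if canonical and canonical != row["url"]:
        return False  # canonicalizes to a different URL
    if urlsplit(row["url"]).path.lower().endswith(SKIPPED_EXTENSIONS):
        return False  # images, CSS, JavaScript, and SWF files
    return True


# Hypothetical rows from a crawl export.
pages = [
    {"url": "https://www.example.com/widgets/", "status_code": 200,
     "canonical": "https://www.example.com/widgets/"},
    {"url": "https://www.example.com/widgets/?sort=price", "status_code": 200,
     "canonical": "https://www.example.com/widgets/"},
    {"url": "https://www.example.com/tag/blue/", "status_code": 200,
     "meta_robots": "noindex,follow"},
    {"url": "https://www.example.com/gone/", "status_code": 404},
    {"url": "https://www.example.com/logo.png", "status_code": 200},
]

indexable = [page["url"] for page in pages if is_indexable(page)]
print(indexable)
```

Only the first row survives in this example; the query-string variant is dropped because it canonicalizes to the version without the query string.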
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
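To decide how to split the site, it can help to see how many known URLs sit under each top-level directory. A rough sketch, assuming you already have a URL list from a sitemap or analytics export (the sample URLs and the rule that a single path segment without a trailing slash belongs to the root are assumptions):

```python
from collections import defaultdict
from urllib.parse import urlsplit


def top_directory(url):
    """Return the first directory of a URL's path, or "/" for root-level pages."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return "/"
    # A single segment without a trailing slash is treated as a file in the root.
    if len(segments) == 1 and not path.endswith("/"):
        return "/"
    return "/" + segments[0] + "/"


def group_by_top_directory(urls):
    """Group URLs into chunks keyed by their top-level directory."""
    groups = defaultdict(list)
    for url in urls:
        groups[top_directory(url)].append(url)
    return dict(groups)


# Hypothetical URLs from a sitemap export.
urls = [
    "https://www.example.com/",
    "https://www.example.com/about.html",
    "https://www.example.com/blog/post-1/",
    "https://www.example.com/blog/post-2/",
    "https://www.example.com/products/widgets/blue",
]

for directory, members in sorted(group_by_top_directory(urls).items()):
    print(directory, len(members))
```

Directories with a very large count are candidates for their own crawl, or for filtering out entirely if you already plan to remove them from the index.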
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop versions of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
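Once both inventories exist, the comparison described above can be automated. The sketch below assumes each inventory has been loaded into a dictionary keyed by URL; the field names are hypothetical and would need to be mapped to whatever your crawler exports.

```python
# Elements that should match between the two versions (hypothetical field names).
FIELDS = ("title", "meta_description", "canonical", "meta_robots")


def find_parity_issues(desktop, mobile, fields=FIELDS):
    """Return (url, field, desktop_value, mobile_value) for every mismatch."""
    issues = []
    for url in sorted(set(desktop) | set(mobile)):
        desktop_page = desktop.get(url)
        mobile_page = mobile.get(url)
        if desktop_page is None or mobile_page is None:
            # The URL appears in only one of the two inventories.
            issues.append((
                url,
                "missing",
                "present" if desktop_page is not None else "absent",
                "present" if mobile_page is not None else "absent",
            ))
            continue
        for field in fields:
            if desktop_page.get(field) != mobile_page.get(field):
                issues.append((url, field, desktop_page.get(field), mobile_page.get(field)))
    return issues


# Hypothetical inventories from a desktop crawl and a mobile User-Agent crawl.
desktop = {
    "https://www.example.com/widgets/": {
        "title": "Blue Widgets | Brand Name",
        "canonical": "https://www.example.com/widgets/",
    },
}
mobile = {
    "https://www.example.com/widgets/": {
        "title": "BRAND NAME - MOBILE SITE",
        "canonical": "https://www.example.com/widgets/",
    },
}

for issue in find_parity_issues(desktop, mobile):
    print(issue)
```

In this example only the title differs, which is the kind of site-wide mismatch described in the paragraph above.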
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
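Outside of a spreadsheet, the same VLookup-style join can be done with a dictionary keyed by URL. This is a minimal sketch, not our internal process: the column names (organic_sessions, revenue) and the sample rows are hypothetical, and it assumes the URLs in both exports have already been put into the same format.

```python
def merge_by_url(crawl_rows, analytics_rows, defaults):
    """Attach analytics columns to crawl rows, using the URL as the unique identifier."""
    lookup = {row["url"]: row for row in analytics_rows}
    merged = []
    for row in crawl_rows:
        extra = lookup.get(row["url"], {})
        combined = dict(row)  # copy so the original crawl row is left untouched
        for column, default in defaults.items():
            # URLs missing from the analytics export get the default value.
            combined[column] = extra.get(column, default)
        merged.append(combined)
    return merged


# Hypothetical crawl export and analytics export.
crawl_rows = [
    {"url": "https://www.example.com/widgets/", "title": "Widgets"},
    {"url": "https://www.example.com/blog/", "title": "Blog"},
]
analytics_rows = [
    {"url": "https://www.example.com/widgets/", "organic_sessions": 120, "revenue": 450.0},
]

merged = merge_by_url(
    crawl_rows,
    analytics_rows,
    defaults={"organic_sessions": 0, "revenue": 0.0},
)
for row in merged:
    print(row)
```

Analytics exports often report paths rather than full URLs, so a normalization step is usually needed before the join will match anything.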
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
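If the combined data lives in a CSV rather than a spreadsheet, the blank-column and single-value-column cleanup can be scripted. The sketch below is one possible approach under stated assumptions: rows are dictionaries (for example from csv.DictReader), the column names are hypothetical, and the "url" column is always kept.

```python
def prune_columns(rows, keep=("url",)):
    """Drop columns that are blank everywhere or hold one value in every row."""
    if not rows:
        return rows
    droppable = set()
    for column in rows[0].keys():
        if column in keep:
            continue
        values = {str(row.get(column, "")).strip() for row in rows}
        # A blank column has the single value ""; both cases have one distinct value.
        if len(values) <= 1:
            droppable.add(column)
    return [
        {column: value for column, value in row.items() if column not in droppable}
        for row in rows
    ]


# Hypothetical rows from the combined data tab.
rows = [
    {"url": "http://www.example.com/", "status_code": "200",
     "robots_http_header": "", "https": "FALSE", "word_count": "850"},
    {"url": "http://www.example.com/blog/", "status_code": "200",
     "robots_http_header": "", "https": "FALSE", "word_count": "1200"},
]

pruned = prune_columns(rows)
print(list(pruned[0].keys()))
```

Run it only on the working copy: a single-value column such as the status code is still worth recording once in your notes before it is dropped, and with a single row of data every column other than the kept ones would be removed.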
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer, or for which it doesn't deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
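The sorting questions above can be roughed out as a first-pass rule before the manual review. The following is only a sketch: the `PageRow` fields, the 0-100 risk scale, and both thresholds are assumptions made for illustration, not values taken from URL Profiler or Copyscape.

```python
from dataclasses import dataclass


@dataclass
class PageRow:
    url: str
    visits: int            # organic entrances in the audit window
    external_links: int
    social_shares: int
    duplicate_risk: float  # hypothetical 0-100 score from a duplicate-content checker


# Hypothetical thresholds; tune them per site.
LOW_TRAFFIC = 5
HIGH_RISK = 50.0


def suggest_action(row: PageRow) -> str:
    """Return a first-pass Action for a row; a human still reviews it."""
    has_signals = row.external_links > 0 or row.social_shares > 0
    if row.duplicate_risk >= HIGH_RISK:
        # Duplicated content: rewrite it if the page has earned anything,
        # otherwise prune it.
        if has_signals or row.visits > LOW_TRAFFIC:
            return "Improve"
        return "Remove"
    if not has_signals and row.visits <= LOW_TRAFFIC:
        return "Remove"
    # Everything else needs the manual judgment described above.
    return "Review"
```

A rule like this cannot tell when an important page has had its content stolen, so high-risk rows for key pages still deserve a look before anything is marked "Improve" or "Remove".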
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from the search engine index with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
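One way to apply the same Action and Details to a whole section at once is a small table of URL patterns. This is a minimal sketch, assuming paths shaped like the example URLs above; the rule list, the `apply_hatchet` name, and the exact Details wording are illustrative, not part of any tool.

```python
import re

# (pattern, Action, Details) rules, checked in order. Patterns mirror the
# example URLs above; the wording of each Details string is illustrative.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag; later disallow /search/ in robots.txt"),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag"),
    (re.compile(r"^/category/"), "Improve",
     "Write 2-3 sentences of unique, useful category content"),
]


def apply_hatchet(path: str):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None
```

Because the rules are checked in order, a filtered category URL such as /category/widgets/?size=small is caught by the parameter rule first. URLs that return `None` are the ones left for the scalpel.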
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
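If the dashboard is exported from the spreadsheet, the same counts can be produced with a few lines of code. The rows below are made-up sample data, and the column order (URL, Action, organic sessions, organic revenue) is an assumption about how the export is laid out.

```python
from collections import defaultdict

# Hypothetical rows: (URL, Action, organic sessions, organic revenue).
rows = [
    ("/blog/old-post/", "Remove", 12, 0.0),
    ("/product/widget/", "Improve", 340, 1250.0),
    ("/category/widgets/", "Improve", 900, 4100.0),
    ("/about/", "Leave As-Is", 75, 0.0),
]


def summarize_actions(rows):
    """Return {Action: {"pages": n, "sessions": n, "revenue": x}}."""
    summary = defaultdict(lambda: {"pages": 0, "sessions": 0, "revenue": 0.0})
    for _url, action, sessions, revenue in rows:
        bucket = summary[action]
        bucket["pages"] += 1
        bucket["sessions"] += sessions
        bucket["revenue"] += revenue
    return dict(summary)


summary = summarize_actions(rows)
for action, totals in sorted(summary.items()):
    print(action, totals)
```

The totals for "Remove" are the figure to check before pruning: if those pages carry meaningful organic traffic or revenue, the recommendations need another look.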
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe Sitecatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The variety of combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
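As a rough illustration of the canonical check described above, the sketch below reads the first rel="canonical" link from a page's HTML and compares it with the URL that was crawled. It uses only the standard library; the function name is made up, and the comparison is deliberately simple (it ignores trailing slashes but does not resolve relative canonical URLs or normalize case).

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attrs.get("href")


def canonicalizes_elsewhere(url: str, html: str) -> bool:
    """True if the page declares a canonical URL different from its own."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return False
    return finder.canonical.rstrip("/") != url.rstrip("/")
```

URLs for which this returns `True` are the ones a content audit can usually leave out of the inventory.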
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the sole source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Webmaster Tools, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
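The combine, deduplicate, and filter steps above could look something like the sketch below when done outside of a spreadsheet. The row keys (`url`, `status`, `noindex`, `blocked_by_robots`) are hypothetical names for columns in a list-mode crawl export, not fields that any particular crawler is known to emit.

```python
def combine_url_lists(*url_lists):
    """Merge URL lists, dropping blanks and exact duplicates, keeping first-seen order."""
    seen = set()
    combined = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                combined.append(url)
    return combined


def keep_indexable(crawl_rows):
    """Filter list-mode crawl results down to indexable URLs.

    Each row is a dict with hypothetical keys: url, status, noindex,
    blocked_by_robots.
    """
    return [
        row["url"]
        for row in crawl_rows
        if row["status"] == 200
        and not row["noindex"]
        and not row["blocked_by_robots"]
    ]
```

The output of `combine_url_lists` is what gets crawled in list mode; the output of `keep_indexable` is what gets added to the main inventory before the final deduplication.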
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
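To decide where to split a large site, it can help to see how many known URLs (from sitemaps or an analytics export, for example) sit under each top-level directory. This is only a sketch with made-up function names; it groups by the first path segment and treats single-segment URLs without a trailing slash as root-level pages.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(url: str) -> str:
    """Return the first path segment as '/segment/', or '/' for root-level URLs."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # A single segment without a trailing slash is treated as a root-level page.
    if not segments or (len(segments) == 1 and not path.endswith("/")):
        return "/"
    return f"/{segments[0]}/"


def directory_sizes(urls):
    """Count known URLs per top-level directory, largest first."""
    return Counter(top_level_directory(u) for u in urls).most_common()
```

The largest directories in the result are the natural candidates for their own crawl.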
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without allocating more RAM, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
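Once both inventories exist, the parity check can be sketched as a comparison of the two crawls keyed by URL. The field names in `PARITY_FIELDS` are assumptions about how the crawl exports are labeled; rename them to match your own columns.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def parity_mismatches(desktop: dict, mobile: dict):
    """Compare two crawl inventories keyed by URL.

    Each value is a dict of on-page elements. Returns a list of
    (url, field, desktop_value, mobile_value) for every difference.
    """
    mismatches = []
    # Only URLs present in both inventories can be compared.
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in PARITY_FIELDS:
            d_val = desktop[url].get(field)
            m_val = mobile[url].get(field)
            if d_val != m_val:
                mismatches.append((url, field, d_val, m_val))
    return mismatches
```

A site-wide title like the one in the example above would show up here as one mismatch per URL, which makes the scale of the problem obvious at a glance.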
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #!, use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when JavaScript is responsible for building the page’s main content, so the script needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
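Two of the warning signs above (hashbang URLs and content-rich pages with very little in the raw source) can be roughly checked with the standard library. This is a heuristic sketch only: the function name and the 200-character threshold are assumptions, and a pre-rendered or isomorphic site will correctly produce no warnings.

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count visible text characters, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text_chars += len(data.strip())


def js_rendering_signals(url: str, raw_html: str, min_text_chars: int = 200):
    """Return the list of warning signs found in a URL and its raw source."""
    signals = []
    if "#!" in url:
        signals.append("hashbang URL")
    counter = TextCounter()
    counter.feed(raw_html)
    # Hypothetical threshold: a page this sparse probably builds its content with JavaScript.
    if counter.text_chars < min_text_chars:
        signals.append("very little text in raw source")
    return signals
```

The check has to run against the raw, unrendered source (what "view source" shows), not the DOM after the browser has executed the scripts.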
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
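For anyone combining the exports outside of Excel, the VLookup step amounts to a join on URL. The sketch below assumes both exports express the URL in the same form (for instance, both as full URLs or both as paths) and uses hypothetical `url` and `sessions` keys.

```python
def join_on_url(crawl_rows, analytics_rows, default_sessions=0):
    """Attach analytics metrics to crawl rows, using URL as the key.

    URLs missing from the analytics export get `default_sessions`, the
    equivalent of handling an #N/A result from VLOOKUP.
    """
    sessions_by_url = {row["url"]: row["sessions"] for row in analytics_rows}
    return [
        {**row, "sessions": sessions_by_url.get(row["url"], default_sessions)}
        for row in crawl_rows
    ]
```

If the two sources format URLs differently, normalize them to one form before joining; otherwise every lookup falls through to the default and the pages look as if they had no traffic.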
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they wanted because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in low pages-per-visit
Just like with time-on-site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with its Google PageSpeed Insights API integration.
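The response code and canonical tag rules in the list above can be combined into one rough check. The sketch below is a simplification under stated assumptions: the `url`, `status_code`, and `canonical` field names are hypothetical, and a page with no canonical tag at all is treated as indexable, which may not suit every site.

```python
def is_indexable(row):
    """Return True when a crawled row looks indexable by the two rules above."""
    # Only URLs returning a 200 (OK) response are treated as indexable.
    if row.get("status_code") != 200:
        return False
    canonical = (row.get("canonical") or "").strip()
    # Assumption: a page without a canonical tag is treated as indexable.
    if not canonical:
        return True
    # Otherwise the canonical must point back at the URL itself
    # (ignoring a trailing slash).
    return canonical.rstrip("/") == row["url"].rstrip("/")
```

It ignores robots meta tags and robots.txt rules on purpose, so treat it as a first pass rather than a full indexability test.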
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
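The blank-column and single-value-column checks are tedious to do by hand on a wide export. Below is a minimal sketch of doing both at once, assuming the rows have already been loaded as dictionaries (for example with `csv.DictReader`); the column names in the example are made up.

```python
def prune_columns(rows):
    """Drop columns that are blank everywhere or hold one repeated value."""
    # With fewer than two rows every column trivially has a single value,
    # so leave the data alone.
    if len(rows) < 2:
        return [dict(row) for row in rows]
    keep = []
    for column in rows[0]:
        values = {str(row.get(column, "")).strip() for row in rows}
        # One distinct value covers both the all-blank and single-value cases.
        if len(values) > 1:
            keep.append(column)
    return [{column: row.get(column, "") for column in keep} for row in rows]


# Hypothetical sample rows.
sample = [
    {"url": "/a", "robots_header": "", "https": "FALSE", "word_count": 300},
    {"url": "/b", "robots_header": "", "https": "FALSE", "word_count": 120},
]
print(prune_columns(sample))
```

As with the manual approach, it is worth glancing at what gets dropped. A single-value column can still carry a finding, like the HTTPS column above.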
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
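If the dashboard is later exported or edited outside of Excel, the same restriction can be checked in code. This is a rough sketch: the three labels in `ALLOWED_ACTIONS` are my assumption about what the drop-down contains, and the `url` and `action` field names are hypothetical, so adjust both to match your own sheet.

```python
# Assumed drop-down labels; change these to match your own dashboard.
ALLOWED_ACTIONS = ("Leave As-Is", "Improve", "Remove")


def invalid_actions(rows, allowed=ALLOWED_ACTIONS):
    """Return (url, action) pairs whose Action is not an allowed value."""
    return [
        (row["url"], row.get("action", ""))
        for row in rows
        if row.get("action", "") not in allowed
    ]
```

The comparison is case-sensitive by design, since inconsistent capitalization is exactly the kind of thing that breaks filtering and sorting later.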
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
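The "isn't worth rewriting" test described above (no external links, no social shares, and very few or no entrances) is easy to turn into a first-pass filter. The sketch below is only a starting point: the field names are hypothetical, and the `max_entrances` threshold of 5 is an arbitrary assumption you would tune per site.

```python
def prune_candidates(rows, max_entrances=5):
    """Return URLs with no external links, no shares, and few entrances."""
    return [
        row["url"]
        for row in rows
        if row.get("external_links", 0) == 0
        and row.get("social_shares", 0) == 0
        and row.get("entrances", 0) <= max_entrances
    ]
```

Anything it returns is a candidate for manual review, not an automatic "Remove."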
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
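Hatchet rules like the ones above lend themselves to a simple pattern table. Here is a minimal sketch, assuming you are working with URL paths and that the first matching rule should win; the patterns are taken from the examples above, but the rule order and the shortened Details text are my own choices.

```python
import re

# (pattern, Action, Details) tuples. The first matching rule wins, so the
# parameter rule is listed before the directory rules.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/category/"), "Improve",
     "Write 2-3 sentences of unique, useful content"),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None


print(apply_hatchet("/category/shoes/?sort=color"))
print(apply_hatchet("/about/"))
```

URLs that return `None` are the ones left for the scalpel.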
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
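Before shipping a robots.txt change like the one in the last example, it can help to confirm the rule blocks what you expect and nothing more. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the example.com host are placeholders, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content containing the rule from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(path, user_agent="*"):
    """Return True when robots.txt disallows crawling the given path."""
    return not parser.can_fetch(user_agent, "https://example.com" + path)


print(is_blocked("/search/keyword-phrase/"))
print(is_blocked("/blog/some-post/"))
```

The order of operations in the instruction matters: a crawler that is blocked from a URL cannot see the noindex tag on it, which is why the Disallow line is added only after the pages have dropped out of the index.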
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
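The same tally can be produced outside the spreadsheet. This is a minimal sketch, assuming each row carries hypothetical `action` and `organic_sessions` fields:

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions for each Action."""
    summary = defaultdict(lambda: {"urls": 0, "organic_sessions": 0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        # Rows without traffic data count as zero sessions.
        bucket["organic_sessions"] += row.get("organic_sessions", 0)
    return dict(summary)
```

The organic traffic total for the "Remove" bucket is the number to check before pruning anything.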
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum: This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow: An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic: Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
0 notes
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The possible combinations of the “Action” (WHAT to do) and the “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
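To make the canonical check concrete, here is a rough sketch that pulls the rel="canonical" target out of a page's HTML using Python's standard `html.parser` and flags URLs that canonicalize somewhere else. It compares the two URLs as plain strings, which is an assumption: relative canonicals or trailing-slash differences would need extra handling.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only the first canonical link found is recorded.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonicalizes_elsewhere(url, html):
    """Return True when the page declares a canonical URL other than itself."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical is not None and parser.canonical != url
```

A URL such as /widgets/?sort=color whose canonical points to /widgets/ would return True here and could be left out of the content audit.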
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
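The combine, deduplicate, and filter routine described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row keys ('url', 'status', 'noindex', 'robots_blocked') are hypothetical names for columns you would map from your own crawl export, and the sketch treats anything other than a 200 response as non-indexable, which also drops redirects.

```python
def combine_lists(*url_lists):
    """Merge URL lists from several sources and drop exact duplicates."""
    seen = set()
    combined = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                combined.append(url)
    return combined


def indexable_urls(crawl_rows):
    """Return a sorted, deduplicated list of URLs that look indexable.

    Each row is a dict with hypothetical keys: 'url', 'status' (int),
    'noindex' (bool) and 'robots_blocked' (bool).
    """
    keep = set()
    for row in crawl_rows:
        if row["status"] != 200:
            continue  # error codes and redirects are left out
        if row["noindex"] or row["robots_blocked"]:
            continue  # noindexed or blocked by robots.txt
        keep.add(row["url"].strip())
    return sorted(keep)
```

You would feed combine_lists() the URLs from GA, sitemaps, and any database export, crawl the result in list mode, and then pass the crawl rows to indexable_urls().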
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
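One way to spot a pattern like /tag/ early is to count the URLs collected so far by their first directory. The sketch below is a rough illustration, not part of any tool mentioned here; the function name and the choice to group by the first path segment are my own assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_top_directory(urls):
    """Count URLs per first path segment, e.g. '/tag/' or '/blog/'."""
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # Treat the first segment as a directory only if something follows
        # it or the path ends with a slash; otherwise file it under the root.
        if segments and (len(segments) > 1 or path.endswith("/")):
            key = "/" + segments[0] + "/"
        else:
            key = "/"
        counts[key] += 1
    return counts
```

A directory that dominates the counts but contributes little organic revenue is a good candidate for a single pattern-level recommendation.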
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
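Once both inventories exist, the comparison itself is mechanical. Here is a minimal sketch, assuming each inventory has been loaded into a dict keyed by URL whose values are dicts of page elements. The field names are hypothetical and should be mapped to whatever columns your crawler exports.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def parity_issues(desktop, mobile, fields=PARITY_FIELDS):
    """Compare two inventories keyed by URL.

    Returns a list of (url, field, desktop_value, mobile_value) tuples
    for every field that differs between the two versions of a page.
    URLs present in only one inventory are reported with field 'missing'
    and two booleans saying which side has the URL.
    """
    issues = []
    for url in sorted(set(desktop) | set(mobile)):
        d, m = desktop.get(url), mobile.get(url)
        if d is None or m is None:
            issues.append((url, "missing", d is not None, m is not None))
            continue
        for field in fields:
            if d.get(field) != m.get(field):
                issues.append((url, field, d.get(field), m.get(field)))
    return issues
```

A site-wide mobile title like the one in the example above would show up as one "title" issue per URL.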
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
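Two of the warning signs listed above, hashbang URLs and content-rich pages with almost no text in the raw source, can be checked in bulk with a rough heuristic. The sketch below is my own illustration: the 50-word threshold is an arbitrary assumption, and a page that is simply thin will trigger the second signal too, so treat the output as a prompt for manual review.

```python
import re


def javascript_site_signals(url, html, min_words=50):
    """Return a list of hints that a page may need JavaScript rendering."""
    signals = []
    if "#!" in url:
        signals.append("hashbang URL")
    # Strip scripts, styles and tags to estimate how much text is
    # present in the raw (unrendered) source.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    if len(text.split()) < min_words and "<iframe" not in html.lower():
        signals.append("very little text in source")
    return signals
```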
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
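If you would rather script the VLookup step than do it in the spreadsheet, the same join can be sketched in Python. This is an illustration only: the row keys are hypothetical, and it assumes the URLs in both exports have already been normalized to the same form (GA often reports page paths rather than full URLs).

```python
def join_on_url(crawl_rows, analytics_rows, metric_fields):
    """Left-join analytics metrics onto crawl rows, keyed by URL.

    Rows are dicts with a 'url' key. URLs with no analytics data get 0
    for each metric rather than being dropped from the audit.
    """
    lookup = {row["url"]: row for row in analytics_rows}
    joined = []
    for row in crawl_rows:
        match = lookup.get(row["url"], {})
        merged = dict(row)
        for field in metric_fields:
            merged[field] = match.get(field, 0)
        joined.append(merged)
    return joined
```

Keeping URLs that have no analytics data matters here, since pages with zero organic traffic are exactly the ones a content audit needs to surface.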
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors leave quickly because good content gave them what they wanted, low time-on-site is not a bad sign. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
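The blank-column and single-value-column checks can also be run before the data ever reaches Excel. Below is a minimal sketch that assumes the combined export has been read into a list of dicts (for example with csv.DictReader). The function names are my own, and the list of candidate columns is worth reviewing before anything is dropped.

```python
def removable_columns(rows):
    """Return column names that are blank everywhere or hold one value.

    rows is a list of dicts that all share the same keys.
    """
    if not rows:
        return []
    removable = []
    for column in rows[0]:
        values = {(row.get(column) or "").strip() for row in rows}
        if len(values) <= 1:
            removable.append(column)
    return removable


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    drop = set(columns)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]
```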
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries for which content doesn't answer or deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
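For the purely URL-based hatchet examples above, the Action and Details can be assigned in bulk with a few pattern rules. The sketch below is an illustration under assumptions: the regular expressions simply mirror the example URL patterns, the rule order is my own choice, and rules that depend on other data (such as the Risk Score for product pages) are deliberately left out.

```python
import re

# (pattern, action, details) tuples; the first matching rule wins.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag. Once removed from the index, "
     "disallow /search/ in the robots.txt file."),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag."),
]


def apply_hatchet(path):
    """Return (action, details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None
```

Anything that returns None is left for the scalpel, meaning manual review in the dashboard.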
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
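Before shipping a robots.txt change like the Disallow: /search/ line, it is worth confirming that it blocks what you intend and nothing else. One way to do that, sketched here with Python's standard urllib.robotparser module, is to parse the proposed file and test a few sample paths. The helper name is_blocked is my own, and the standard-library parser only approximates how Google interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, path, user_agent="*"):
    """Return True if the given robots.txt text disallows the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Example robots.txt content for illustration only.
PROPOSED = """\
#BlockedDirectories
User-agent: *
Disallow: /search/
"""
```

With this in place, is_blocked(PROPOSED, "/search/green-widgets/") should be True, while a normal content path such as "/blog/post/" should not be blocked.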
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
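If the dashboard has been exported to rows, the same counts can be produced with a short script. This is a sketch under assumptions: the keys 'action', 'organic_sessions', and 'revenue' are hypothetical names for the Action column and whichever traffic and revenue columns you pulled in Step 2.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Total URLs, organic sessions and revenue per Action."""
    summary = defaultdict(
        lambda: {"urls": 0, "organic_sessions": 0, "revenue": 0.0}
    )
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["organic_sessions"] += row.get("organic_sessions", 0)
        bucket["revenue"] += row.get("revenue", 0.0)
    return dict(summary)
```

The totals for pages marked "Remove" are the numbers a client will ask about first, so it helps to have them ready before writing the report.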
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The possible combinations of the “Action” (WHAT to do) and the “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
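As a rough illustration of the crawlable vs. indexable distinction described above, the sketch below treats a crawled URL as indexable only when it returns a 200, carries no noindex directive, is not blocked by robots.txt, and has a canonical that is either absent or points to itself. The `CrawlRow` fields are hypothetical column names chosen for this example, not the export format of any particular crawler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRow:
    # Hypothetical fields; map them to whatever your crawler exports.
    url: str
    status_code: int
    meta_robots: str = ""          # e.g. "noindex,follow"
    x_robots_tag: str = ""         # value of the X-Robots-Tag response header
    canonical: Optional[str] = None
    blocked_by_robots_txt: bool = False


def is_indexable(row: CrawlRow) -> bool:
    """Return True if the row looks indexable for content-audit purposes."""
    if row.blocked_by_robots_txt or row.status_code != 200:
        return False
    directives = f"{row.meta_robots},{row.x_robots_tag}".lower()
    if "noindex" in directives:
        return False
    # A canonical pointing elsewhere hands ranking signals to the other URL.
    if row.canonical and row.canonical.rstrip("/") != row.url.rstrip("/"):
        return False
    return True
```

With this check, a URL such as /shoes?sort=color that canonicalizes to /shoes is dropped from the content audit inventory, while /shoes itself is kept.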
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Webmaster Tools), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
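The combine-and-deduplicate step can be sketched in a few lines. This is a minimal example, assuming each source has already been exported as a plain list of absolute URLs; the normalization rules (lowercasing the scheme and host, dropping the #fragment) are assumptions made for this example and may be too aggressive or too lenient for a given site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment so trivial variants dedupe."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def merge_url_lists(*sources):
    """Combine URL lists from several sources, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged
```

The merged list is what would be fed to the crawler in list mode; non-indexable URLs are filtered out only after that crawl, as described above.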
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
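One way to decide where to split is to bucket a known URL list (from a sitemap or database export, for instance) by top-level directory and see how large each chunk is. The sketch below is illustrative only; the "(root)" bucket name is an arbitrary choice for pages that do not sit inside a directory.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_top_directory(urls):
    """Bucket URLs by first path segment, e.g. /blog/post-1 -> 'blog'."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # /blog/ and /blog/post-1 belong to "blog"; / and /about go to the root bucket.
        in_directory = len(segments) > 1 or (bool(segments) and path.endswith("/"))
        key = segments[0] if in_directory else "(root)"
        groups[key].append(url)
    return dict(groups)
```

Sorting the resulting buckets by size gives a quick view of which directories are likely to need a crawl of their own.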
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases (one for mobile and one for desktop) but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
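Once both inventories exist, the parity check described above can be automated. The sketch below assumes each version of a page has been reduced to a dictionary of the elements mentioned in this section; the field names are hypothetical and would need to be mapped to the columns of your own crawl exports.

```python
# Elements this section says should match between the two versions of a page.
PARITY_FIELDS = ("title", "description", "canonical", "meta_robots", "rel_next", "rel_prev")


def parity_mismatches(desktop: dict, mobile: dict, fields=PARITY_FIELDS):
    """Return {field: (desktop_value, mobile_value)} for every field that differs."""
    mismatches = {}
    for field in fields:
        d_val = (desktop.get(field) or "").strip()
        m_val = (mobile.get(field) or "").strip()
        if d_val != m_val:
            mismatches[field] = (d_val, m_val)
    return mismatches
```

Running this for every URL present in both crawls would surface cases like the generic mobile title tag in the example above.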
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
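The warning signs listed above can be turned into a rough first-pass check against the raw (unrendered) HTML of a page. This is only a heuristic sketch: the 50-word threshold and the `{{ ... }}` placeholder pattern are assumptions made for this example, and a manual look at the source is still the more reliable test.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
# Unrendered template placeholders such as {{ page.title }} left in the raw source.
TEMPLATE_RE = re.compile(r"\{\{.*?\}\}")


def javascript_site_signals(url: str, raw_html: str, min_words: int = 50):
    """Return a list of hints that the page may need JavaScript rendering."""
    signals = []
    if "#!" in url:
        signals.append("hashbang URL")
    # Strip scripts, styles, and tags to approximate the text visible without rendering.
    visible = TAG_RE.sub(" ", SCRIPT_RE.sub(" ", raw_html))
    if len(visible.split()) < min_words:
        signals.append("very little text in raw source")
    if TEMPLATE_RE.search(raw_html):
        signals.append("unrendered template placeholder")
    return signals
```

An empty list does not prove the site is free of rendering issues, particularly on sites that pre-render for crawlers as described above.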
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
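For anyone working outside a spreadsheet, the VLookup step amounts to a left join keyed on URL. The sketch below assumes both exports have been loaded as lists of dictionaries (for example with `csv.DictReader`) and that both use the same URL format; the column names are hypothetical.

```python
def join_on_url(crawl_rows, analytics_rows, url_key="url"):
    """Left-join analytics metrics onto crawl rows, like a VLOOKUP keyed on URL."""
    lookup = {row[url_key]: row for row in analytics_rows}
    joined = []
    for row in crawl_rows:
        metrics = lookup.get(row[url_key], {})
        # Crawl columns win on name clashes; URLs without analytics data keep crawl data only.
        joined.append({**metrics, **row})
    return joined
```

A common gotcha is that analytics exports often contain paths (/page/) while crawl exports contain full URLs, so one side usually needs to be converted before the join will match anything.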
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in low pages-per-visit
Just like with time-on-site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
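Finding blank and single-value columns by hand gets tedious on wide exports. As a minimal sketch, assuming the combined data has been loaded as a list of dictionaries, the function below lists the columns that hold at most one distinct value; the `keep` parameter and the column names in the usage example are hypothetical.

```python
def prunable_columns(rows, keep=("url",)):
    """Return column names that are blank everywhere or hold a single value."""
    columns = {name for row in rows for name in row}
    prunable = []
    for name in sorted(columns):
        if name in keep:
            continue
        values = set()
        for row in rows:
            value = row.get(name)
            values.add("" if value is None else str(value).strip())
        # One distinct value (including "all blank") carries no information for sorting.
        if len(values) <= 1:
            prunable.append(name)
    return prunable
```

It only reports candidates; a single-value column can still be worth a glance before deletion, as the HTTPS example above shows.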
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
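If the dashboard is later exported or edited by several people, the same constraint can be checked outside the spreadsheet. The allowed values below are simply the Action labels that appear in the text of this tutorial, which is an assumption on my part; the drop-down in your own dashboard may use a different set.

```python
# Action labels that appear in this tutorial's text; adjust to match your own drop-down.
ALLOWED_ACTIONS = ("Leave As-Is", "Improve", "Consolidate", "Remove")


def invalid_actions(rows, allowed=ALLOWED_ACTIONS):
    """Return (url, action) pairs whose Action is not one of the allowed labels."""
    return [
        (row["url"], row.get("action", ""))
        for row in rows
        if row.get("action", "") not in allowed
    ]
```

Rows with a missing Action are reported as well, which doubles as a list of URLs that have not been worked yet.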
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries for which content doesn't answer or deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
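Some of the questions above can be pre-answered with simple rules before the manual review starts. The sketch below is one possible first pass, not a substitute for judgment: every threshold is a placeholder chosen for this example, the metric names are hypothetical, and anything the rules do not cover is deliberately left undecided.

```python
def suggest_action(row, min_visits=10, good_visits=500, low_conversion_rate=0.005):
    """Suggest a first-pass (action, details) for one URL; thresholds are placeholders."""
    visits = row.get("visits", 0)
    links = row.get("external_links", 0)
    shares = row.get("social_shares", 0)
    conversions = row.get("conversions", 0)

    # No links, no shares, and almost no traffic: candidate for pruning.
    if links == 0 and shares == 0 and visits < min_visits:
        return ("Remove", "No links or shares and very little traffic; allow to 404/410.")
    if visits >= good_visits:
        if conversions / visits < low_conversion_rate:
            return ("Improve", "High entrances but low conversion; review the page manually.")
        return ("Leave As-Is", "Good traffic and conversions.")
    # Everything else needs a human decision.
    return (None, "No rule matched; decide manually.")
```

Consolidation decisions are left out on purpose, since spotting overlapping topics requires reading the content.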
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
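The hatchet examples above amount to pattern rules applied in bulk. Below is a minimal Python sketch of that idea, assuming the URL patterns from the examples (query parameters, /search/, /blog/tag/). The rule list and the apply_hatchet function are hypothetical names for illustration, not part of any tool mentioned here, and the tech audit should still take precedence.

```python
from urllib.parse import urlsplit

# Hypothetical rules mirroring the examples above: (name, test, Action, Details).
# The URL patterns are placeholders; adjust them to the site being audited.
HATCHET_RULES = [
    ("parameter URL",
     lambda parts: bool(parts.query),
     "Remove",
     "Rel canonical to the base page without the parameter"),
    ("internal search",
     lambda parts: parts.path.startswith("/search/"),
     "Remove",
     "Apply a noindex meta tag"),
    ("blog tag page",
     lambda parts: parts.path.startswith("/blog/tag/"),
     "Remove",
     "Apply a noindex meta tag"),
]


def apply_hatchet(url):
    """Return (Action, Details) for the first matching rule, else None."""
    parts = urlsplit(url)
    for _name, test, action, details in HATCHET_RULES:
        if test(parts):
            return action, details
    return None  # no bulk rule applies; review this URL with the scalpel
```

URLs that return None are the ones left for the scalpel.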
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from the search engine index with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
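The order in that last instruction matters: a crawler has to be able to fetch a page to see its noindex tag, so the Disallow line goes in only after the pages have dropped out of the index. As a rough way to confirm the robots.txt rule behaves as intended once it is added, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the is_blocked helper are illustrative assumptions, not the actual file from any site.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the comment line and path follow the text above.
ROBOTS_TXT = """\
User-agent: *
#BlockedDirectories
Disallow: /search/
"""


def is_blocked(url, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if the given robots.txt rules disallow crawling the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

Running is_blocked against a handful of URLs that should stay crawlable is a cheap guard against an overly broad Disallow rule.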
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
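The same tally can be produced outside the spreadsheet. The sketch below assumes each dashboard row is a dictionary with an "Action" key plus hypothetical "Organic Sessions" and "Organic Revenue" keys; rename them to match your own column headers.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions and revenue per Action."""
    summary = defaultdict(lambda: {"urls": 0, "sessions": 0, "revenue": 0.0})
    for row in rows:
        bucket = summary[row["Action"]]
        bucket["urls"] += 1
        # Column names below are assumptions, not headers from any tool.
        bucket["sessions"] += row.get("Organic Sessions", 0)
        bucket["revenue"] += row.get("Organic Revenue", 0.0)
    return dict(summary)
```

The totals for rows marked "Remove" show how much organic traffic and revenue is at stake before anything is pruned.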
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of “Action” (WHAT to do) and “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console, XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
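The combine, crawl, filter, and deduplicate steps above can be sketched in a few lines of Python. This is only an illustration: the normalization choices (lowercasing scheme and host, dropping fragments) and the row keys "status", "noindex", and "blocked" are assumptions, not column names from Screaming Frog or any other tool.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host and drop the fragment so duplicates collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def combine_url_lists(*url_lists):
    """Merge URL lists from several sources into one sorted, deduplicated list."""
    return sorted({normalize(u) for urls in url_lists for u in urls if u.strip()})


def keep_indexable(crawl_rows):
    """Keep URLs that returned 200 and are neither noindexed nor robots-blocked.

    The keys used here are hypothetical names for a list-mode crawl export.
    """
    return [row["url"] for row in crawl_rows
            if row["status"] == 200 and not row["noindex"] and not row["blocked"]]
```

The output of combine_url_lists is what gets crawled in list mode; keep_indexable is then applied to the results of that crawl.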
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
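To decide where to split, it can help to see how many known URLs fall under each top-level directory first. The sketch below assumes you already have a URL list from a sitemap or analytics export; count_by_directory is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_directory(urls):
    """Count URLs per top-level directory; files at the root count under '/'."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url).path.split("/")
        # "/blog/post/" splits into ["", "blog", "post", ""], so parts[1] is
        # the top-level directory. "/about" has no second slash: root level.
        top = f"/{parts[1]}/" if len(parts) > 2 else "/"
        counts[top] += 1
    return counts
```

Directories with the largest counts are the natural candidates for their own crawl.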
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
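Once both versions of a page have been fetched (one request per User-Agent), the parity check described above is a straightforward comparison. Here is a minimal sketch using Python's standard html.parser. It assumes you already have the desktop and mobile HTML as strings, and the class and function names are illustrative only.

```python
from html.parser import HTMLParser


class HeadElements(HTMLParser):
    """Collect the title, meta description, meta robots, and canonical URL."""

    def __init__(self):
        super().__init__()
        self.found = {"title": "", "description": "", "robots": "", "canonical": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in ("description", "robots"):
            self.found[name] = attrs.get("content") or ""
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.found["canonical"] = attrs.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.found["title"] += data.strip()


def extract(html):
    parser = HeadElements()
    parser.feed(html)
    return parser.found


def parity_issues(desktop_html, mobile_html):
    """Return {element: (desktop_value, mobile_value)} for every mismatch."""
    desktop, mobile = extract(desktop_html), extract(mobile_html)
    return {k: (desktop[k], mobile[k]) for k in desktop if desktop[k] != mobile[k]}
```

An empty result means the two versions agree on these elements. It says nothing about body content, which still needs its own review.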
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when JavaScript is needed to build the page’s main content, so a crawler has to execute it in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
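Two of the warning signs listed above (hashbang URLs, and content-rich pages whose source holds very little text) can be roughly screened for in bulk. The sketch below is a heuristic only: the 50-word threshold is an arbitrary assumption, and the function name is hypothetical. Pre-rendered or isomorphic sites should pass this check, since their raw source already contains the content.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Accumulate text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def likely_needs_rendering(url, raw_html, min_words=50):
    """Flag hashbang URLs and pages whose raw source has almost no text."""
    if "#!" in url:
        return True
    parser = VisibleText()
    parser.feed(raw_html)
    words = " ".join(parser.chunks).split()
    return len(words) < min_words  # threshold is an assumption; tune per site
```

Pages flagged here are worth crawling again with JavaScript rendering enabled and comparing the word counts.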
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
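For readers working outside Excel, the VLookup step is a keyed join on URL. The sketch below assumes both data sets use the same URL form (analytics exports often hold paths rather than full URLs, so they may need converting first) and uses hypothetical "sessions" and "revenue" keys.

```python
def join_on_url(crawl_rows, analytics_rows, default=0):
    """Attach analytics metrics to crawl rows, keyed on URL (like a VLOOKUP)."""
    lookup = {row["url"]: row for row in analytics_rows}
    merged = []
    for row in crawl_rows:
        metrics = lookup.get(row["url"], {})
        merged.append({
            **row,
            # Metric names are assumptions; URLs missing from analytics get the default.
            "sessions": metrics.get("sessions", default),
            "revenue": metrics.get("revenue", default),
        })
    return merged
```

URLs that fall back to the default had no recorded organic traffic in the chosen timeframe, which is itself a useful signal for the audit.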
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors left quickly because the content gave them what they wanted, a low time-on-site isn’t a bad sign. A better proxy for engagement would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
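For anyone who would rather script this cleanup than click through filters, here is a minimal Python sketch of three of the steps above: filtering by Content Type, then dropping blank and single-value columns. The column names "Address" and "Content Type", and the default of keeping only text/html, are assumptions for illustration; check them against your own export before relying on it.

```python
def clean_rows(rows, keep_content_types=("text/html",)):
    """Filter rows by Content Type, then drop blank and single-value columns.

    `rows` is a list of dicts, one per URL (for example from csv.DictReader).
    The "Address" and "Content Type" column names are assumptions.
    """
    # Exports often append a charset ("text/html; charset=utf-8"),
    # so match on the start of the value rather than the whole value.
    wanted = tuple(t.lower() for t in keep_content_types)
    kept = [
        row for row in rows
        if (row.get("Content Type") or "").lower().startswith(wanted)
    ]
    if not kept:
        return []

    # Keep a column only if it holds more than one distinct value.
    # A blank column has the single value "" and is dropped too.
    # "Address" is always kept because it identifies the row.
    useful = [
        col for col in kept[0].keys()
        if col == "Address"
        or len({(row.get(col) or "").strip() for row in kept}) > 1
    ]
    return [{col: row.get(col, "") for col in useful} for row in kept]
```

Single-value columns can usually, not always, be removed, so it is worth glancing at what the function dropped before deleting anything from the master copy.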
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
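As a rough illustration of splitting by directory, the sketch below groups URLs by their first path segment so each group can be saved to its own, smaller spreadsheet. Treating the first path segment as the "directory" is an assumption; some sites will need a different split.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_directory(urls):
    """Group URLs by top-level directory, e.g. "/blog/" or "/product/"."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # Treat the first segment as a directory only if something follows it
        # or the path ends in a slash; otherwise file the URL under the root.
        if len(segments) >= 2 or (segments and path.endswith("/")):
            key = "/" + segments[0] + "/"
        else:
            key = "/"
        groups[key].append(url)
    return dict(groups)
```

Each key in the returned dictionary can then be written out as a separate file and worked on independently.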
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
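If the dashboard is ever edited outside of Excel, the same restriction can be checked in a script. This is only a sketch: the three labels below are assumed, since the drop-down itself is not reproduced here, and the "URL" and "Action" column names are hypothetical. Adjust the allowed list to whatever options your own dashboard uses.

```python
# Assumed labels for the Action drop-down; change these to match your sheet.
ALLOWED_ACTIONS = ("Leave As-Is", "Improve", "Remove")


def invalid_actions(rows, allowed=ALLOWED_ACTIONS):
    """Return (url, action) pairs whose Action is not an allowed option."""
    return [
        (row["URL"], row.get("Action", ""))
        for row in rows
        if row.get("Action", "") not in allowed
    ]
```

The comparison is deliberately exact, so "improve" or "Improve " (with a trailing space) is reported, which is the kind of inconsistency that breaks filtering and sorting later.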
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on, the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of the checks described below and not others, or do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or doesn't deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
Whether the URL gets redirected or simply removed will depend on its link equity and traffic.
Out-of-date content that isn't worth updating or consolidating
Whether the URL gets redirected or simply removed will depend on its link equity and traffic.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
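The questions above can be turned into a rough first-pass rule to pre-fill the Action column before a manual review. The sketch below is one possible reading, not a fixed method: the dictionary keys, the `min_visits` threshold, and the idea of reducing engagement to a single true/false flag are all assumptions made for illustration.

```python
def suggest_action(page, min_visits=50):
    """Suggest a first-pass Action for one page; a human should review it.

    `page` is a dict with hypothetical keys: visits, external_links,
    social_shares, duplicate (bool), and good_engagement (bool).
    """
    has_value = (
        page["visits"] >= min_visits
        or page["external_links"] > 0
        or page["social_shares"] > 0
    )
    if not has_value:
        # No links, no shares, and very little traffic: a pruning candidate.
        return "Remove"
    if page["duplicate"] or not page["good_engagement"]:
        # Worth keeping, but the content or its performance needs work.
        return "Improve"
    return "Leave As-Is"
```

A rule like this cannot tell that an important page is flagged as duplicate only because its content was stolen, so the output is a starting point for sorting rather than a final decision.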
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Details based on links and traffic (i.e., 301 or 404).
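A hatchet approach boils down to matching URL patterns and stamping the same Action and Details on everything that matches. The Python sketch below shows the idea for three of the URL-based examples above. The regular expressions and the shortened Details wording are assumptions for illustration, and the rules are only appropriate if the tech audit didn't suggest otherwise.

```python
import re

# (pattern, Action, Details) rules, checked in order. Patterns and wording
# are illustrative assumptions, not a fixed rule set.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
]


def apply_hatchet(path_and_query):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path_and_query):
            return action, details
    return None
```

Anything that returns None is left for the scalpel: those URLs still need to be reviewed individually.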
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
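Before handing over a robots.txt instruction like the last one, it can help to confirm that the rule blocks what you expect and nothing else. The sketch below uses Python's standard `urllib.robotparser` for that. The robots.txt text and the example.com URLs are made up for illustration, and the standard-library parser may not interpret every rule exactly the way Google does.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URLs, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /search/
"""
urls = [
    "http://example.com/search/keyword-phrase/",
    "http://example.com/blog/post/",
]
blocked = blocked_by_robots(robots_txt, urls)
```

Here `blocked` holds only the /search/ URL. The order of operations in the instruction still matters: the disallow rule goes in only after the pages have dropped out of the index, because a blocked page can no longer be crawled to see its noindex tag.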
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
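In Excel this is a pivot table or a couple of COUNTIF and SUMIF formulas. For those working from an export instead, a minimal Python sketch of the same tally follows. The "Action" and "Organic Sessions" column names are assumptions, and revenue could be summed the same way.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions for each Action.

    Each row is a dict with "Action" and "Organic Sessions" keys; both
    column names are assumptions for this sketch.
    """
    summary = defaultdict(lambda: {"urls": 0, "organic_sessions": 0})
    for row in rows:
        bucket = summary[row["Action"]]
        bucket["urls"] += 1
        # Blank cells are treated as zero sessions.
        bucket["organic_sessions"] += int(row["Organic Sessions"] or 0)
    return dict(summary)
```

The total for "Remove" is the number to look at first: if the pages marked for pruning account for meaningful organic traffic, the recommendations deserve a second look before they go into the report.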
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, content audits are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The possible combinations of the “Action” (WHAT to do) and the “Details” (HOW, and sometimes why, to do it) are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel="canonical" tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, it tells Google not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
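To make the canonical check concrete, here is a small Python sketch that reads a page's HTML, finds the rel="canonical" link, and reports whether it points somewhere other than the URL that was crawled. It uses only the standard library. The function and class names are hypothetical, and the comparison is an exact string match with no URL normalization, which is an assumption that keeps the example short.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonicalizes_elsewhere(url, html):
    """True if the page declares a canonical URL different from `url`."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonical:
        return False
    # Resolve relative hrefs against the crawled URL before comparing.
    return urljoin(url, finder.canonical) != url
```

A URL such as /widgets/?sort=color whose canonical tag points to /widgets/ returns True here and can be left out of the content audit inventory.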
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
This is why trying to combine a technical SEO audit with a content audit, though efficient in theory, often turns into a giant mess. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
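The combine, deduplicate, crawl, and filter routine described above could be sketched in Python as follows. The crawl itself still happens in your crawler of choice. The row keys (`status_code`, `meta_robots`, `x_robots_tag`, `blocked_by_robots`) are hypothetical names for whatever columns your list-mode crawl exports, and keeping only 200 responses is a simplification that also drops redirects.

```python
def merge_url_lists(*url_lists):
    """Combine several URL lists and deduplicate, keeping first-seen order."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def keep_indexable(crawl_rows):
    """Keep rows that returned 200 and are neither noindexed nor blocked."""
    return [
        row for row in crawl_rows
        if row["status_code"] == 200
        and "noindex" not in row.get("meta_robots", "").lower()
        and "noindex" not in row.get("x_robots_tag", "").lower()
        and not row.get("blocked_by_robots", False)
    ]
```

`merge_url_lists` can be run twice: once to build the list for the list-mode crawl, and again to fold the surviving URLs into the main crawl inventory.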
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
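Once both inventories exist, the comparison itself is simple to script. Here is a rough sketch that assumes each inventory has been loaded into a dictionary keyed by URL, with one dictionary of elements per page. The field names (title, description, canonical, robots) are my own labels, not column names from any particular crawler export.

```python
def find_parity_issues(desktop, mobile,
                       fields=("title", "description", "canonical", "robots")):
    """Compare desktop and mobile inventories and list every mismatch.

    Each issue is a tuple: (url, field, desktop_value, mobile_value).
    A URL found in only one inventory is reported with field "url".
    """
    issues = []
    for url in sorted(set(desktop) | set(mobile)):
        d = desktop.get(url)
        m = mobile.get(url)
        if d is None or m is None:
            issues.append((
                url,
                "url",
                "missing" if d is None else "present",
                "missing" if m is None else "present",
            ))
            continue
        for field in fields:
            if d.get(field) != m.get(field):
                issues.append((url, field, d.get(field), m.get(field)))
    return issues


desktop = {"/a/": {"title": "Green Widgets", "canonical": "/a/"}}
mobile = {
    "/a/": {"title": "BRAND NAME - MOBILE SITE", "canonical": "/a/"},
    "/b/": {"title": "Blue Widgets", "canonical": "/b/"},
}
issues = find_parity_issues(desktop, mobile)
```

In this made-up example, the mobile title for /a/ does not match the desktop title, and /b/ only exists in the mobile inventory, so both are flagged for review.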
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
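For anyone who would rather script the join than use VLookups, the same idea (look up each crawled URL in the analytics export, using the URL as the key) can be sketched as below. This is only an illustration: the column names url, sessions, and revenue are assumptions, and the two exports may format URLs differently (full URL versus path), so normalize them first.

```python
def merge_by_url(crawl_rows, analytics_rows, default=0):
    """Attach analytics metrics to crawl rows, matching on the 'url' column.

    Works like a VLOOKUP: every crawl row is kept, and URLs with no
    analytics match receive the default value for each metric.
    """
    analytics_by_url = {row["url"]: row for row in analytics_rows}

    # Collect the metric column names in the order they first appear.
    metric_columns = []
    for row in analytics_rows:
        for column in row:
            if column != "url" and column not in metric_columns:
                metric_columns.append(column)

    merged = []
    for row in crawl_rows:
        match = analytics_by_url.get(row["url"], {})
        combined = dict(row)
        for column in metric_columns:
            combined[column] = match.get(column, default)
        merged.append(combined)
    return merged


crawl = [{"url": "/a/", "title": "A"}, {"url": "/b/", "title": "B"}]
analytics = [{"url": "/a/", "sessions": 120, "revenue": 35.5}]
merged = merge_by_url(crawl, analytics)
```

Here /b/ has no analytics row, so it keeps its crawl data and gets zero for both metrics, which is usually what you want for pages with no organic traffic in the period.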
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they wanted because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
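The response code and canonical tag items in the list above boil down to a simple test for whether a URL belongs in the inventory. Below is a hedged sketch of that test. The function name and parameters are hypothetical, and treating a page with no canonical tag as indexable is my assumption; adjust it to match how the domain actually behaves.

```python
def is_indexable(status_code, url, canonical=None, robots_meta=""):
    """Rough indexability test based on status code, robots meta, and canonical."""
    # Typically, only URLs that return 200 (OK) are indexable.
    if status_code != 200:
        return False
    # A robots noindex directive takes the page out of the index.
    if "noindex" in robots_meta.lower():
        return False
    # A canonical pointing at a different URL defers to that URL.
    # Assumption: a missing canonical tag is treated as self-referencing.
    if canonical and canonical.rstrip("/") != url.rstrip("/"):
        return False
    return True


page = "https://example.com/widgets/"
checks = [
    is_indexable(200, page, canonical=page),
    is_indexable(404, page, canonical=page),
    is_indexable(200, page, canonical="https://example.com/all-widgets/"),
    is_indexable(200, page, canonical=page, robots_meta="noindex,follow"),
    is_indexable(200, page),
]
```

Only the first and last checks pass in this example; the 404, the page canonicalized elsewhere, and the noindexed page are all excluded.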
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (The “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
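On very wide exports, checking each column by hand gets tedious. The blank-column and single-value-column checks can be scripted along these lines, assuming the combined data has been loaded as a list of dictionaries (one per URL). The column names in the example are invented.

```python
def prune_columns(rows):
    """Drop columns that are blank everywhere or hold one value everywhere."""
    if not rows:
        return []

    def as_text(value):
        # Treat None the same as an empty cell.
        return "" if value is None else str(value).strip()

    keep = []
    for column in rows[0].keys():
        values = {as_text(row.get(column)) for row in rows}
        # One distinct value covers both all-blank and single-value columns.
        # Note: with a single row, every column looks single-valued.
        if len(values) <= 1:
            continue
        keep.append(column)
    return [{column: row.get(column) for column in keep} for row in rows]


rows = [
    {"url": "/a/", "robots_header": "", "protocol": "http", "word_count": 300},
    {"url": "/b/", "robots_header": None, "protocol": "http", "word_count": 120},
]
pruned = prune_columns(rows)
```

Keep a copy of the full data set before pruning, since a single-value column can still be informative (as with the protocol column here).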
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries for which content doesn't answer or deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/ , /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Details based on links and traffic (i.e., 301 or 404).
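A hatchet pass like the ones above amounts to a list of URL patterns, each mapped to an Action and Details. The sketch below shows one way to apply such rules to a list of paths. The patterns and the rule list are examples only, drawn from the sample URLs above, and would need to match the site's real URL structure and the findings of the technical audit.

```python
import re

# Each rule: (pattern, Action, Details). The first matching rule wins.
# These patterns are illustrative, not a recommendation for every site.
RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove",
     "Apply a noindex meta tag."),
    (re.compile(r"^/blog/tag/"), "Remove",
     "Apply a noindex meta tag."),
]


def first_pass(path):
    """Return (Action, Details) for the first matching rule, else None."""
    for pattern, action, details in RULES:
        if pattern.search(path):
            return action, details
    # No hatchet rule applies; this URL needs the scalpel (manual review).
    return None


paths = [
    "/widgets/?sort=color&size=small",
    "/search/green-widgets/",
    "/blog/tag/green-widgets/",
    "/blog/how-to-choose-a-widget/",
]
results = {path: first_pass(path) for path in paths}
```

Anything that comes back as None is left for manual review, which keeps the bulk rules from making decisions about pages they were never meant to cover.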
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from the search engine index with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
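The order matters in that last instruction: once /search/ is disallowed, crawlers can no longer fetch those pages to see the noindex tag, so the block should come after the pages have dropped out of the index. Before deploying a robots.txt change, it can be worth checking that the rule blocks what you expect and nothing more. The sketch below uses Python's standard urllib.robotparser; the robots.txt content and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content with the rule from the instructions above.
ROBOTS_TXT = """User-agent: *
Disallow: /search/
"""


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows crawling the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


search_blocked = is_blocked(ROBOTS_TXT, "https://example.com/search/green-widgets/")
blog_blocked = is_blocked(ROBOTS_TXT, "https://example.com/blog/green-widgets/")
```

This parser follows the basic robots.txt rules and may not mirror every search engine's handling of wildcards, so treat it as a sanity check rather than a guarantee.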
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
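If the dashboard has been exported from the spreadsheet, the counts and totals per Action can be produced with a few lines of code. This is a minimal sketch that assumes each row carries an action, organic_sessions, and revenue value; those column names are mine and will differ in your export.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Total the URL count, organic sessions, and revenue for each Action."""
    summary = defaultdict(lambda: {"urls": 0, "organic_sessions": 0, "revenue": 0.0})
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["organic_sessions"] += row.get("organic_sessions", 0)
        bucket["revenue"] += row.get("revenue", 0.0)
    return dict(summary)


rows = [
    {"url": "/old-post/", "action": "Remove", "organic_sessions": 3, "revenue": 0.0},
    {"url": "/tag/red/", "action": "Remove", "organic_sessions": 1, "revenue": 0.0},
    {"url": "/widgets/", "action": "Improve", "organic_sessions": 450, "revenue": 120.0},
]
summary = summarize_actions(rows)
```

A summary like this makes it easy to show the client how little organic traffic and revenue is tied to the pages marked for removal.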
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum: This thought-provoking post begs the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow: An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic: Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
robertmcraft · 7 years
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of options between the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, with a robots meta tag, or with an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
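The combine-and-deduplicate step described above could be sketched in Python like this. It is only a minimal sketch: the `normalize` rules (lowercasing the host, dropping fragments) and the sample URLs are my own assumptions, so adjust them to how the site you're auditing treats case and trailing slashes.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Normalize a URL for deduplication (assumed rules, adjust per site)."""
    parts = urlsplit(url.strip())
    # Scheme and host are case-insensitive; the fragment never reaches the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def merge_url_lists(*url_lists):
    """Combine several URL lists, keeping the first occurrence of each URL."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical exports from the crawler, GA, and an XML sitemap.
crawl = ["https://example.com/", "https://example.com/widgets/"]
analytics = ["https://EXAMPLE.com/widgets/", "https://example.com/old-page/"]
sitemap = ["https://example.com/widgets/#reviews", "https://example.com/new-page/"]

all_urls = merge_url_lists(crawl, analytics, sitemap)
```

The merged list can then be saved as a text file and crawled in list mode.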
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
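If you have already exported crawl data, the exclusions listed above can be applied with a small filter. This is a sketch under assumptions: the dictionary keys (`url`, `status`, `robots`, `canonical`) are hypothetical column names rather than any tool's real export headers, and robots.txt blocking is not checked here.

```python
from urllib.parse import urlsplit

SKIPPED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".swf")


def is_indexable(row):
    """Return True if a crawled row looks like indexable content.

    `row` is a dict with hypothetical keys: url, status, robots, canonical.
    """
    path = urlsplit(row["url"]).path.lower()
    if path.endswith(SKIPPED_EXTENSIONS):
        return False  # images, CSS, JavaScript, and SWF files
    if row["status"] != 200:
        return False  # redirects, 4XX and 5XX errors
    if "noindex" in row.get("robots", "").lower():
        return False  # robots meta or X-Robots-Tag noindex
    canonical = row.get("canonical") or row["url"]
    return canonical == row["url"]  # canonicalized elsewhere means skip


# Hypothetical crawl rows.
rows = [
    {"url": "https://example.com/widgets/", "status": 200,
     "robots": "index,follow", "canonical": "https://example.com/widgets/"},
    {"url": "https://example.com/widgets/?sort=color", "status": 200,
     "robots": "", "canonical": "https://example.com/widgets/"},
    {"url": "https://example.com/tag/blue/", "status": 200,
     "robots": "noindex,follow", "canonical": ""},
    {"url": "https://example.com/gone/", "status": 404, "robots": "", "canonical": ""},
    {"url": "https://example.com/logo.png", "status": 200, "robots": "", "canonical": ""},
]

indexable = [row["url"] for row in rows if is_indexable(row)]
```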
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
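To decide where to split, it can help to count how many known URLs sit in each top-level directory. The sketch below groups a URL list by first path segment; the function name and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_directory(urls):
    """Group URLs by their first path segment, e.g. /blog/ or /products/."""
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # A page such as /about sits directly under the root, not in a directory.
        if segments and (len(segments) > 1 or path.endswith("/")):
            top = "/" + segments[0] + "/"
        else:
            top = "/"
        groups[top].append(url)
    return dict(groups)


# Hypothetical URL list.
urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/",
    "https://example.com/blog/post-1/",
    "https://example.com/products/widgets/blue",
]

chunks = group_by_directory(urls)
sizes = {directory: len(members) for directory, members in chunks.items()}
```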
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without making more RAM available, without filtering out the nonsense, and without using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop versions of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
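Once both inventories exist, comparing them can be as simple as the sketch below. It assumes each inventory has been loaded into a dict keyed by URL, and the field names are hypothetical stand-ins for whatever columns your crawler exports.

```python
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_parity_issues(desktop, mobile, fields=PARITY_FIELDS):
    """Compare two inventories keyed by URL and list the elements that differ."""
    issues = []
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in fields:
            d_value = desktop[url].get(field, "")
            m_value = mobile[url].get(field, "")
            if d_value != m_value:
                issues.append((url, field, d_value, m_value))
    return issues


# Hypothetical inventories from a desktop crawl and a mobile User-Agent crawl.
desktop = {
    "https://example.com/widgets/": {
        "title": "Blue Widgets | Brand",
        "canonical": "https://example.com/widgets/",
        "robots": "index,follow",
    }
}
mobile = {
    "https://example.com/widgets/": {
        "title": "BRAND NAME - MOBILE SITE",
        "canonical": "https://example.com/widgets/",
        "robots": "index,follow",
    }
}

issues = find_parity_issues(desktop, mobile)
```

URLs that appear in only one inventory are ignored here; those are worth listing separately.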
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
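The warning signs listed earlier can be turned into a rough first-pass check on the raw (unrendered) HTML. Treat this as a heuristic sketch only: the 200-character threshold and the `{{ ... }}` placeholder pattern are assumptions of mine, not rules from any tool, and a page can trip these checks for unrelated reasons.

```python
import re

# Matches whole script/style blocks first, then any remaining tag.
TAG_RE = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.IGNORECASE | re.DOTALL)


def visible_text_length(html):
    """Rough count of characters left after stripping scripts, styles, and tags."""
    return len(" ".join(TAG_RE.sub(" ", html).split()))


def javascript_signals(url, html, min_text=200):
    """Return the warning signs found for one page (thresholds are assumptions)."""
    signals = []
    if "#!" in url:
        signals.append("hashbang URL")
    if visible_text_length(html) < min_text:
        signals.append("very little text in the raw source")
    if re.search(r"\{\{.*?\}\}", html):
        signals.append("unrendered template placeholders")
    return signals


# Hypothetical raw source of a single-page app and of a plain HTML page.
app_html = (
    '<html><head><title>{{ title }}</title>'
    '<script>var a = "lots of code";</script></head>'
    '<body><div id="app"></div></body></html>'
)
plain_html = "<html><body><p>" + "word " * 100 + "</p></body></html>"

app_signals = javascript_signals("https://example.com/#!/widgets", app_html)
plain_signals = javascript_signals("https://example.com/widgets/", plain_html)
```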
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit for a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
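If the spreadsheet gets too slow for VLookups, the same join can be done in a few lines of Python before the data ever reaches Excel. This is a minimal sketch with hypothetical column names, and it assumes the crawl export and the GA export identify pages with the same URL format.

```python
def join_on_url(crawl_rows, metric_rows, defaults):
    """Attach analytics metrics to crawl rows, using the URL as the lookup key."""
    lookup = {row["url"]: row for row in metric_rows}
    combined = []
    for row in crawl_rows:
        metrics = lookup.get(row["url"], {})
        merged = dict(row)
        for column, default in defaults.items():
            # Like a VLOOKUP with a fallback value when the URL has no match.
            merged[column] = metrics.get(column, default)
        combined.append(merged)
    return combined


# Hypothetical rows from the crawl export and the GA export.
crawl_rows = [
    {"url": "/widgets/", "title": "Widgets"},
    {"url": "/about/", "title": "About"},
]
ga_rows = [{"url": "/widgets/", "sessions": 1200, "revenue": 350.0}]

combined = join_on_url(crawl_rows, ga_rows, {"sessions": 0, "revenue": 0.0})
```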
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMRush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors left quickly because the content was good and they found what they wanted, a low time-on-site is not a bad sign. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
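Finding blank and single-value columns by hand gets tedious on wide exports, so here is a sketch of how the check could be scripted. The rows and column names are hypothetical, and the result is only a list of candidates: review it before deleting anything, since a single-value column can still be meaningful.

```python
def removable_columns(rows):
    """Return column names whose cells are all blank or all share one value."""
    columns = rows[0].keys() if rows else []
    removable = []
    for column in columns:
        values = {str(row.get(column, "")).strip() for row in rows}
        if len(values) <= 1:
            removable.append(column)
    return removable


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Hypothetical rows from the combined data tab.
rows = [
    {"Address": "https://example.com/", "Status Code": 200,
     "Robots HTTP Header": "", "Title": "Home"},
    {"Address": "https://example.com/widgets/", "Status Code": 200,
     "Robots HTTP Header": "", "Title": "Widgets"},
]

candidates = removable_columns(rows)
cleaned = drop_columns(rows, candidates)
```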
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Comprised almost entirely of bits and pieces of content that exists elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer or deserve to rank for.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
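The questions above can be roughed out as a first-pass rule set that pre-fills the Action column before the manual review. This is a sketch, not a replacement for judgment: the dictionary keys and the `min_visits` threshold are assumptions, and consolidation is left out because spotting overlapping topics needs a human.

```python
def suggest_action(page, min_visits=10):
    """Suggest a starting Action for one page; a human should review every result.

    `page` is a dict with hypothetical keys: visits, external_links,
    social_shares, duplicate, converts_well.
    """
    has_equity = page["external_links"] > 0 or page["social_shares"] > 0
    has_traffic = page["visits"] >= min_visits
    if not has_equity and not has_traffic:
        return "Remove"  # no links, no shares, very few or no visits
    if page["duplicate"]:
        return "Improve"  # worth rewriting because it has links or traffic
    if has_traffic and not page["converts_well"]:
        return "Improve"  # high visits but low conversion or engagement
    return "Leave As-Is"


# Hypothetical pages.
dead_page = {"visits": 0, "external_links": 0, "social_shares": 0,
             "duplicate": True, "converts_well": False}
linked_duplicate = {"visits": 500, "external_links": 3, "social_shares": 0,
                    "duplicate": True, "converts_well": True}
healthy_page = {"visits": 500, "external_links": 0, "social_shares": 0,
                "duplicate": False, "converts_well": True}

suggestions = [suggest_action(p) for p in (dead_page, linked_duplicate, healthy_page)]
```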
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Strategy based on links and traffic (i.e., 301 or 404).
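Hatchet rules like the ones above lend themselves to a simple pattern table. The sketch below is illustrative only: the regular expressions and the shortened Details text are assumptions based on the example URL patterns, and anything that matches no rule is left for manual review.

```python
import re

# (pattern, Action, Details) rules, checked in order; patterns are illustrative.
HATCHET_RULES = [
    (re.compile(r"[?&](sort|size)="), "Remove",
     "Rel canonical to the base page without the parameter"),
    (re.compile(r"^/search/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/blog/tag/"), "Remove", "Apply a noindex meta tag"),
    (re.compile(r"^/category/"), "Improve",
     "Write 2-3 sentences of unique, useful content"),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, or None for manual review."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None


# Hypothetical URL paths from the dashboard.
paths = ["/widgets/?sort=color", "/blog/tag/green-widgets/", "/category/cat1/cat2/", "/about/"]
decisions = {path: apply_hatchet(path) for path in paths}
```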
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
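Before a robots.txt change like the Disallow line above goes live, it is worth confirming that it blocks what you expect and nothing more. Python's standard `urllib.robotparser` module can check a draft file; the robots.txt contents and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft robots.txt with the new rule added.
robots_txt = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Internal search results should be blocked; normal pages should not be.
search_allowed = parser.can_fetch("*", "https://example.com/search/green-widgets/")
widgets_allowed = parser.can_fetch("*", "https://example.com/widgets/")
```

Keep in mind that this parser follows the basic robots.txt rules, so results for wildcard patterns may differ from how Google interprets them.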
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
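If the dashboard has been exported from the spreadsheet, the same counts can be produced with a few lines of Python. This is a minimal sketch, not part of the original process: the row layout and the field names (action, organic_sessions, revenue) are assumptions about how the export might look.

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count URLs and total organic sessions and revenue for each Action."""
    summary = defaultdict(
        lambda: {"urls": 0, "organic_sessions": 0, "revenue": 0.0}
    )
    for row in rows:
        bucket = summary[row["action"]]
        bucket["urls"] += 1
        bucket["organic_sessions"] += row["organic_sessions"]
        bucket["revenue"] += row["revenue"]
    return dict(summary)


if __name__ == "__main__":
    # Made-up sample rows, one per URL in the dashboard.
    sample = [
        {"url": "/a", "action": "Remove", "organic_sessions": 0, "revenue": 0.0},
        {"url": "/b", "action": "Remove", "organic_sessions": 12, "revenue": 5.5},
        {"url": "/c", "action": "Improve", "organic_sessions": 300, "revenue": 120.0},
    ]
    for action, totals in summarize_actions(sample).items():
        print(action, totals)
```

The total for pages marked “Remove” is the number to look at first, since it shows how much organic traffic is at stake if those pages are pruned.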
Step 5: Writing up the report
Your analysis and recommendations should be delivered at the same time as the audit dashboard. The report summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits
0 notes
robertmcraft · 7 years
Text
How to Do a Content Audit [Updated for 2017]
Posted by Everett
This guide provides instructions on how to do a content audit using examples and screenshots from Screaming Frog, URL Profiler, Google Analytics (GA), and Excel, as those seem to be the most widely used and versatile tools for performing content audits.
It's been almost three years since the original “How to do a Content Audit – Step-by-Step” tutorial was published here on Moz, and it’s due for a refresh. This version includes updates covering JavaScript rendering, crawling dynamic mobile sites, and more.
It also provides less detail than the first in terms of prescribing every step in the process. This is because our internal processes change often, as do the tools. I’ve also seen many other processes out there that I would consider good approaches. Rather than forcing a specific process and publishing something that may be obsolete in six months, this tutorial aims to allow for a variety of processes and tools by focusing more on the basic concepts and less on the specifics of each step.
We have a DeepCrawl account at Inflow, and a specific process for that tool, as well as several others. Tapping directly into various APIs may be preferable to using a middleware product like URL Profiler if one has development resources. There are also custom in-house tools out there, some of which incorporate historic log file data and can efficiently crawl websites like the New York Times and eBay. Whether you use GA or Adobe SiteCatalyst, Excel, or a SQL database, the underlying process of conducting a content audit shouldn’t change much.
TABLE OF CONTENTS
What is an SEO content audit?
What is the purpose of a content audit?
How & why “pruning” works
How to do a content audit
The inventory & audit phase
Step 1: Crawl all indexable URLs
Crawling roadblocks & new technologies
Crawling very large websites
Crawling dynamic mobile sites
Crawling and rendering JavaScript
Step 2: Gather additional metrics
Things you don’t need when analyzing the data
The analysis & recommendations phase
Step 3: Put it all into a dashboard
Step 4: Work the content audit dashboard
The reporting phase
Step 5: Writing up the report
Content audit resources & further reading
What is a content audit?
A content audit for the purpose of SEO includes a full inventory of all indexable content on a domain, which is then analyzed using performance metrics from a variety of sources to determine which content to keep as-is, which to improve, and which to remove or consolidate.
What is the purpose of a content audit?
A content audit can have many purposes and desired outcomes. In terms of SEO, they are often used to determine the following:
How to escape a content-related search engine ranking filter or penalty
Content that requires copywriting/editing for improved quality
Content that needs to be updated and made more current
Content that should be consolidated due to overlapping topics
Content that should be removed from the site
The best way to prioritize the editing or removal of content
Content gap opportunities
Which content is ranking for which keywords
Which content should be ranking for which keywords
The strongest pages on a domain and how to leverage them
Undiscovered content marketing opportunities
Due diligence when buying/selling websites or onboarding new clients
While each of these desired outcomes and insights is a valuable result of a content audit, I would define the overall “purpose” of one as:
The purpose of a content audit for SEO is to improve the perceived trust and quality of a domain, while optimizing crawl budget and the flow of PageRank (PR) and other ranking signals throughout the site.
Often, but not always, a big part of achieving these goals involves the removal of low-quality content from search engine indexes. I’ve been told people hate this word, but I prefer the “pruning” analogy to describe the concept.
How & why “pruning” works
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove. Optimizing crawl budget and the flow of PR is self-explanatory to most SEOs. But how does a content audit improve the perceived trust and quality of a domain? By removing low-quality content from the index (pruning) and improving some of the content remaining in the index, the likelihood that someone arrives on your site through organic search and has a poor user experience (indicated to Google in a variety of ways) is lowered. Thus, the quality of the domain improves. I’ve explained the concept here and here.
Others have since shared some likely theories of their own, including a larger focus on the redistribution of PR.
Case study after case study has shown the concept of “pruning” (removing low-quality content from search engine indexes) to be effective, especially on very large websites with hundreds of thousands (or even millions) of indexable URLs. So why do content audits work? Lots of reasons. But really...
Does it matter?
¯\_(ツ)_/¯
How to do a content audit
Just like anything in SEO, from technical and on-page changes to site migrations, things can go horribly wrong when content audits aren’t conducted properly. The most common example would be removing URLs that have external links because link metrics weren’t analyzed as part of the audit. Another common mistake is confusing removal from search engine indexes with removal from the website.
Content audits start with taking an inventory of all content available for indexation by search engines. This content is then analyzed against a variety of metrics and given one of three “Action” determinations. The “Details” of each Action are then expanded upon.
The combinations of the “Action” of WHAT to do and the “Details” of HOW (and sometimes why) to do it are as varied as the strategies, sites, and tactics themselves. Below are a few hypothetical examples:
You now have a basic overview of how to perform a content audit. More specific instructions can be found below.
The process can be roughly split into three distinct phases:
Inventory & audit
Analysis & recommendations
Summary & reporting
The inventory & audit phase
Taking an inventory of all content, and related metrics, begins with crawling the site.
One difference between crawling for content audits and technical audits:
Technical SEO audit crawls are concerned with all crawlable content (among other things).
Content audit crawls for the purpose of SEO are concerned with all indexable content.
The URL in the image below should be considered non-indexable. Even if it isn’t blocked in the robots.txt file, by a robots meta tag, or by an X-Robots-Tag response header, and even if it is frequently crawled by Google and shows up as a URL in Google Analytics and Search Console, the rel=“canonical” tag shown below essentially acts like a 301 redirect, telling Google not to display the non-canonical URL in search results and to apply all ranking calculations to the canonical version. In other words, not to “index” it.
I'm not sure “index” is the best word, though. To “display” or “return” in the SERPs is a better way of describing it, as Google surely records canonicalized URL variants somewhere, and advanced site: queries seem to show them in a way that is consistent with the "supplemental index" of yesteryear. But that's another post, more suitably written by a brighter mind like Bill Slawski.
A URL with a query string that canonicalizes to a version without the query string can be considered “not indexable.”
A content audit can safely ignore these types of situations, which could mean drastically reducing the amount of time and memory taken up by a crawl.
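As a rough illustration of that rule, the sketch below drops every URL whose canonical tag points somewhere else. It assumes the crawl export has been loaded as a list of rows with url and canonical fields (hypothetical names), and it assumes a page with no canonical tag can be treated as self-referencing, which may not hold on every site.

```python
def is_canonical_url(url, canonical):
    """True when there is no canonical tag or it points at the page itself."""
    if not canonical:
        # Assumption: a missing canonical tag is treated as self-referencing.
        return True
    return canonical.strip() == url.strip()


def drop_canonicalized(rows):
    """Keep only rows whose URL is its own canonical version."""
    return [
        row for row in rows
        if is_canonical_url(row["url"], row.get("canonical"))
    ]


if __name__ == "__main__":
    crawl = [
        {"url": "https://example.com/shoes",
         "canonical": "https://example.com/shoes"},
        {"url": "https://example.com/shoes?sort=price",
         "canonical": "https://example.com/shoes"},
        {"url": "https://example.com/about", "canonical": None},
    ]
    for row in drop_canonicalized(crawl):
        print(row["url"])
```

The comparison is a plain string match, so differences such as a trailing slash or http versus https count as a different URL. Whether that is the right call depends on how the site handles those variants.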
Technical SEO audits, on the other hand, should be concerned with every URL a crawler can find. Non-indexable URLs can reveal a lot of technical issues, from spider traps (e.g. never-ending empty pagination, infinite loops via redirect or canonical tag) to crawl budget optimization (e.g. How many facets/filters deep to allow crawling? 5? 6? 7?) and more.
It is for this reason that trying to combine a technical SEO audit with a content audit often turns into a giant mess, though an efficient idea in theory. When dealing with a lot of data, I find it easier to focus on one or the other: all crawlable URLs, or all indexable URLs.
Orphaned pages (i.e., with no internal links / navigation path) sometimes don’t turn up in technical SEO audits if the crawler had no way to find them. Content audits should discover any indexable content, whether it is linked to internally or not. Side note: A good tech audit would do this, too.
Identifying URLs that should be indexed but are not is something that typically happens during technical SEO audits.
However, if you're having trouble getting deep pages indexed when they should be, content audits may help determine how to optimize crawl budget and herd bots more efficiently into those important, deep pages. Also, many times Google chooses not to display/index a URL in the SERPs due to poor content quality (i.e., thin or duplicate).
All of this is changing rapidly, though. URLs as the unique identifier in Google’s index are probably going away. Yes, we’ll still have URLs, but not everything requires them. So far, the words “content” and “URL” have been mostly interchangeable. But some URLs contain an entire application’s worth of content. How to do a content audit in that world is something we’ll have to figure out soon, but only after Google figures out how to organize the web’s information in that same world. From the looks of things, we still have a year or two.
Until then, the process below should handle most situations.
Step 1: Crawl all indexable URLs
A good place to start on most websites is a full Screaming Frog crawl. However, some indexable content might be missed this way. It is not recommended that you rely on a crawler as the source for all indexable URLs.
In addition to the crawler, collect URLs from Google Analytics, Google Search Console (formerly Google Webmaster Tools), XML Sitemaps, and, if possible, from an internal database, such as an export of all product and category URLs on an eCommerce website. These can then be crawled in “list mode” separately, then added to your main list of URLs and deduplicated to produce a more comprehensive list of indexable URLs.
Some URLs found via GA, XML sitemaps, and other non-crawl sources may not actually be “indexable.” These should be excluded. One strategy that works here is to combine and deduplicate all of the URL “lists,” and then perform a crawl in list mode. Once crawled, remove all URLs with robots meta or X-Robots noindex tags, as well as any URL returning error codes and those that are blocked by the robots.txt file, etc. At this point, you can safely add these URLs to the file containing indexable URLs from the crawl. Once again, deduplicate the list.
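The merge-and-filter step described above can be sketched in Python as follows. The field names (status_code, robots, blocked_by_robots_txt) are assumptions about how a list-mode crawl export might be loaded, not Screaming Frog's actual column names. Treating anything other than a 200 response as non-indexable is also an assumption that goes slightly beyond "error codes", since it excludes redirects too.

```python
def build_indexable_list(crawl_urls, list_mode_results):
    """Merge crawler URLs with list-mode results, keeping only indexable ones.

    crawl_urls: URLs already known to be indexable from the main crawl.
    list_mode_results: one dict per URL gathered from GA, sitemaps, etc.,
        after crawling those URLs in list mode.
    """
    indexable = set(crawl_urls)  # a set deduplicates automatically
    for result in list_mode_results:
        if result["status_code"] != 200:
            continue  # errors and redirects
        if "noindex" in (result.get("robots") or "").lower():
            continue  # robots meta or X-Robots-Tag noindex
        if result.get("blocked_by_robots_txt"):
            continue
        indexable.add(result["url"])
    return sorted(indexable)


if __name__ == "__main__":
    crawl = ["https://example.com/", "https://example.com/a"]
    extra = [
        {"url": "https://example.com/a", "status_code": 200, "robots": ""},
        {"url": "https://example.com/b", "status_code": 200, "robots": "index,follow"},
        {"url": "https://example.com/c", "status_code": 404, "robots": ""},
        {"url": "https://example.com/d", "status_code": 200, "robots": "NOINDEX, follow"},
    ]
    print(build_indexable_list(crawl, extra))
```

URLs are compared as exact strings here. If the sources disagree on details such as trailing slashes or letter case, normalize them before merging.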
Crawling roadblocks & new technologies
Crawling very large websites
First and foremost, you do not need to crawl every URL on the site. Be concerned with indexable content. This is not a technical SEO audit.
Avoid crawling unnecessary URLs
Some of the things you can avoid crawling and adding to the content audit in many cases include:
Noindexed or robots.txt-blocked URLs
4XX and 5XX errors
Redirecting URLs and those that canonicalize to a different URL
Images, CSS, JavaScript, and SWF files
Segment the site into crawlable chunks
You can often get Screaming Frog to completely crawl a single directory at a time if the site is too large to crawl all at once.
Filter out URL patterns you plan to remove from the index
Let’s say you’re auditing a domain on WordPress and you notice early in the crawl that /tag/ pages are indexable. A quick site:domain.com inurl:tag search on Google tells you there are about 10 million of them. A quick look at Google Analytics confirms that URLs in the /tag/ directory are not responsible for very much revenue from organic search. It would be safe to say that the “Action” on these URLs should be “Remove” and the “Details” should read something like this: Remove /tag/ URLs from the index with a robots noindex,follow meta tag. More advice on this strategy can be found here.
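Once a pattern like this has been decided, the same Action and Details can be applied to every matching URL without crawling them all. A minimal sketch, assuming the decision is stored as a list of path-prefix rules (BULK_RULES and bulk_action_for are hypothetical names used for illustration):

```python
from urllib.parse import urlsplit

# Hypothetical rule list: (path prefix, Action, Details).
BULK_RULES = [
    ("/tag/", "Remove",
     "Remove /tag/ URLs from the index with a robots noindex,follow meta tag."),
]


def bulk_action_for(url, rules):
    """Return (Action, Details) for the first matching prefix, else None."""
    path = urlsplit(url).path
    for prefix, action, details in rules:
        if path.startswith(prefix):
            return action, details
    return None


if __name__ == "__main__":
    for url in ("https://example.com/tag/red-shoes/",
                "https://example.com/blog/tagging/"):
        print(url, "->", bulk_action_for(url, BULK_RULES))
```

The match is against the start of the path only, so a URL such as /blog/tagging/ is left for manual review instead of being swept up by the /tag/ rule.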
Upgrade your machine
Install additional RAM on your computer, which is used by Screaming Frog to hold data during the crawl. This has the added benefit of improving Excel performance, which can also be a major roadblock.
You can also install Screaming Frog on Amazon Web Services (AWS), as described in this post on iPullRank.
Tune up your tools
Screaming Frog provides several ways for SEOs to get more out of the crawler. This includes adjusting the speed, max threads, search depth, query strings, timeouts, retries, and the amount of RAM available to the program. Leave at least 3GB off limits to the spider to avoid catastrophic freezing of the entire machine and loss of data. You can learn more about tuning up Screaming Frog here and here.
Try other tools
I’m convinced that there's a ton of wasted bandwidth on most content audit projects due to strategists releasing a crawler and allowing it to chew through an entire domain, whether the URLs are indexable or not. People run Screaming Frog without saving the crawl intermittently, without adding more RAM availability, without filtering out the nonsense, or using any of the crawl customization features available to them.
That said, sometimes SF just doesn’t get the job done. We also have a process specific to DeepCrawl, and have used Botify, as well as other tools. They each have their pros and cons. I still prefer Screaming Frog for crawling and URL Profiler for fetching metrics in most cases.
Crawling dynamic mobile sites
This refers to a specific type of mobile setup in which there are two code-bases –– one for mobile and one for desktop –– but only one URL. Thus, the content of a single URL may vary significantly depending on which type of device is visiting that URL. In such cases, you will essentially be performing two separate content audits. Proceed as usual for the desktop version. Below are instructions for crawling the mobile version.
Crawling a dynamic mobile site for a content audit will require changing the User-Agent of the crawler, as shown here under Screaming Frog’s “Configure ---> HTTP Header” menu:
The important thing to remember when working on mobile dynamic websites is that you're only taking an inventory of indexable URLs on one version of the site or the other. Once the two inventories are taken, you can then compare them to uncover any unintentional issues.
Some examples of what this process can find in a technical SEO audit include situations in which titles, descriptions, canonical tags, robots meta, rel next/prev, and other important elements do not match between the two versions of the page. It's vital that the mobile and desktop version of each page have parity when it comes to these essentials.
It's easy for the mobile version of a historically desktop-first website to end up providing conflicting instructions to search engines because it's not often “automatically changed” when the desktop version changes. A good example here is a website I recently looked at with about 20 million URLs, all of which had the following title tag when loaded by a mobile user (including Google): BRAND NAME - MOBILE SITE. Imagine the consequences of that once a mobile-first algorithm truly rolls out.
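Comparing the two inventories by hand gets tedious quickly, so a small script can flag the mismatches. The sketch below assumes each crawl has been loaded into a dictionary keyed by URL, with one field per on-page element. The field names and the find_parity_issues helper are illustrative, not taken from any particular tool's export.

```python
# Elements that should match between the desktop and mobile versions.
PARITY_FIELDS = ("title", "description", "canonical", "robots")


def find_parity_issues(desktop, mobile, fields=PARITY_FIELDS):
    """List (url, field, desktop_value, mobile_value) for every mismatch.

    Only URLs present in both inventories are compared.
    """
    issues = []
    for url in sorted(desktop.keys() & mobile.keys()):
        for field in fields:
            desktop_value = desktop[url].get(field)
            mobile_value = mobile[url].get(field)
            if desktop_value != mobile_value:
                issues.append((url, field, desktop_value, mobile_value))
    return issues


if __name__ == "__main__":
    desktop = {"https://example.com/p": {"title": "Blue Widget | Brand"}}
    mobile = {"https://example.com/p": {"title": "BRAND NAME - MOBILE SITE"}}
    for issue in find_parity_issues(desktop, mobile):
        print(issue)
```

URLs that appear in only one of the two inventories are skipped here. Listing those separately is a useful second check.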
Crawling and rendering JavaScript
One of the many technical issues SEOs have been increasingly dealing with over the last couple of years is the proliferation of websites built on JavaScript frameworks and libraries like React.js, Ember.js, and Angular.js.
Most crawlers have made a lot of progress lately when it comes to crawling and rendering JavaScript content. Now, it’s as easy as changing a few settings, as shown below with Screaming Frog.
When crawling URLs with #! , use the “Old AJAX Crawling Scheme.” Otherwise, select “JavaScript” from the “Rendering” tab when configuring your Screaming Frog SEO Spider to crawl JavaScript websites.
How do you know if you’re dealing with a JavaScript website?
First of all, most websites these days are going to be using some sort of JavaScript technology, though more often than not (so far) these will be rendered by the “client” (i.e., by your browser). An example would be the .js file that controls the behavior of a form or interactive tool.
What we’re discussing here is when the JavaScript is used “server-side” and needs to be executed in order to render the page.
JavaScript libraries and frameworks are used to develop single-page web apps and highly interactive websites. Below are a few different things that should alert you to this challenge:
The URLs contain #! (hashbangs). For example: http://ift.tt/2nQK6ch (AJAX)
Content-rich pages with only a few lines of code (and no iframes) when viewing the source code.
What looks like server-side code in the meta tags instead of the actual content of the tag. For example:
You can also use the BuiltWith Technology Profiler or the Library Detector plugins for Chrome, which show the JavaScript libraries being used on a page in the address bar.
Not all websites built primarily with JavaScript require special attention to crawl settings. Some websites use pre-rendering services like Brombone or Prerender.io to serve the crawler a fully rendered version of the page. Others use isomorphic JavaScript to accomplish the same thing.
Step 2: Gather additional metrics
Most crawlers will give you the URL and various on-page metrics and data, such as the titles, descriptions, meta tags, and word count. In addition to these, you’ll want to know about internal and external links, traffic, content uniqueness, and much more in order to make fully informed recommendations during the analysis portion of the content audit project.
Your process may vary, but we generally try to pull in everything we need using as few sources as possible. URL Profiler is a great resource for this purpose, as it works well with Screaming Frog and integrates easily with all of the APIs we need.
Once the Screaming Frog scan is complete (only crawling indexable content) export the “Internal All” file, which can then be used as the seed list in URL Profiler (combined with any additional indexable URLs found outside of the crawl via GSC, GA, and elsewhere).
This is what my URL Profiler settings look like for a typical content audit of a small- or medium-sized site. Also, under “Accounts” I have connected via API keys to Moz and SEMrush.
Once URL Profiler is finished, you should end up with something like this:
Screaming Frog and URL Profiler: Between these two tools and the APIs they connect with, you may not need anything else at all in order to see the metrics below for every indexable URL on the domain.
The risk of getting analytics data from a third-party tool
We've noticed odd data mismatches and sampled data when using the method above on large, high-traffic websites. Our internal process involves exporting these reports directly from Google Analytics, sometimes incorporating Analytics Canvas to get the full, unsampled data from GA. Then VLookups are used in the spreadsheet to combine the data, with URL being the unique identifier.
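For readers who would rather script the lookup than build VLookups, the join can be sketched like this. It assumes both exports have been loaded as lists of rows and that both use the same form of URL as the key. GA reports often contain paths instead of full URLs, so some normalization may be needed first. The field names are hypothetical.

```python
def join_on_url(crawl_rows, ga_rows, ga_fields=("organic_sessions", "revenue")):
    """Attach GA metrics to each crawl row, using the URL as the key.

    URLs with no GA data get 0 for each metric, much like wrapping a
    VLOOKUP in IFERROR(..., 0).
    """
    ga_by_url = {row["url"]: row for row in ga_rows}
    combined = []
    for row in crawl_rows:
        ga_row = ga_by_url.get(row["url"], {})
        merged = dict(row)  # copy so the crawl data is left untouched
        for field in ga_fields:
            merged[field] = ga_row.get(field, 0)
        combined.append(merged)
    return combined


if __name__ == "__main__":
    crawl = [{"url": "https://example.com/a", "title": "A"},
             {"url": "https://example.com/b", "title": "B"}]
    ga = [{"url": "https://example.com/a", "organic_sessions": 120, "revenue": 40.0}]
    for row in join_on_url(crawl, ga):
        print(row)
```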
Metrics to pull for each URL:
Indexed or not?
If crawlers are set up properly, all URLs should be “indexable.”
A non-indexed URL is often a sign of an uncrawled or low-quality page.
Content uniqueness
Copyscape, Siteliner, and now URL Profiler can provide this data.
Traffic from organic search
Typically 90 days
Keep a consistent timeframe across all metrics.
Revenue and/or conversions
You could view this by “total,” or by segmenting to show only revenue from organic search on a per-page basis.
Publish date
If you can get this into Google Analytics as a custom dimension prior to fetching the GA data, it will help you discover stale content.
Internal links
Content audits provide the perfect opportunity to tighten up your internal linking strategy by ensuring the most important pages have the most internal links.
External links
These can come from Moz, SEMrush, and a variety of other tools, most of which integrate natively or via APIs with URL Profiler.
Landing pages resulting in low time-on-site
Take this one with a grain of salt. If visitors found what they want because the content was good, that’s not a bad metric. A better proxy for this would be scroll depth, but that would probably require setting up a scroll-tracking “event.”
Landing pages resulting in Low Pages-Per-Visit
Just like with Time-On-Site, sometimes visitors find what they’re looking for on a single page. This is often true for high-quality content.
Response code
Typically, only URLs that return a 200 (OK) response code are indexable. You may not require this metric in the final data if that's the case on your domain.
Canonical tag
Typically only URLs with a self-referencing rel=“canonical” tag should be considered “indexable.” You may not require this metric in the final data if that's the case on your domain.
Page speed and mobile-friendliness
Again, URL Profiler comes through with their Google PageSpeed Insights API integration.
Before you begin analyzing the data, be sure to drastically improve your mental health and the performance of your machine by taking the opportunity to get rid of any data you don’t need. Here are a few things you might consider deleting right away (after making a copy of the full data set, of course).
Things you don’t need when analyzing the data
URL Profiler and Screaming Frog tabs: Just keep the “combined data” tab and immediately cut the amount of data in the spreadsheet by about half.
Content Type: Filtering by Content Type (e.g., text/html, image, PDF, CSS, JavaScript) and removing any URL that is of no concern in your content audit is a good way to speed up the process.
Technically speaking, images can be indexable content. However, I prefer to deal with them separately for now.
Filtering unnecessary file types out like I've done in the screenshot above improves focus, but doesn’t improve performance very much. A better option would be to first select the file types you don’t want, apply the filter, delete the rows you don’t want, and then go back to the filter options and “(Select All).”
Once you have only the content types you want, it may now be possible to simply delete the entire Content Type column.
Status Code and Status: You only need one or the other. I prefer to keep the Code, and delete the Status column.
Length and Pixels: You only need one or the other. I prefer to keep the Pixels, and delete the Length column. This applies to all Title and Meta Description columns.
Meta Keywords: Delete the columns. If those cells have content, consider removing that tag from the site.
DNS Safe URL, Path, Domain, Root, and TLD: You should really only be working on a single top-level domain. Content audits for subdomains should probably be done separately. Thus, these columns can be deleted in most cases.
Duplicate Columns: You should have two columns for the URL (the “Address” in column A from URL Profiler, and the “URL” column from Screaming Frog). Similarly, there may also be two columns each for HTTP Status and Status Code. It depends on the settings selected in both tools, but there are sure to be some overlaps, which can be removed to reduce the file size, enhance focus, and speed up the process.
Blank Columns: Keep the filter tool active and go through each column. Those with only blank cells can be deleted. The example below shows that column BK (Robots HTTP Header) can be removed from the spreadsheet.
[You can save a lot of headspace by hiding or removing blank columns.]
Single-Value Columns: If the column contains only one value, it can usually be removed. The screenshot below shows our non-secure site does not have any HTTPS URLs, as expected. I can now remove the column. Also, I guess it’s probably time I get that HTTPS migration project scheduled.
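On a wide export, checking each column through the filter tool takes a while. The sketch below lists the blank and single-value columns in one pass so they can be reviewed before deleting. It assumes the combined data has been loaded as a list of dictionaries, one per URL. The column names and the removable_columns helper are made up for illustration.

```python
def removable_columns(rows, protected=("url",)):
    """Return columns that are entirely blank or hold one repeated value.

    The result is a list of candidates to review, not an instruction to
    delete. Columns named in `protected` are never suggested.
    """
    if not rows:
        return []
    candidates = []
    for column in rows[0]:
        if column in protected:
            continue
        distinct = set()
        for row in rows:
            value = row.get(column)
            # Treat None and whitespace-only cells as the same blank value.
            distinct.add("" if value is None else str(value).strip())
        if len(distinct) <= 1:
            candidates.append(column)
    return candidates


if __name__ == "__main__":
    data = [
        {"url": "/a", "robots_http_header": "", "https": "FALSE", "title": "A"},
        {"url": "/b", "robots_http_header": None, "https": "FALSE", "title": "B"},
    ]
    print(removable_columns(data))
```

A single-value column is only "usually" safe to drop, so it is worth a glance at each suggestion. A column that reads the same on every row can still be telling you something, as the HTTPS example above shows.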
Hopefully by now you've made a significant dent in reducing the overall size of the file and time it takes to apply formatting and formula changes to the spreadsheet. It’s time to start diving into the data.
The analysis & recommendations phase
Here's where the fun really begins. In a large organization, it's tempting to have a junior SEO do all of the data-gathering up to this point. I find it useful to perform the crawl myself, as the process can be highly informative.
Step 3: Put it all into a dashboard
Even after removing unnecessary data, performance could still be a major issue, especially if working in Google Sheets. I prefer to do all of this in Excel, and only upload into Google Sheets once it's ready for the client. If Excel is running slow, consider splitting up the URLs by directory or some other factor in order to work with multiple, smaller spreadsheets.
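Splitting by directory can be done before the data ever reaches Excel. A minimal sketch, assuming the top-level path segment is a sensible way to divide the site (the "(root)" bucket name and the split_by_directory helper are arbitrary choices for this example):

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_directory(urls):
    """Group URLs by their first path segment.

    Pages that sit directly under the domain root go into "(root)".
    """
    groups = defaultdict(list)
    for url in urls:
        path = urlsplit(url).path
        segments = [segment for segment in path.split("/") if segment]
        # "/blog/post" and "/blog/" both belong to "blog"; "/about" is a root page.
        if len(segments) > 1 or (segments and path.endswith("/")):
            key = segments[0]
        else:
            key = "(root)"
        groups[key].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://example.com/",
        "https://example.com/about",
        "https://example.com/blog/",
        "https://example.com/blog/post-1",
        "https://example.com/products/widgets/blue",
    ]
    for directory, members in split_by_directory(sample).items():
        print(directory, len(members))
```

Each group can then be written to its own spreadsheet. Keep the URL column in every file so the pieces can be recombined for the final report.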
Creating a dashboard can be as easy as adding two columns to the spreadsheet. The first new column, “Action,” should be limited to three options, as shown below. This makes filtering and sorting data much easier. The “Details” column can contain freeform text to provide more detailed instructions for implementation.
Use Data Validation and a drop-down selector to limit Action options.
Step 4: Work the content audit dashboard
All of the data you need should now be right in front of you. This step can’t be turned into a repeatable process for every content audit. From here on the actual step-by-step process becomes much more open to interpretation and your own experience. You may do some of them and not others. You may do them a little differently. That's all fine, as long as you're working toward the goal of determining what to do, if anything, for each piece of content on the website.
A good place to start would be to look for any content-related issues that might cause an algorithmic filter or manual penalty to be applied, thereby dragging down your rankings.
Causes of content-related penalties
These typically fall under three major categories: quality, duplication, and relevancy. Each category can be further broken down into a variety of issues, which are detailed below.
Typical low-quality content
Poor grammar, written primarily for search engines (includes keyword stuffing), unhelpful, inaccurate...
Completely irrelevant content
OK in small amounts, but often entire blogs are full of it.
A typical example would be a "linkbait" piece circa 2010.
Thin/short content
Glossed over the topic, too few words, or all image-based content.
Curated content with no added value
Composed almost entirely of bits and pieces of content that exist elsewhere.
Misleading optimization
Titles or keywords targeting queries that the content doesn't answer, or for which it doesn't deserve to rank.
Generally not providing the information the visitor was expecting to find.
Duplicate content
Internally duplicated on other pages (e.g., categories, product variants, archives, technical issues, etc.).
Externally duplicated (e.g., manufacturer product descriptions, product descriptions duplicated in feeds used for other channels like Amazon, shopping comparison sites and eBay, plagiarized content, etc.)
Stub pages (e.g., "No content is here yet, but if you sign in and leave some user-generated-content, then we'll have content here for the next guy." By the way, want our newsletter? Click an AD!)
Indexable internal search results
Too many indexable blog tag or blog category pages
And so on and so forth...
It helps to sort the data in various ways to see what’s going on. Below are a few different things to look for if you’re having trouble getting started.
Sort by duplicate content risk
URL Profiler now has a native duplicate content checker. Other options are Copyscape (for external duplicate content) and Siteliner (for internal duplicate content).
Which of these pages should be rewritten?
Rewrite key/important pages, such as categories, home page, top products
Rewrite pages with good link and social metrics
Rewrite pages with good traffic
After selecting "Improve" in the Action column, elaborate in the Details column:
"Improve these pages by writing unique, useful content to improve the Copyscape risk score."
Which of these pages should be removed/pruned?
Remove guest posts that were published elsewhere
Remove anything the client plagiarized
Remove content that isn't worth rewriting, such as:
No external links, no social shares, and very few or no entrances/visits
After selecting "Remove" from the Action column, elaborate in the Details column:
"Prune from site to remove duplicate content. This URL has no links or shares and very little traffic. We recommend allowing the URL to return 404 or 410 response code. Remove all internal links, including from the sitemap."
Which of these pages should be consolidated into others?
Presumably none, since the content is already externally duplicated.
Which of these pages should be left “As-Is”?
Important pages which have had their content stolen
Sort by entrances or visits (filtering out any that were already finished)
Which of these pages should be marked as "Improve"?
Pages with high visits/entrances but low conversion, time-on-site, pageviews per session, etc.
Key pages that require improvement determined after a manual review of the page.
Which of these pages should be marked as "Consolidate"?
When you have overlapping topics that don't provide much unique value of their own, but could make a great resource when combined.
Mark the page in the set with the best metrics as "Improve" and in the Details column, outline which pages are going to be consolidated into it. This is the canonical page.
Mark the pages that are to be consolidated into the canonical page as "Consolidate" and provide further instructions in the Details column, such as:
Use portions of this content to round out /canonicalpage/ and then 301 redirect this page into /canonicalpage/
Update all internal links.
Campaign-based or seasonal pages that could be consolidated into a single "Evergreen" landing page (e.g., Best Sellers of 2012 and Best Sellers of 2013 ---> Best Sellers).
Which of these pages should be marked as "Remove"?
Pages with poor link, traffic, and social metrics related to low-quality content that isn't worth updating
Typically these will be allowed to 404/410.
Irrelevant content
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Out-of-date content that isn't worth updating or consolidating
The strategy will depend on link equity and traffic as to whether it gets redirected or simply removed.
Which of these pages should be marked as "Leave As-Is"?
Pages with good traffic, conversions, time on site, etc. that also have good content.
These may or may not have any decent external links.
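The duplicate-content questions above can be turned into a rough first pass that pre-fills the Action column before manual review. This is a sketch, not a replacement for judgment: the field names and the min_visits threshold are assumptions, and the Details text is adapted from the examples above.

```python
def suggest_for_duplicate(page, min_visits=10):
    """Suggest an (Action, Details) pair for a page flagged as externally duplicated.

    page is a dict with hypothetical keys: is_key_page, external_links,
    social_shares, and visits.
    """
    worth_rewriting = (
        page["is_key_page"]
        or page["external_links"] > 0
        or page["social_shares"] > 0
        or page["visits"] >= min_visits
    )
    if worth_rewriting:
        return ("Improve", "Rewrite with unique, useful content.")
    return (
        "Remove",
        "Prune from site to remove duplicate content. Allow the URL to return "
        "404 or 410 and remove all internal links, including from the sitemap.",
    )


orphan = {"is_key_page": False, "external_links": 0, "social_shares": 0, "visits": 2}
category = {"is_key_page": True, "external_links": 0, "social_shares": 0, "visits": 0}
orphan_action = suggest_for_duplicate(orphan)[0]
category_action = suggest_for_duplicate(category)[0]
```

Treat the output as a starting point only. For example, an important page whose content was stolen by someone else should be left as-is, and a script like this cannot tell who copied whom.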
Taking the hatchet to bloated websites
For big sites, it's best to use a hatchet-based approach as much as possible, and finish up with a scalpel in the end. Otherwise, you'll spend way too much time on the project, which eats into the ROI.
This is not a process that can be documented step-by-step. For the purpose of illustration, however, below are a few different examples of hatchet approaches and when to consider using them.
Parameter-based URLs that shouldn't be indexed
Defer to the technical audit, if applicable. Otherwise, use your best judgment:
e.g., /?sort=color, &size=small
Assuming the tech audit didn't suggest otherwise, these pages could all be handled in one fell swoop. Below is an example Action and example Details for such a page:
Action = Remove
Details = Rel canonical to the base page without the parameter
Internal search results
Defer to the technical audit if applicable. Otherwise, use your best judgment:
e.g., /search/keyword-phrase/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /search/ in the robots.txt file.
Blog tag pages
Defer to the technical audit if applicable. Otherwise:
e.g., /blog/tag/green-widgets/, /blog/tag/blue-widgets/
Assuming the tech audit didn't suggest otherwise:
Action = Remove
Details = Apply a noindex meta tag. Once they are removed from the index, disallow /blog/tag/ in the robots.txt file.
E-commerce product pages with manufacturer descriptions
In cases where the "Page Type" is known (i.e., it's in the URL or was provided in a CMS export) and Risk Score indicates duplication:
e.g., /product/product-name/
Assuming the tech audit didn't suggest otherwise:
Action = Improve
Details = Rewrite to improve product description and avoid duplicate content
E-commerce category pages with no static content
In cases where the "Page Type" is known:
e.g., /category/category-name/ or /category/cat1/cat2/
Assuming NONE of the category pages have content:
Action = Improve
Details = Write 2–3 sentences of unique, useful content that explains choices, next steps, or benefits to the visitor looking to choose a product from the category.
Out-of-date blog posts, articles, and other landing pages
In cases where the title tag includes a date, or...
In cases where the URL indicates the publishing date:
Action = Improve
Details = Update the post to make it more current, if applicable. Otherwise, change Action to "Remove" and customize the Details based on links and traffic (i.e., 301 or 404).
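A few of the hatchet approaches above can be sketched as URL-pattern rules that fill in Action and Details for whole sections at once. The patterns below are assumptions based on the example URLs in this section and will differ from site to site. As noted above, defer to the technical audit where one exists.

```python
import re

# (pattern, Action, Details) tuples; the first matching rule wins.
HATCHET_RULES = [
    (
        re.compile(r"[?&](sort|size)="),
        "Remove",
        "Rel canonical to the base page without the parameter",
    ),
    (
        re.compile(r"^/search/"),
        "Remove",
        "Apply a noindex meta tag. Once removed from the index, "
        "disallow /search/ in the robots.txt file.",
    ),
    (
        re.compile(r"^/blog/tag/"),
        "Remove",
        "Apply a noindex meta tag.",
    ),
]


def apply_hatchet(path):
    """Return (Action, Details) for the first matching rule, or None if none match."""
    for pattern, action, details in HATCHET_RULES:
        if pattern.search(path):
            return action, details
    return None
```

URLs that return None are the ones left over for the scalpel.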
Content marked for improvement should lay out more specific instructions in the “Details” column, such as:
Update the old content to make it more relevant
Add more useful content to “beef up” this thin page
Incorporate content from overlapping URLs/pages
Rewrite to avoid internal duplication
Rewrite to avoid external duplication
Reduce image sizes to speed up page load
Create a “responsive” template for this page to fit on mobile devices
Etc.
Content marked for removal should include specific instructions in the “Details” column, such as:
Consolidate this content into the following URL/page marked as “Improve”
Then redirect the URL
Remove this page from the site and allow the URL to return a 410 or 404 HTTP status code. This content has had zero visits within the last 360 days, and has no external links. Then remove or update internal links to this page.
Remove this page from the site and 301 redirect the URL to the following URL marked as “Improve”... Do not incorporate the content into the new page. It is low-quality.
Remove this archive page from search engine indexes with a robots noindex meta tag. Continue to allow the page to be accessed by visitors and crawled by search engines.
Remove this internal search result page from search engine indexes with a robots noindex meta tag. Once removed from the index (about 15–30 days later), add the following line to the #BlockedDirectories section of the robots.txt file: Disallow: /search/.
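Before handing a robots.txt change to a developer, it can help to confirm that the new rule blocks only what you intend. The sketch below uses Python's standard urllib.robotparser against a made-up robots.txt; the example.com URLs and the file contents are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt after the Disallow line has been added.
robots_txt = """\
User-agent: *
# BlockedDirectories
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Internal search results should now be blocked from crawling...
search_blocked = not parser.can_fetch("*", "https://example.com/search/keyword-phrase/")
# ...while regular content stays crawlable.
blog_allowed = parser.can_fetch("*", "https://example.com/blog/some-post/")
```

The order of operations matters here. Once a directory is disallowed, crawlers can no longer fetch its pages to see the noindex tag, which is why the Disallow line is added only after the pages have dropped out of the index.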
As you can see from the many examples above, sorting by “Page Type” can be quite handy when applying the same Action and Details to an entire section of the website.
After all of the tool set-up, data gathering, data cleanup, and analysis across dozens of metrics, what matters in the end is the Action to take and the Details that go with it.
URL, Action, and Details: These three columns will be used by someone to implement your recommendations. Be clear and concise in your instructions, and don’t make decisions without reviewing all of the wonderful data-points you’ve collected.
Here is a sample content audit spreadsheet to use as a template, or for ideas. It includes a few extra tabs specific to the way we used to do content audits at Inflow.
WARNING!
As Razvan Gavrilas pointed out in his post on Cognitive SEO from 2015, without doing the research above you risk pruning valuable content from search engine indexes. Be bold, but make highly informed decisions:
Content audits allow SEOs to make informed decisions on which content to keep indexed “as-is,” which content to improve, and which to remove.
The reporting phase
The content audit dashboard is exactly what we need internally: a spreadsheet crammed with data that can be sliced and diced in so many useful ways that we can always go back to it for more insight and ideas. Some clients appreciate that as well, but most are going to find the greater benefit in our final content audit report, which includes a high-level overview of our recommendations.
Counting actions from Column B
It is useful to count the quantity of each Action along with total organic search traffic and/or revenue for each URL. This will help you (and the client) identify important metrics, such as total organic traffic for pages marked to be pruned. It will also make the final report much easier to build.
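The same tally can be scripted if the dashboard is exported to CSV. A minimal sketch, assuming each row has an "Action" column and a numeric "organic_visits" column (both names are placeholders for whatever your export uses):

```python
from collections import defaultdict


def summarize_actions(rows):
    """Count pages and total organic visits for each Action."""
    summary = defaultdict(lambda: {"pages": 0, "organic_visits": 0})
    for row in rows:
        bucket = summary[row["Action"]]
        bucket["pages"] += 1
        # Values read from a CSV arrive as strings, so convert before adding.
        bucket["organic_visits"] += int(row["organic_visits"])
    return dict(summary)


dashboard = [
    {"URL": "/old-post/", "Action": "Remove", "organic_visits": "5"},
    {"URL": "/stub/", "Action": "Remove", "organic_visits": "10"},
    {"URL": "/product/widget/", "Action": "Improve", "organic_visits": "400"},
]
totals = summarize_actions(dashboard)
```

The total for "Remove" is the number to check before pruning: it shows how much organic traffic is at stake if every one of those pages goes away.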
Step 5: Writing up the report
Your written report of analysis and recommendations should be delivered at the same time as the audit dashboard. It summarizes the findings, recommendations, and next steps from the audit, and should start with an executive summary.
Here is a real example of an executive summary from one of Inflow's content audit strategies:
As a result of our comprehensive content audit, we are recommending the following, which will be covered in more detail below:
Removal of about 624 pages from Google index by deletion or consolidation:
203 Pages were marked for Removal with a 404 error (no redirect needed)
110 Pages were marked for Removal with a 301 redirect to another page
311 Pages were marked for Consolidation of content into other pages
Followed by a redirect to the page into which they were consolidated
Rewriting or improving of 668 pages
605 Product Pages are to be rewritten due to use of manufacturer product descriptions (duplicate content), these being prioritized from first to last within the Content Audit.
63 "Other" pages to be rewritten due to low-quality or duplicate content.
Keeping 226 pages as-is
No rewriting or improvements needed
These changes reflect an immediate need to "improve or remove" content in order to avoid an obvious content-based penalty from Google (e.g. Panda) due to thin, low-quality and duplicate content, especially concerning Representative and Dealers pages with some added risk from Style pages.
The content strategy should end with recommended next steps, including action items for the consultant and the client. Below is a real example from one of our documents.
We recommend the following three projects in order of their urgency and/or potential ROI for the site:
Project 1: Remove or consolidate all pages marked as “Remove”. Detailed instructions for each URL can be found in the "Details" column of the Content Audit Dashboard.
Project 2: Copywriting to improve/rewrite content on Style pages. Ensure unique, robust content and proper keyword targeting.
Project 3: Improve/rewrite all remaining pages marked as “Improve” in the Content Audit Dashboard. Detailed instructions for each URL can be found in the "Details" column.
Content audit resources & further reading
Understanding Mobile-First Indexing and the Long-Term Impact on SEO by Cindy Krum. This thought-provoking post raises the question: How will we perform content inventories without URLs? It helps to know Google is dealing with the exact same problem on a much, much larger scale.
Here is a spreadsheet template to help you calculate revenue and traffic changes before and after updating content.
Expanding the Horizons of eCommerce Content Strategy by Dan Kern of Inflow. An epic post about content strategies for eCommerce businesses, which includes several good examples of content on different types of pages targeted toward various stages in the buying cycle.
The Content Inventory is Your Friend by Kristina Halvorson on BrainTraffic. Praise for the life-changing powers of a good content audit inventory.
Everything You Need to Perform Content Audits