As part of our in-house work, each quarter we create a whitepaper exploring different ideas and different sectors with which we are familiar. In the past, we have covered our expectations for the Future of Content, and content marketing in the Travel sector.
With an upcoming piece looking into the advantages of creating content both on- and off-brand in the finance sector, now seems like a good time to provide some insight into how we tackle the large data sets we use, including tools, tips, and tricks for making your life easy.
Building the Data Set
Part of the process of analysing a large data set is, of course, building a large data set. For the travel whitepaper, we did this exclusively using content sharing websites like Visual.ly and Daily Infographic, which we first crawled to get their content webpages, and then scraped to find content titles and original URLs.
For our upcoming finance whitepaper, this was not our complete approach. In order to focus specifically on the larger names from the financial sector, we preselected a list of companies for whom we wanted to find content, and using the Ahrefs API, we developed a system which sourced the top content produced on each company's site (ranked by SEO metrics). We then supplemented this with the content sharing site approach to bring it up to our total of over 2300 pieces of content.
Tool Tip – When conducting scrapes and crawls, we use two main tools. Screaming Frog is an industry standard thanks to its versatility and relative ease of use, but we also use OutWit Hub, as it’s much easier to use where webpages have clear titling or subtitling before information you want to extract, and doesn’t rely on pathing information in the same way that basic Screaming Frog crawls do.
The main issue with both our top content and content sharing site scrape methods is the large proportion of junk URLs we accumulate in the process. Because automated systems have no real way of differentiating between dead URLs, redirecting URLs, home or hub page URLs, and actual live URLs, we have to go through a process of semi-manual data cleaning to ensure our data set is of sufficiently high quality to analyse.
Screaming Frog for 404s and 301s
The best way to tackle this is to start with the least manual methods first and work towards the most manual ones. For example, we track down 404 errors and 301 redirects with Screaming Frog, though the redirects have to be run through several times, as each pass only reveals the next step in the redirect chain – and it's easy to fall into a loop.
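The clean-up step above can be sketched in a few lines of Python. This is a hypothetical illustration, not our actual tooling – the URLs and status codes are invented, and in practice Screaming Frog does this classification for us:

```python
# Hypothetical sketch: partitioning crawled URLs by HTTP status code,
# mirroring the 404/301 clean-up step described above.

def partition_by_status(results):
    """Split (url, status_code) pairs into live, redirect and dead buckets."""
    buckets = {"live": [], "redirect": [], "dead": []}
    for url, status in results:
        if status == 200:
            buckets["live"].append(url)
        elif status in (301, 302, 307, 308):
            buckets["redirect"].append(url)
        else:  # 404s, 410s, server errors, etc.
            buckets["dead"].append(url)
    return buckets

crawl = [
    ("https://example.com/guide", 200),
    ("https://example.com/old-guide", 301),
    ("https://example.com/missing", 404),
]
print(partition_by_status(crawl))
```

The redirect bucket is the one that needs re-running: each of those URLs has to be fetched again to find the next hop in its chain.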
Common sense is required to make data cleaning work! Where there’s no easy answer, it’s a lot easier just to delete a small handful of URLs than to scrounge for a few extra ones.
Once we've ensured that all of our URLs are live and are final destinations, we need to make sure that they're not just homepages or hub pages. This is, regrettably, only readily doable with a bit of manual labour. But there are ways to save some time and effort on picking up the obvious ones.
Using Excel or Google Sheets formulae is the best way to sort data. A simple =LEN(A1) formula gives you the number of characters in cell A1. Do this for your entire URL list and sort from low to high, and you’re likely to get most of the homepages at or near the top of your list of data thanks to their shorter URL structures. This means you can delete them in large blocks rather than hunting for them one at a time.
In similar fashion, most pieces of content will be in a series of subfolders – meaning you can pick up quite a few of them by setting a formula to count the number of forward slashes in each URL. A
=LEN(A1)-LEN(SUBSTITUTE(A1,"/",""))
formula will give you the number in cell A1; sort from low to high again to give yourself another set of blocks ready for deletion.
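Both sorting heuristics above translate directly into Python. A minimal sketch, using invented example URLs – sorting by slash count and then length pushes the likely homepages and hub pages to the top of the list:

```python
# Spreadsheet-equivalent heuristics for flagging likely homepages.

def slash_count(url):
    """Python equivalent of =LEN(A1)-LEN(SUBSTITUTE(A1,"/","")).

    Counts every forward slash, including the two in the protocol,
    exactly as the spreadsheet formula does."""
    return url.count("/")

urls = [
    "https://example.com/blog/2019/guide-to-isas",
    "https://example.com/",
    "https://example.com/blog/",
]

# Sort by (slash count, character length): the shallowest, shortest
# URLs rise to the top, ready for deletion in blocks.
for url in sorted(urls, key=lambda u: (slash_count(u), len(u))):
    print(url)
```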
This final helpful method can’t be completed until after the next step of gathering numerical data about the URLs – though it can be performed regardless of what data is gathered.
Simply sorting by highest to lowest on data scores will often bring homepages and hub pages to the top of the list – as these are likely to have the highest SEO-relevant scores due to numerous links pointing towards them.
After the above steps, the majority of URLs should be relevant and valid. Some will, however, still have to be cleaned manually by eye.
In our finance whitepaper, a large part of the focus is given over to identifying the value of different keyword categories, such as Travel or Gender. In order to identify the different categories available, we selected numerous keywords that we could see appearing in a handful of the content titles acquired through scraping. We then generated as many synonyms and related words as possible for each of the keywords found.
For each keyword generated in this way, we used a COUNTIF formula to identify how many times it occurred. We also grouped similar words (such as “Holiday” and “Journey”) and assigned them to a category (in this case, “Travel”) to show which categories came up most frequently.
Where relevant, we needed to include spaces around words to avoid counting them where they appear inside other words – for example, “men” appears within “government”, and therefore needed to be tested as “ men”, “men “ and “ men “, with the results compared manually to get a clear count.
As well as keyword categories, there were some other sections we wanted to include for analysis in both the travel and finance whitepapers. Numbers, however, proved an issue.
While with other categories we used longer and more specific keywords to keep the instances where a keyword would show up twice in a title to a minimum (e.g. a title is unlikely to include “Life Insurance” or “Instagram” twice – and we accepted a small degree of error in these cases as it only affected the frequency testing), digits frequently show up more than once in a number, such as 11 or 22. And even in cases where the digits are different – such as 2019 – basic formulae would count each digit separately.
In order to eliminate this problem, we used a different formula to count how many cells didn’t contain numbers:
=2310-COUNTIFS('3A – Combined Sheet'!B2:B2311,"<>*0*",'3A – Combined Sheet'!B2:B2311,"<>*1*",'3A – Combined Sheet'!B2:B2311,"<>*2*",'3A – Combined Sheet'!B2:B2311,"<>*3*",'3A – Combined Sheet'!B2:B2311,"<>*4*",'3A – Combined Sheet'!B2:B2311,"<>*5*",'3A – Combined Sheet'!B2:B2311,"<>*6*",'3A – Combined Sheet'!B2:B2311,"<>*7*",'3A – Combined Sheet'!B2:B2311,"<>*8*",'3A – Combined Sheet'!B2:B2311,"<>*9*")
– where 2310 is the total number of pieces of content in our study. It’s a bit of a clunky formula to build, but as one of Murphy’s Laws of War states, if it’s stupid and it works, it ain’t stupid.
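For comparison, the same count – titles containing no digits at all – is a one-liner in Python. The sample titles here are invented:

```python
# Counting titles that contain no digits, the equivalent of the
# long COUNTIFS chain above.

titles = [
    "10 Ways to Save in 2019",
    "A Beginner's Guide to Pensions",
    "Life Insurance Explained",
]

no_digits = sum(1 for t in titles if not any(ch.isdigit() for ch in t))
print(no_digits)  # prints 2
```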
Using the Ahrefs API in conjunction with some other supporting tools including SharedCount, we gathered the “URL Rating” (a logarithmic Ahrefs analysis of the backlink profile scored between 1-100), number of referring domains, and number of social shares for each link. These scores are the basis of the vast majority of our statistical analysis, in one form or another.
The primary way that we analyse these scores in our data-heavy whitepapers is to use a “reliability” score. This is, in fact, the median of the results included. The change in name is to avoid confusion – while an average score is more commonly understood, the median provides more accurate information in terms of likely results for any given piece, as it discounts outlying scores.
The issue with this is that while Excel offers an “AVERAGEIF” formula, which would allow us to find the average of results containing individual words, it doesn’t offer a “MEDIANIF” formula. Accordingly, we have had to work out a way to cheat around it with a manual formula – which is easier said than done.
For each of the three metric columns – URL Rating, Referring Domains, and Social Shares – we create a second column to contain a formula identifying whether the cell contains the word we want to find a median for. Each of these columns contains a formula along the lines of:
=IF(ISNUMBER(SEARCH(L$3,$B2)),D2,"Nah")
– where “L$3” refers to a reference cell (where the word to test is written), “$B2” refers to the content title to test, and “D2” refers to the column with the results – in this case, the URL Rating. “Nah” is just our informal way of flagging that the cell doesn’t contain the word we want to test.
The output of this formula is one of two things – either our “Nah” tester, or the numerical result for that title, if it contains the word we want. Then we can simply do a =MEDIAN(RANGE) formula for that column. For convenience’s sake, we also did the average scores with this method as well.
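The whole “MEDIANIF” workaround collapses to a filter-then-median in Python. A minimal sketch, with invented (title, URL Rating) pairs – the list comprehension plays the role of the helper “Nah” column:

```python
# A "MEDIANIF" equivalent: the median of a metric, taken only over
# titles containing a given word.
from statistics import median

rows = [
    ("Guide to Travel Insurance", 34),
    ("Pension Myths Debunked", 18),
    ("Cheap Travel Hacks", 22),
]

def median_if(rows, word):
    """Median of the scores whose title contains `word` (case-insensitive)."""
    scores = [score for title, score in rows if word.lower() in title.lower()]
    return median(scores) if scores else None

print(median_if(rows, "travel"))  # prints 28.0 (median of 34 and 22)
```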
Keyword Category and Sector Analysis
Identifying the numerical scores for keyword categories is slightly more complicated; each category is first formed into a table containing each of the relevant keywords and named after the category. A second table containing all the content pieces is then created. This second table also includes columns for each of the keyword categories, to test whether the title contains a keyword.
In each cell for each column, we use a formula along the lines of:
=SUMPRODUCT(--ISNUMBER(SEARCH(Table2[Business],[@Title])))
to identify if the title has more than 0 keywords. “[@Title]” refers to the column with the titles in, and “Table2[Business]” refers to the keyword category. The output of this is 0 or a higher number, depending on whether it has keywords in or not. Any higher numbers are actually irrelevant – we only test for whether the result is 0 or not.
We then have a third table, in which each column tests whether there is a 0 in the corresponding column of the previous table. The formula we use is along the lines of:
=IF([@Business]=0,"Neh",[@[URL Rating]])
– the output being either “Neh” (again, our rather amused way of keeping focused, as a page full of “Neh” is both a little funny and good to copy as a reaction image for Slack) or the metric required – in the case of the above formula, the URL Rating.
Finally, as with individual keyword results, we simply create AVERAGE and MEDIAN formulae for each column – giving us our final results for each category.
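The three-table pipeline above can be condensed into a single loop. A sketch under invented data – the categories, keywords and scores are examples, not figures from the whitepaper – where a title counts towards a category if any of that category's keywords appears in it:

```python
# Per-category medians: the Python equivalent of the keyword-category
# tables and their AVERAGE/MEDIAN columns.
from statistics import median

categories = {
    "Travel": ["holiday", "journey", "travel"],
    "Business": ["startup", "investment"],
}

rows = [  # (title, URL Rating)
    ("Planning Your Dream Holiday", 30),
    ("Startup Investment Trends", 40),
    ("The Longest Journey on Earth", 20),
]

results = {}
for cat, keywords in categories.items():
    # Collect scores for every title containing any keyword of the category.
    scores = [score for title, score in rows
              if any(k in title.lower() for k in keywords)]
    results[cat] = median(scores) if scores else None

print(results)  # {'Travel': 25.0, 'Business': 40}
```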
For Your Consideration
While the above is a simple guide to the tools and formulae we’ve used in the past to create our whitepapers, we are constantly improving our methods and developing new approaches using new technology – as well as applying common sense to extract different points of analysis depending on the specific subject of the discussion.
Although these methods are uniform and increasingly automated, the only way to achieve true analysis with current technology is to use the expert knowledge of your team to point out relevant points for comparison and breakdown. These tips are, therefore, intended as a basis for inspiring you and your co-workers – you’ll still need to put some brainpower in to get the most out of them!