SEO data analysis


This blog post and the accompanying Jupyter notebook aim to answer the following questions:

1) Are the different collected SEO data sets correlated?

2) How many days of web server logs are needed to calculate certain SEO metrics?

3) What trend and seasonality does each collected SEO data set show?

The Jupyter notebook is available at SEO Data Analysis Notebook

Collected SEO data

crawl.csv: Number of unique URLs crawled by Googlebot with a 200 HTTP status code, per day, between 2016 and 2018

google_analytics_data.csv: Google Analytics data between 2016 and 2018

links.csv: A CSV file downloaded from Google Search Console, listing the discovered external links and their first discovery dates between 2016 and 2018

SEO data analysis

First, import the necessary Python libraries

import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose

Read the SEO data files into pandas dataframes

crawldata_colnames = ['date', 'crawled_pages']
linkdata_colnames = ['links', 'date']
# crawl.csv is split on whitespace; regex separators are handled by the python engine
cd = pd.read_csv("crawl.csv", sep=r'\s', engine='python', parse_dates=['date'], index_col='date', usecols=[0, 1], names=crawldata_colnames, skiprows=1, header=None)
gad = pd.read_csv("google_analytics_data.csv", parse_dates=['ga:date'], index_col='ga:date')
ld = pd.read_csv("links.csv", parse_dates=['date'], index_col='date', names=linkdata_colnames, skiprows=1, header=None)

Count number of earned links per day

ld = ld.groupby(['date']).count()['links']

Select only organic search traffic from Google and count the number of active pages per day

pa = gad.loc[gad['ga:sourceMedium'] == 'google / organic'].groupby(['ga:date']).count()['ga:pagePath']

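Rename the columns of the pa dataframe and index it by date

pa = pa.reset_index()
pa.columns = ['date', 'active_pages']
# index by date again so that pd.concat(axis=1) aligns rows on the date
pa = pa.set_index('date')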

Concatenate the three data sources into one dataframe

df = pd.concat([cd, pa, ld], axis=1)

Fill empty values with 0

df = df.fillna(0)
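
The concatenation step relies on pandas aligning rows by index, so each source has to be indexed by date. The sketch below shows that behavior on three tiny series. The values and the variable names `crawled`, `active` and `combined` are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical daily series; the values are made up for illustration.
crawled = pd.Series(
    [120, 95, 130],
    index=pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"]),
    name="crawled_pages",
)
active = pd.Series(
    [40, 55],
    index=pd.to_datetime(["2016-01-01", "2016-01-03"]),
    name="active_pages",
)
links = pd.Series(
    [3],
    index=pd.to_datetime(["2016-01-02"]),
    name="links",
)

# axis=1 aligns on the index: the result has one row per date seen in any
# input, with NaN where a source has no value for that date.
combined = pd.concat([crawled, active, links], axis=1)

# Treat "no value recorded on that day" as zero, as the article does.
combined = combined.fillna(0)
print(combined)
```

Filling with 0 treats a missing day as "nothing happened", which is reasonable for counts such as links earned per day but would distort a metric where a gap means "not measured".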

Resample daily data to weekly and check the correlations

dfw = df.resample('W').sum()
dfw.corr()

               crawled_pages  active_pages     links
crawled_pages        1.00000      0.119270  0.472280
active_pages         0.11927      1.000000  0.162117
links                0.47228      0.162117  1.000000
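
Each cell in a table like this is a Pearson correlation coefficient, which is what pandas computes by default for `DataFrame.corr()`. As a rough illustration of what the number means, here is a minimal pure-Python version. The function name `pearson` and the sample values are assumptions for this sketch, not data from the article.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant series has zero variance and raises ZeroDivisionError here.
    return cov / sqrt(var_x * var_y)


# Made-up weekly totals, for illustration only.
crawled_pages = [700, 820, 640, 910, 760]
links = [12, 15, 9, 20, 13]
print(round(pearson(crawled_pages, links), 3))
```

The coefficient lies between -1 and 1. It measures only linear association, so a low value does not rule out a non-linear or lagged relationship between two SEO series.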

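Resample daily data to biweekly and check the correlations. Note that the 'MS' rule bins by calendar month and loffset (an argument removed in pandas 2.0) only shifts the bin labels by 14 days, so this call produces the same sums as the monthly resample; true two-week bins would use df.resample('2W').

df2w = df.resample('MS', loffset=pd.Timedelta(14, 'd')).sum()
df2w.corr()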

               crawled_pages  active_pages     links
crawled_pages       1.000000      0.365306  0.569630
active_pages        0.365306      1.000000  0.253734
links               0.569630      0.253734  1.000000
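
To compare aggregation windows side by side, the resample rule can be looped over. The sketch below does this for one-week, two-week and calendar-month bins on synthetic random data that stands in for the article's `df`, so the printed correlations are meaningless. Only the mechanics are the point, and the dictionary name `correlations` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the article's df; values are random.
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", "2018-12-31", freq="D")
df = pd.DataFrame(
    {
        "crawled_pages": rng.poisson(100, len(days)),
        "active_pages": rng.poisson(40, len(days)),
        "links": rng.poisson(3, len(days)),
    },
    index=days,
)

# "W" = one-week bins, "2W" = two-week bins, "MS" = calendar-month bins.
correlations = {
    rule: df.resample(rule).sum().corr()
    for rule in ("W", "2W", "MS")
}

for rule, corr in correlations.items():
    n_bins = len(df.resample(rule).sum())
    print(rule, n_bins, round(corr.loc["crawled_pages", "active_pages"], 3))
```

Wider bins leave fewer data points (36 monthly bins for three years), so a higher coefficient at a coarser resolution comes with more uncertainty.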

Resample daily data to monthly and check the correlations

dfm = df.resample('M').sum()
dfm.corr()

               crawled_pages  active_pages     links
crawled_pages       1.000000      0.365306  0.569630
active_pages        0.365306      1.000000  0.253734
links               0.569630      0.253734  1.000000

Decompose the crawled pages data into observed, trend, seasonal and residual components, after scaling all three columns with StandardScaler

scaler = StandardScaler()
dfm[['crawled_pages', 'active_pages','links']] = scaler.fit_transform(dfm[['crawled_pages', 'active_pages','links']])
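
StandardScaler rescales each column to zero mean and unit variance, using the population standard deviation (ddof=0). The pure-Python sketch below shows the same arithmetic on one column. The function name `standardize` and the sample values are illustrative assumptions, and the handling of a constant column is a choice made for this sketch.

```python
from math import sqrt


def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof=0)."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # A constant column carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up monthly totals, for illustration only.
monthly_links = [2, 4, 4, 4, 5, 5, 7, 9]
print(standardize(monthly_links))
```

Scaling puts the three series on a common scale for plotting. It does not change the Pearson correlations, because they are invariant under positive linear rescaling.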

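# period=12 assumes a yearly cycle in monthly data; older statsmodels releases call this argument freq
decomposition = seasonal_decompose(dfm['crawled_pages'], period=12)
fig = decomposition.plot()
fig.set_size_inches(15, 8)

[Figure: decomposition of the crawled pages data]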
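
To make the four panels of such a plot less of a black box, here is a simplified pure-Python sketch of a classical additive decomposition: a centered moving average for the trend, per-month averages of the detrended values for the seasonal part, and whatever is left as the residual. It follows the general textbook method and is not statsmodels' implementation. The function name `decompose_additive` and the sample series are assumptions.

```python
def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full cycles of data")
    half = period // 2
    # Centered moving average; for an even period the two end points get
    # half weight so the window stays symmetric around t.
    if period % 2 == 0:
        weights = [0.5] + [1.0] * (period - 1) + [0.5]
    else:
        weights = [1.0] * period
    trend = [None] * n  # undefined near the edges of the series
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = sum(w * x for w, x in zip(weights, window)) / period

    # Average the detrended values at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center so the seasonal terms sum to zero
    seasonal = [means[t % period] - offset for t in range(n)]

    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Made-up monthly series: linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -6, -5, -2, 0, 2, 4, 5]  # sums to zero
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, residual = decompose_additive(series, period=12)
print(trend[18], seasonal[:3], residual[18])
```

With only three years of monthly data, each seasonal value is an average of two or three observations, which is one reason a visible cycle in the plot is weak evidence of seasonality.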

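Decompose the active pages data into observed, trend, seasonal and residual components

# period=12 assumes a yearly cycle in monthly data; older statsmodels releases call this argument freq
decomposition = seasonal_decompose(dfm['active_pages'], period=12)
fig = decomposition.plot()
fig.set_size_inches(15, 8)

[Figure: decomposition of the active pages data]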

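Decompose the links data into observed, trend, seasonal and residual components

# period=12 assumes a yearly cycle in monthly data; older statsmodels releases call this argument freq
decomposition = seasonal_decompose(dfm['links'], period=12)
fig = decomposition.plot()
fig.set_size_inches(15, 8)

[Figure: decomposition of the links data]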

SEO data analysis results

For the SEO data collected from this website:

1)  The number of unique pages crawled by Googlebot, the number of unique active pages receiving organic traffic from Google and the number of external links (by first-discovery date, downloaded from Google Search Console) are positively correlated. All three sources were collected as daily data and later resampled to monthly data; the monthly correlations range from about 0.25 to 0.57.

2)  If we want to extract crawl data from the web server logs, cross it with our own crawl data of the website and calculate SEO metrics, we need at least two weeks of web server logs, since the correlation between active pages and crawled pages is higher at the coarser resampling (0.37) than at one week (0.12).

3)  As for trend, we see a drop in the number of unique pages crawled by Googlebot, while the number of active pages and the number of earned links increase (daily SEO data later resampled to monthly SEO data). As for seasonality, we observe some cycles in all three sources of SEO data, but they are not pronounced; more data or further analysis is needed to claim seasonality.

The next blog post is about forecasting SEO data; it is available at SEO Forecasting

Thanks for taking time to read this post. I offer consulting, architecture and hands-on development services in web/digital to clients in Europe & North America. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn
