SEO Forecasting

This blog post was first published on 2018-10-22 and updated on 2023-03-07.

Supervised machine learning mainly solves two types of problems: prediction and classification.

In this blog post, I will predict a site's monthly active pages by fitting a time-series model, pmdarima's auto_arima, to three years of collected SEO data.

This blog post and the accompanying Jupyter notebook aim to answer the following questions:

1) Can we predict SEO data with other related SEO data that we know is correlated?

2) How can we evaluate the predicted SEO results?

3) How can we use this model and the predicted SEO results?

The Jupyter notebook is available at SEO forecasting notebook

Collected SEO data

crawl.csv: Crawled pages data (pages crawled by Googlebot), collected between 2016 and 2019.

'crawl unique': For a given page on a given day, this column equals 1 if Googlebot crawled the page at least once on that day.

active.csv: Active pages data (pages receiving visits from Google search), collected between 2016 and 2019.

'active unique': For a given page on a given day, this column equals 1 if the page received at least one visit from Google on that day.
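To make these columns concrete, here is a minimal sketch with made-up rows. The values are illustrative and not taken from crawl.csv, and the `page` column name is an assumption about the file layout. It shows that summing the daily flag per month counts page-days, so a page crawled on two different days contributes 2 to that month's total, which differs from the number of distinct pages crawled in the month.

```python
import pandas as pd

# Illustrative rows only: one row per (page, day); 'page' is a hypothetical column name.
rows = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-08-01", "2016-08-01", "2016-08-02", "2016-09-15"]
        ),
        "page": ["/a", "/b", "/a", "/a"],
        "crawl unique": [1, 1, 1, 1],
    }
).set_index("date")

# Distinct pages crawled per day: sum of the 0/1 flag for each date.
daily = rows.groupby(rows.index)["crawl unique"].sum()

# Monthly total of the daily flags (page-days), the quantity a monthly sum produces.
month = rows.index.to_period("M")
monthly = rows.groupby(month)["crawl unique"].sum()

# For comparison: pages crawled at least once during the month.
monthly_distinct = rows.groupby(month)["page"].nunique()

print(monthly)
print(monthly_distinct)
```

In this toy example August 2016 totals 3 page-days but only 2 distinct pages, so the monthly series in this post is best read as crawl (or visit) volume rather than a count of distinct pages.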

SEO data analysis

This part is covered in the previous blog post, SEO data analysis.

SEO forecasting: Forecasting monthly active pages

First, import the necessary Python libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler

Read the SEO data files into pandas DataFrames, resample the daily values to monthly totals, and scale both columns.

# Load both files, using the 'date' column as a DatetimeIndex
df_crawl = pd.read_csv('crawl.csv', parse_dates=['date'], index_col='date')
df_active = pd.read_csv('active.csv', parse_dates=['date'], index_col='date')

# Keep 36 full months and sum the daily values per month
df_sum_crawl = df_crawl['2016-08-01':'2019-07-31'].resample('M').sum()
df_sum_active = df_active['2016-08-01':'2019-07-31'].resample('M').sum()

# Combine both series and scale each column to the [0, 1] range
scaler = MinMaxScaler()
data = pd.concat([df_sum_crawl['crawl unique'], df_sum_active['active unique']], axis=1)
data[['crawl unique', 'active unique']] = scaler.fit_transform(data[['crawl unique', 'active unique']])
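One caveat about the step above: the scaler is fitted on all 36 months, so the minimum and maximum of the months later held out for testing leak into the training data. Here is a minimal sketch of the alternative, learning the scaling from the training months only. It uses synthetic numbers in place of the real CSV files and the same 29/7 split applied in the next step. The manual formula is the one MinMaxScaler applies with its default feature range of 0 to 1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 36 monthly totals (not the real data).
rng = np.random.default_rng(0)
idx = pd.period_range("2016-08", periods=36, freq="M")
raw = pd.DataFrame(
    {
        "crawl unique": rng.integers(1_000, 5_000, size=36),
        "active unique": rng.integers(500, 3_000, size=36),
    },
    index=idx,
).astype(float)

# Split first, then learn the scaling parameters from the training months only.
train_raw, test_raw = raw.iloc[:29], raw.iloc[29:]
col_min = train_raw.min()
col_range = train_raw.max() - col_min

train_scaled = (train_raw - col_min) / col_range
# Test values reuse the training parameters, so they may fall outside [0, 1].
test_scaled = (test_raw - col_min) / col_range
```

With scikit-learn the same idea is to call fit on the training rows and transform on both splits, instead of a single fit_transform on the full table.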

Split the SEO data into training and test sets: the first 29 months are used for training and the remaining 7 months for testing. Our target SEO variable is active pages and our exogenous SEO variable is crawled pages. The model is fitted with seasonal=True and stepwise=True.

# #############################################################################
# Split the data into training (first 29 months) and test (last 7 months) sets

train, test = data[:29], data[29:]

# Fit a simple auto_arima model
# (recent pmdarima releases name the `exogenous` argument `X`)
forecast_active_pages = pm.auto_arima(y=train['active unique'], exogenous=train['crawl unique'].values.reshape(-1, 1), trace=1,
    error_action='ignore',  # don't want to know if an order does not work
    suppress_warnings=True,  # don't want convergence warnings
    seasonal=True, stepwise=True)

# Show the summary of the selected model
forecast_active_pages.summary()


Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-54.692, BIC=-45.366, Fit time=0.384 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=-45.793, BIC=-41.797, Fit time=0.073 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=-44.850, BIC=-39.521, Fit time=0.057 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=-45.530, BIC=-40.202, Fit time=0.093 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-47.668, BIC=-39.674, Fit time=0.294 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-50.596, BIC=-39.939, Fit time=0.396 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=-52.664, BIC=-44.671, Fit time=0.354 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-57.754, BIC=-47.096, Fit time=0.414 seconds
Fit ARIMA: order=(3, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=-54.710, BIC=-41.388, Fit time=0.532 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-47.499, BIC=-38.174, Fit time=0.406 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-56.027, BIC=-44.038, Fit time=0.448 seconds
Fit ARIMA: order=(2, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=-55.863, BIC=-43.873, Fit time=0.561 seconds
Total fit time: 4.029 seconds
Statespace Model Results
Dep. Variable:                 y    No. Observations:        29
Model:          SARIMAX(2, 1, 3)    Log Likelihood       36.877
Date:           Sun, 01 Sep 2019    AIC                 -57.754
Time:                   14:12:35    BIC                 -47.096
Sample:                        0    HQIC                -54.496
                            - 29
Covariance Type:             opg

              coef   std err        z    P>|z|    [0.025    0.975]
intercept   0.0673     0.054    1.243    0.214    -0.039     0.173
x1         -0.0737     0.040   -1.824    0.068    -0.153     0.005
ar.L1      -0.7561     0.148   -5.111    0.000    -1.046    -0.466
ar.L2      -0.8696     0.161   -5.391    0.000    -1.186    -0.553
ma.L1       0.9385     0.346    2.716    0.007     0.261     1.616
ma.L2       1.0481     0.413    2.539    0.011     0.239     1.857
ma.L3       0.6983     0.537    1.300    0.194    -0.355     1.751
sigma2      0.0036     0.002    2.002    0.045   7.6e-05     0.007

Ljung-Box (Q):              nan    Jarque-Bera (JB):    2.70
Prob(Q):                    nan    Prob(JB):            0.26
Heteroskedasticity (H):    0.44    Skew:               -0.57
Prob(H) (two-sided):       0.24    Kurtosis:            4.00
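The information criteria in the summary can be checked by hand from the reported log likelihood. The sketch below assumes the standard definitions AIC = 2k - 2 ln L, BIC = k ln(n) - 2 ln L and HQIC = 2k ln(ln n) - 2 ln L. It takes k = 8 estimated parameters (the eight rows of the coefficient table) and n = 28 effective observations, which is an assumption here: 29 months minus the one lost to the first difference (d = 1).

```python
import math

log_likelihood = 36.877  # "Log Likelihood" from the summary
k = 8                    # intercept, x1, ar.L1, ar.L2, ma.L1, ma.L2, ma.L3, sigma2
n = 29 - 1               # assumed effective sample size after differencing once

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood
hqic = 2 * k * math.log(math.log(n)) - 2 * log_likelihood

print(f"AIC={aic:.3f}  BIC={bic:.3f}  HQIC={hqic:.3f}")
```

With these inputs the three values agree with the summary to three decimals. The trace also shows that order (2, 1, 3) has the lowest AIC of all candidates tried, which is the criterion auto_arima minimises by default.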

# #############################################################################
# Plot actual test vs. forecasts:
x = np.arange(test['active unique'].shape[0])
plt.scatter(x, test['active unique'], marker='x')
plt.plot(x, forecast_active_pages.predict(exogenous=test['crawl unique'].values.reshape(-1, 1),n_periods=test.shape[0]))
plt.title('Actual test samples vs. forecasts');

Actual active pages vs forecasted active pages

Forecast the SEO data (in our case, the monthly totals of daily active pages) for the test months in 2019, and plot the observed against the forecasted values.

predicts = pd.DataFrame(forecast_active_pages.predict(exogenous=test['crawl unique'].values.reshape(-1, 1),n_periods=test.shape[0]), index = test.index)
pd.concat([test['active unique'],predicts],axis=1).plot(figsize=(15,8))

Observed active pages vs Forecasted active pages
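To put a number on the gap between the two curves, the forecast can be scored against the held-out months. Here is a minimal sketch using made-up arrays in place of the real test values and model output. It computes the mean absolute error and root mean squared error, and the same errors for a naive baseline that repeats the last training value.

```python
import numpy as np

# Made-up scaled values standing in for test['active unique'] and the model forecast.
observed = np.array([0.62, 0.58, 0.71, 0.66, 0.60, 0.75, 0.69])
forecast = np.array([0.60, 0.63, 0.65, 0.64, 0.66, 0.68, 0.67])
last_train_value = 0.59  # hypothetical last observation of the training set


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Naive baseline: every future month equals the last training month.
naive = np.full_like(observed, last_train_value)

print(f"model MAE={mae(observed, forecast):.3f} RMSE={rmse(observed, forecast):.3f}")
print(f"naive MAE={mae(observed, naive):.3f} RMSE={rmse(observed, naive):.3f}")
```

A forecast that does not beat the naive baseline adds little. Because the series was min-max scaled, these errors are in scaled units and need to be converted back with the fitted scaler to be read as page counts.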

SEO Forecasting results

Regarding the prediction of this website's SEO data:

1) The previous blog post on SEO data analysis found a correlation between the number of unique pages crawled by Googlebot and the number of unique active pages receiving organic traffic from Google. All SEO data sources were collected as daily datetime data and later resampled to monthly data.

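2) The previous blog post on SEO data analysis also detected seasonality in the crawl and active data, which is why seasonal is set to True in the model. Note, however, that the trace shows seasonal_order=(0, 0, 0, 1): the seasonal period m was left at its default of 1, so no seasonal terms were actually fitted. Capturing a yearly pattern in monthly data would require m=12.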

3) The model created above is not optimized, so the predictions are not accurate. There are several ways to improve it: for example, we can rebuild the model using only unique crawled pages that returned HTTP status code 200. The model can also be improved by hyperparameter tuning.

4) More exogenous SEO variables can be added to the model if they are found to be correlated with the target variable, such as marketing spend, Google Trends data for keywords important to the website, or the website's average Google ranking for those keywords.

5) The model can be used to evaluate our SEO work. If, for example, the error between the predicted and the observed number of active pages is very different from the error in earlier periods with no SEO work, it can be assumed that a hidden variable was at play, most likely an "SEO job".

6) Just as we forecast the number of monthly unique active pages, we can also forecast the number of monthly unique crawled pages on a site. This can help us identify anomalies in the number of crawled pages, that is, decreases or increases over time, and tell whether they are expected or unexpected.
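Points 5 and 6 amount to comparing what was observed with what the model expected. Here is a minimal sketch of that comparison, assuming the forecast comes with a lower and an upper bound per month (pmdarima's predict can return such intervals when called with return_conf_int=True). The arrays below are made-up values, not output from the model above.

```python
import numpy as np

# Hypothetical monthly values: observed data and a forecast interval per month.
observed = np.array([0.61, 0.64, 0.90, 0.66, 0.35])
lower = np.array([0.50, 0.52, 0.55, 0.56, 0.55])
upper = np.array([0.72, 0.75, 0.78, 0.80, 0.79])

# A month is flagged when the observation leaves the forecast interval.
above = observed > upper
below = observed < lower

for month, (is_high, is_low) in enumerate(zip(above, below), start=1):
    if is_high:
        print(f"month {month}: higher than expected")
    elif is_low:
        print(f"month {month}: lower than expected")
```

A flagged month is only a prompt to investigate: it may reflect SEO work, a technical problem, or simply a model that is still too weak, as noted in point 3.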

Thanks for taking time to read this post. I offer consulting, architecture and hands-on development services in web/digital to clients in Europe & North America. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn


