SEO data forecasting

This blog post was first published on 2018-10-22 and updated on 2019-08-21.

With supervised machine learning, it is mainly possible to solve two types of problems: prediction and classification.

In this blog post, I will forecast the monthly active pages of a site by applying a time series machine learning model, pmdarima's auto_arima, to three years of collected SEO data.

This blog post and the Jupyter notebook aim to answer the following questions:

1) Can we forecast SEO data using other related SEO data that we know are correlated?

2) How can we evaluate the forecasted SEO results?

3) How can we use this model and the forecasted SEO results?

The Jupyter notebook is available at SEO data forecasting notebook

Collected SEO data

crawl.csv: Crawled pages data (pages crawled by googlebot), collected between 2016 and 2019.

'crawl unique': For a specific page on a given day, this column value is equal to 1 if the page was crawled at least once by googlebot on that day.

active.csv: Active pages data (pages receiving visits from the google search engine), collected between 2016 and 2019.

'active unique': For a specific page on a given day, this column value is equal to 1 if the page received at least one visit from google on that day.
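
The raw files themselves are not shown in this post. Below is a hypothetical sketch of the assumed layout (one row per page and per day, with a 'date' column and the indicator column; the 'url' column is an illustrative assumption) and of how such rows aggregate to monthly volumes:

import pandas as pd

# Hypothetical sample mimicking the assumed crawl.csv structure:
# 'crawl unique' is 1 if googlebot crawled the page at least once that day.
sample = pd.DataFrame({'date': ['2016-08-01', '2016-08-01', '2016-08-02'],
                       'url': ['/page-a', '/page-b', '/page-a'],  # illustrative column, not confirmed by the post
                       'crawl unique': [1, 1, 1]})
sample['date'] = pd.to_datetime(sample['date'])
sample = sample.set_index('date')

# Summing per month gives the monthly volume of unique crawled pages,
# the same aggregation applied later with resample('M').sum()
print(sample['crawl unique'].resample('M').sum())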

SEO data analysis

This part is covered in the previous blog post, SEO data analysis

SEO forecasting: Forecasting monthly active pages

First, import the necessary Python libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler

Read the SEO data files into pandas dataframes, select three years of data, resample to monthly sums, and scale both series to the 0-1 range:

scaler = MinMaxScaler()
# Read daily crawl and active pages data, indexed by date
df_crawl = pd.read_csv('crawl.csv', parse_dates=['date'], index_col='date')
df_active = pd.read_csv('active.csv', parse_dates=['date'], index_col='date')
# Keep three full years and resample daily rows to monthly sums
df_sum_crawl = df_crawl['2016-08-01':'2019-07-31'].resample('M').sum()
df_sum_active = df_active['2016-08-01':'2019-07-31'].resample('M').sum()
# Combine both series and scale them to the 0-1 range
data = pd.concat([df_sum_crawl['crawl unique'], df_sum_active['active unique']], axis=1)
data[['crawl unique', 'active unique']] = scaler.fit_transform(data[['crawl unique', 'active unique']])
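
As a quick sanity check of the correlation reported in the previous blog post, the two monthly series can be compared directly; a minimal sketch using the data frame built above:

# Pearson correlation between monthly crawled and active page volumes
print(data[['crawl unique', 'active unique']].corr(method='pearson'))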

Split the SEO data into train and test sets. Our target SEO variable is the active pages and our exogenous SEO variable is the crawled pages. In the model, seasonal is set to True and stepwise is set to True.

# #############################################################################
# Split the data into train and test sets

train, test = data[:29], data[29:]

# Fit a simple auto_arima model with crawled pages as the exogenous variable
forecast_active_pages = pm.auto_arima(y=train['active unique'],
                                      exogenous=train['crawl unique'].values.reshape(-1, 1),
                                      trace=1,
                                      seasonal=True,
                                      error_action='ignore',    # don't want to know if an order does not work
                                      suppress_warnings=True,   # don't want convergence warnings
                                      stepwise=True)

print(forecast_active_pages.summary())

Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-54.692, BIC=-45.366, Fit time=0.384 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=-45.793, BIC=-41.797, Fit time=0.073 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 0, 0, 1); AIC=-44.850, BIC=-39.521, Fit time=0.057 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=-45.530, BIC=-40.202, Fit time=0.093 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-47.668, BIC=-39.674, Fit time=0.294 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 0, 0, 1); AIC=-50.596, BIC=-39.939, Fit time=0.396 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 0, 0, 1); AIC=-52.664, BIC=-44.671, Fit time=0.354 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-57.754, BIC=-47.096, Fit time=0.414 seconds
Fit ARIMA: order=(3, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=-54.710, BIC=-41.388, Fit time=0.532 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-47.499, BIC=-38.174, Fit time=0.406 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 0, 0, 1); AIC=-56.027, BIC=-44.038, Fit time=0.448 seconds
Fit ARIMA: order=(2, 1, 4) seasonal_order=(0, 0, 0, 1); AIC=-55.863, BIC=-43.873, Fit time=0.561 seconds
Total fit time: 4.029 seconds
                           Statespace Model Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  29
Model:               SARIMAX(2, 1, 3)   Log Likelihood                 36.877
Date:                Sun, 01 Sep 2019   AIC                           -57.754
Time:                        14:12:35   BIC                           -47.096
Sample:                             0   HQIC                          -54.496
                                 - 29
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept      0.0673      0.054      1.243      0.214      -0.039       0.173
x1            -0.0737      0.040     -1.824      0.068      -0.153       0.005
ar.L1         -0.7561      0.148     -5.111      0.000      -1.046      -0.466
ar.L2         -0.8696      0.161     -5.391      0.000      -1.186      -0.553
ma.L1          0.9385      0.346      2.716      0.007       0.261       1.616
ma.L2          1.0481      0.413      2.539      0.011       0.239       1.857
ma.L3          0.6983      0.537      1.300      0.194      -0.355       1.751
sigma2         0.0036      0.002      2.002      0.045     7.6e-05       0.007
===================================================================================
Ljung-Box (Q):                     nan   Jarque-Bera (JB):                 2.70
Prob(Q):                           nan   Prob(JB):                         0.26
Heteroskedasticity (H):           0.44   Skew:                            -0.57
Prob(H) (two-sided):              0.24   Kurtosis:                         4.00
===================================================================================

# #############################################################################
# Plot actual test samples vs. forecasts
x = np.arange(test['active unique'].shape[0])
plt.scatter(x, test['active unique'], marker='x')
plt.plot(x, forecast_active_pages.predict(exogenous=test['crawl unique'].values.reshape(-1, 1),
                                          n_periods=test.shape[0]))
plt.title('Actual test samples vs. forecasts')
plt.show()

Actual active pages vs forecasted active pages

Forecast the SEO data, in our case the volume of monthly resampled daily active pages in 2019, and plot the observed vs. forecasted values:

predicts = pd.DataFrame(forecast_active_pages.predict(exogenous=test['crawl unique'].values.reshape(-1, 1),
                                                      n_periods=test.shape[0]),
                        index=test.index)
pd.concat([test['active unique'], predicts], axis=1).plot(figsize=(15, 8))
L = plt.legend()
L.get_texts()[0].set_text('ObservedTest')
L.get_texts()[1].set_text('ForecastedTest')

Observed active pages vs Forecasted active pages
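
To evaluate the forecasted SEO results numerically (question 2 above), error metrics such as MAE and RMSE can be computed on the test period; a minimal sketch, assuming the test and predicts objects defined above:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = test['active unique']
y_pred = predicts.iloc[:, 0]  # forecasted values from the model above

# Both metrics are on the MinMax-scaled (0-1) scale used for modeling
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print('MAE: {:.4f}, RMSE: {:.4f}'.format(mae, rmse))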

SEO data forecasting results

Concerning the SEO data prediction for this website:

1) In the previous blog post about SEO data analysis, it was found that the number of unique pages crawled by googlebot and the number of unique active pages receiving organic traffic from google are correlated. All SEO data sources were collected as daily, datetime-indexed data and later resampled to monthly data.

2) In the previous blog post about SEO data analysis, seasonality was detected in the crawl and active data. Therefore, seasonal is set to True in the model.

3) The model created above is not optimized, therefore the predictions are not accurate. In order to optimize the model, we have several options: for example, we can recreate the model using only the unique crawled pages returning a 200 HTTP status code. The model can also be optimized by hyperparameter tuning (see the sketch after this list).

4) More exogenous SEO variables can be added to the model if they are found to be correlated with the target variable, such as marketing expenses, google trends data for keywords that are important to the website, or the average ranking of the website on google for these keywords.

5) The model can be used to evaluate our SEO work. If, for example, the error between the forecasted and observed numbers of active pages becomes much larger than in previous periods where no SEO work was done, it can be supposed that there has been a lurking variable, most probably the SEO work itself.

6) As we forecast the number of monthly unique active pages, the number of monthly unique crawled pages on a site can be forecasted too. This can help us identify anomalies in the number of crawled pages, such as drops or increases over time, and whether they are expected or unexpected behaviours.
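
As an illustration of the hyperparameter tuning mentioned in point 3, here is a hedged sketch that widens the auto_arima search space and sets an explicit monthly seasonal period; the parameter values below are assumptions to experiment with, not recommendations:

# Hypothetical tuning sketch: wider search space and explicit seasonal period
tuned_model = pm.auto_arima(y=train['active unique'],
                            exogenous=train['crawl unique'].values.reshape(-1, 1),
                            seasonal=True,
                            m=12,                         # assumed yearly seasonality on monthly data
                            max_p=5, max_q=5, max_d=2,    # wider non-seasonal search space
                            information_criterion='aic',  # criterion used to compare candidate models
                            stepwise=True,
                            error_action='ignore',
                            suppress_warnings=True,
                            trace=1)
print(tuned_model.summary())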

Thanks for taking the time to read this post. I offer consulting, architecture and hands-on development services in web/digital to clients in Europe & North America. If you'd like to discuss how my offerings can help your business, please contact me via LinkedIn.

Have comments, questions or feedback about this article? Please do share them with us here.
