SEO data forecasting

Supervised machine learning mainly solves two types of problems: prediction and classification.

In this blog post, I use a time series machine learning model, pyramid's auto-arima, on two years of collected SEO data to forecast the monthly active pages of a site.

This blog post and the accompanying Jupyter notebook aim to answer the following questions:

1) Can we forecast SEO data using other SEO data that we know to be correlated with it?

2) How can we evaluate the forecasted SEO results?

3) How can we use this model and the forecasted SEO results?

The Jupyter notebook is available at SEO data forecasting notebook

Collected SEO data

crawl.csv: Number of unique URLs crawled by Googlebot with an HTTP 200 status code, per day, between 2016 and 2018

google_analytics_data.csv: Google Analytics data between 2016 and 2018

links.csv: A CSV file downloaded from Google Search Console, listing the discovered external links and their first discovery dates between 2016 and 2018

SEO data analysis

This part is covered in the previous blog post, SEO data analysis.

SEO forecasting: Forecasting monthly active pages

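First, import the necessary Python libraries

import numpy as np
import pandas as pd
from math import sqrt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa.seasonal import seasonal_decompose
from pyramid.arima import auto_arima  # the pyramid-arima package was later renamed pmdarima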

Read the SEO data files into pandas dataframes

crawldata_colnames = ['date', 'crawled_pages']
linkdata_colnames = ['links', 'date']
# crawl.csv is whitespace-separated; a raw string keeps the regex separator intact
cd = pd.read_csv("crawl.csv", sep=r'\s', parse_dates=['date'], index_col='date', usecols=[*range(0,2)], names=crawldata_colnames, skiprows=1, header=None)
gad = pd.read_csv("google_analytics_data.csv", parse_dates=['ga:date'], index_col='ga:date')
ld = pd.read_csv("links.csv", parse_dates=['date'], index_col='date', names=linkdata_colnames, skiprows=1, header=None)

Count the number of earned links per day

ld = ld.groupby(['date']).count()['links']
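As a small illustration of this counting step, here is a minimal sketch on a made-up links table (the URLs and dates are invented for the example). It counts rows per discovery date in the same way as the groupby call above. Note that days without any discovered link do not appear in the result at all.

```python
import pandas as pd

# Toy stand-in for links.csv: one row per discovered external link (values are invented).
toy_links = pd.DataFrame(
    {"links": ["a.example/1", "b.example/2", "c.example/3"]},
    index=pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-03"]).rename("date"),
)

# Group by the 'date' index and count rows: one value per day that has at least one link.
links_per_day = toy_links.groupby(["date"]).count()["links"]
print(links_per_day)
```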

Select only organic search traffic from Google and count the number of active pages per day

pa = gad.loc[gad['ga:sourceMedium'] == 'google / organic'].groupby(['ga:date']).count()['ga:pagePath']

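Rename the columns of pa and keep the date as its index

pa = pa.reset_index()
pa.columns = ['date', 'active_pages']
pa = pa.set_index('date')  # restore the date index so that pd.concat can align on dates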
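The filtering and counting can be sketched on a toy Google Analytics export. The rows below are invented, and the sketch assumes the export has one row per page path, source and day, so that counting rows equals counting active pages. It also shows one way to rename the result while keeping the date as index.

```python
import pandas as pd

# Toy stand-in for google_analytics_data.csv (column names follow the post; rows are invented).
toy_ga = pd.DataFrame({
    "ga:date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02"]),
    "ga:sourceMedium": ["google / organic", "google / organic", "bing / organic", "google / organic"],
    "ga:pagePath": ["/a", "/b", "/a", "/a"],
}).set_index("ga:date")

# Keep Google organic rows only, then count page paths per day.
organic = toy_ga.loc[toy_ga["ga:sourceMedium"] == "google / organic"]
active = organic.groupby(["ga:date"]).count()["ga:pagePath"]

# Rename the series and its index; the date stays the index for later alignment.
active = active.rename("active_pages").rename_axis("date")
print(active)
```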

Concatenate the three data sources into one dataframe

df = pd.concat([cd, pa, ld], axis=1)

Fill empty values with 0

df = df.fillna(0)
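To see what the concatenation and the fill do, here is a minimal sketch with three invented daily series that only partly share dates. With axis=1, pandas aligns on the index, so a date missing from one source shows up as NaN before being replaced by 0.

```python
import pandas as pd

# Three invented daily series with partly different dates.
crawled = pd.Series([120, 130], index=pd.to_datetime(["2018-01-01", "2018-01-02"]), name="crawled_pages")
active = pd.Series([40, 45], index=pd.to_datetime(["2018-01-02", "2018-01-03"]), name="active_pages")
links = pd.Series([3], index=pd.to_datetime(["2018-01-02"]), name="links")

# axis=1 aligns on the date index (outer join): dates missing from a source become NaN.
combined = pd.concat([crawled, active, links], axis=1)

# Replace the NaN values with 0.
combined = combined.fillna(0)
print(combined)
```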

Resample the daily data to monthly totals and remove the rows in which any column is 0

dfm = df.resample('M').sum()
dfm = dfm[(dfm != 0).all(axis=1)]
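The monthly resampling and the zero-row filter can be sketched on a few invented daily rows. The sketch passes pd.offsets.MonthEnd() as the resampling rule, which is the month-end frequency that the 'M' alias stands for (recent pandas versions spell the alias 'ME').

```python
import pandas as pd

# Invented daily data spanning two months; February has no links at all.
idx = pd.to_datetime(["2018-01-10", "2018-01-20", "2018-02-05"])
daily = pd.DataFrame(
    {"crawled_pages": [100, 150, 90], "active_pages": [30, 35, 20], "links": [2, 1, 0]},
    index=idx,
)

# Sum the daily values per calendar month; each row is labelled with the month-end date.
monthly = daily.resample(pd.offsets.MonthEnd()).sum()

# Keep only the months in which every column is non-zero.
monthly = monthly[(monthly != 0).all(axis=1)]
print(monthly)
```

February is dropped here because its links total is 0, which is the same rule that shortens dfm above.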

Preprocess the data with StandardScaler

scaler = StandardScaler()
dfm[['crawled_pages', 'active_pages','links']] = scaler.fit_transform(dfm[['crawled_pages', 'active_pages','links']])
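What the scaler does to each column can be written out by hand. The sketch below uses invented counts and plain NumPy: each value has the column mean subtracted and is divided by the population standard deviation (ddof=0), which is the convention StandardScaler follows. The same two numbers are enough to undo the scaling.

```python
import numpy as np

# Invented monthly counts for a single column.
x = np.array([100.0, 200.0, 300.0])

# Standardize: subtract the mean, divide by the population standard deviation (ddof=0).
mean, std = x.mean(), x.std(ddof=0)
z = (x - mean) / std

# Undo the scaling to get back to the original counts.
x_back = z * std + mean
print(z, x_back)
```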

Split the SEO data into train and test sets. Our target SEO variable is active pages; our exogenous SEO data are the crawled pages and the links. In the model, seasonal is set to False and stepwise to True.

dftrain = dfm.loc['2016-09-30':'2018-08-31']
dftest = dfm.loc['2018-09-30':'2018-09-30']
test_pa = dftest['active_pages']
y = np.array(dftrain['active_pages'])
exogenous = np.array(dftrain[['crawled_pages','links']])
pa_fit_exo = auto_arima(y=y, exogenous=exogenous, start_p=0, start_q=0, max_p=3, max_q=3,
                    start_P=0, seasonal=False, d=1, D=0, trace=True,
                    error_action='ignore',  # don't want to know if an order does not work
                    suppress_warnings=True,  # don't want convergence warnings
                    stepwise=True)  # stepwise search, as stated above
pa_fit_exo.summary()

Fit ARIMA: order=(0, 1, 0); AIC=88.654, BIC=93.196, Fit time=0.018 seconds
Fit ARIMA: order=(1, 1, 0); AIC=89.012, BIC=94.690, Fit time=0.100 seconds
Fit ARIMA: order=(0, 1, 1); AIC=77.372, BIC=83.050, Fit time=0.115 seconds
Fit ARIMA: order=(1, 1, 1); AIC=79.368, BIC=86.181, Fit time=0.200 seconds
Fit ARIMA: order=(0, 1, 2); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 2); AIC=81.373, BIC=89.321, Fit time=0.545 seconds
Total fit time: 0.985 seconds

Dep. Variable:  D.y               No. Observations:    23
Model:          ARIMA(0, 1, 1)    Log Likelihood       -33.686
Method:         css-mle           S.D. of innovations  0.977
Date:           Mon, 22 Oct 2018  AIC                  77.372
Time:           13:20:32          BIC                  83.050
Sample:         1                 HQIC                 78.800

              coef  std err       z  P>|z|  [0.025  0.975]
const       0.0514    0.045   1.150  0.265  -0.036   0.139
x1         -0.0598    0.144  -0.414  0.683  -0.343   0.223
x2          0.0210    0.171   0.123  0.904  -0.314   0.356
ma.L1.D.y  -1.0000    0.114  -8.739  0.000  -1.224  -0.776

        Real  Imaginary  Modulus  Frequency
MA.1  1.0000   +0.0000j   1.0000     0.0000
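The information criteria in this output can be recomputed from the log likelihood, which also shows how the stepwise search picks a model. The sketch below is only a check under one assumption: that the model has 5 estimated parameters (const, x1, x2, ma.L1 and the innovation variance). The candidate AIC values are copied from the trace.

```python
import math

# Values read from the model summary.
log_likelihood = -33.686
n_obs = 23
# Assumption: 5 estimated parameters (const, x1, x2, ma.L1, innovation variance).
k = 5

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood
hqic = 2 * k * math.log(math.log(n_obs)) - 2 * log_likelihood
print(round(aic, 3), round(bic, 3), round(hqic, 3))

# Candidate AICs from the stepwise trace; the failed fit (nan) is skipped.
trace = {(0, 1, 0): 88.654, (1, 1, 0): 89.012, (0, 1, 1): 77.372,
         (1, 1, 1): 79.368, (0, 1, 2): float("nan"), (1, 1, 2): 81.373}
valid_orders = [order for order in trace if not math.isnan(trace[order])]

# The order with the lowest AIC is the one reported in the summary.
best_order = min(valid_orders, key=trace.get)
print(best_order)
```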


Forecast the SEO data, in our case the active pages in September 2018

pa_future_forecast = pa_fit_exo.predict(n_periods=len(dftest.index), exogenous = np.array(dftest[['crawled_pages','links']]))

Plot the observed and forecasted SEO data, active pages in September 2018

pa_df = pd.DataFrame(pa_future_forecast, index = dftest.index)
observed_v_forecasted = pd.concat([dftest.active_pages,pa_df],axis=1)
observed_v_forecasted_colnames = ['observed_active_pages','forecasted_active_pages']
observed_v_forecasted.columns = observed_v_forecasted_colnames
observed_v_forecasted.index = observed_v_forecasted.index.strftime('%Y-%b')
observed_v_forecasted.plot(kind='bar')  # assumption: a bar chart, since the test set is a single month
plt.ylabel('Scaled number of pages', fontsize=10)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title('Observed v. forecasted number of monthly active pages')

[Figure: Forecasting SEO Data: Predict Active Pages]

Evaluate the SEO prediction model

mae = mean_absolute_error(test_pa, pa_future_forecast)
print('MAE: %f' % mae)
mse = mean_squared_error(test_pa, pa_future_forecast)
print('MSE: %f' % mse)
rmse = sqrt(mse)
print('RMSE: %f' % rmse)
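To make the three metrics concrete, here is a minimal pure-Python sketch with invented values. It illustrates one point worth keeping in mind for this post: with a single test month, MAE and RMSE are the same number, and they only start to differ once several months are evaluated.

```python
from math import sqrt

def mae(observed, forecast):
    """Mean absolute error."""
    return sum(abs(o - f) for o, f in zip(observed, forecast)) / len(observed)

def mse(observed, forecast):
    """Mean squared error."""
    return sum((o - f) ** 2 for o, f in zip(observed, forecast)) / len(observed)

# Invented scaled values for a single test month, as in this post.
single_mae = mae([1.8], [1.5])
single_rmse = sqrt(mse([1.8], [1.5]))

# With several months, RMSE weights one large miss more heavily than MAE does.
obs, fc = [1.0, 2.0, 3.0], [1.0, 2.0, 6.0]
multi_mae = mae(obs, fc)
multi_rmse = sqrt(mse(obs, fc))
print(single_mae, single_rmse, multi_mae, multi_rmse)
```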

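Invert the scaling of the forecasted active pages

# Column order must match the order used to fit the scaler: crawled_pages, active_pages, links
scaled_active_pages_inverse = scaler.inverse_transform(np.column_stack((dftest['crawled_pages'], pa_future_forecast, dftest['links'])))[:, [1]]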
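The inverse transform applies each column's own mean and standard deviation, so the forecast has to sit in the active_pages position of the array. A minimal sketch with invented statistics shows how far off the result is when the wrong column's statistics are used.

```python
import numpy as np

# Assumed per-column statistics, in the order the scaler was fitted:
# crawled_pages, active_pages, links (all numbers are invented).
means = np.array([5000.0, 800.0, 40.0])
stds = np.array([1000.0, 100.0, 10.0])

scaled_forecast = 1.5  # forecasted active pages, in standardized units

# Correct: use the active_pages statistics (column index 1).
active_pages_count = scaled_forecast * stds[1] + means[1]

# Wrong column: the crawled_pages statistics (index 0) give a very different number.
wrong_count = scaled_forecast * stds[0] + means[0]
print(active_pages_count, wrong_count)
```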

SEO data forecasting results

Concerning the SEO data prediction for this website:

1) In the previous blog post about SEO data analysis, we found correlations between the number of unique pages crawled by Googlebot, the number of unique active pages receiving organic traffic from Google, and the external links by first-discovery date downloaded from Google Search Console. All three SEO data sources were collected as daily data and later resampled to monthly data.

2) In the previous blog post, the trend showed a drop in the number of unique pages crawled by Googlebot, alongside an increase in the number of unique active pages and in earned links. Seasonality was not obvious in any of the three SEO data sources.

3) Since we are not sure about the seasonality in our SEO data, we set seasonal to False in our machine learning model. However, some sites may have explicit seasonality.

4) We can add more exogenous SEO variables to our model if we see that they are correlated with the target variable, such as marketing expenses, Google Trends data for the website's important keywords, the website's average ranking on Google for these keywords, etc.

5) We can use this model to evaluate our SEO work. If, for example, the error between the forecasted and observed numbers of active pages differs strongly from the errors measured in earlier periods without SEO work, we can suppose that there is a lurking variable, most probably the "SEO work".

6) Just as we forecast the number of monthly unique active pages, we can forecast the number of monthly unique crawled pages on a site. This can help us identify anomalies in the number of crawled pages, such as drops or increases over time, and tell whether they are expected or unexpected behaviours.
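Points 5 and 6 both come down to comparing a new forecast error with the errors seen in the past. The sketch below is one hypothetical way to do that, not the method used in the notebook: the error values are invented, and the three-standard-deviations threshold is an arbitrary choice made for illustration.

```python
import numpy as np

# Invented absolute forecast errors (in pages) for past months without SEO work.
baseline_errors = np.array([120.0, 95.0, 140.0, 110.0, 130.0])

def is_unusual(new_error, history, n_sigmas=3.0):
    """Flag an error lying more than n_sigmas sample standard deviations from the historical mean."""
    mean, std = history.mean(), history.std(ddof=1)
    return bool(abs(new_error - mean) > n_sigmas * std)

# An error in line with the past is not flagged; a much larger one is.
typical_flag = is_unusual(125.0, baseline_errors)
large_flag = is_unusual(400.0, baseline_errors)
print(typical_flag, large_flag)
```

The same check could be run on forecast errors for crawled pages to separate expected from unexpected changes, provided enough past months are available to estimate the spread.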
