SEO data forecasting
Supervised machine learning mainly addresses two types of problems: prediction and classification.
In this blog post, I apply a time series machine learning model, pyramid's auto-arima, to two years of collected SEO data in order to forecast the monthly active pages of a site.
This blog post and the Jupyter notebook aim to answer the following questions:
1) Can we forecast SEO data using other SEO data that we know to be correlated with it?
2) How can we evaluate the forecasted SEO results?
3) How can we use this model and the forecasted SEO results?
Jupyter notebook is available at SEO data forecasting notebook
Collected SEO data
crawl.csv : Number of unique URLs crawled by Googlebot per day with an HTTP 200 status code, between 2016 and 2018
google_analytics_data.csv : Google Analytics data between 2016 and 2018
links.csv : A CSV file downloaded from Google Search Console, listing the discovered external links and their first-discovery dates between 2016 and 2018
SEO data analysis
This part is covered in the previous blog post, SEO data analysis
SEO forecasting: Forecasting monthly active pages
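First, import the necessary Python libraries
from math import sqrt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyramid.arima import auto_arima  # pyramid-arima; later releases are published as pmdarima
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.seasonal import seasonal_decompose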
Read the SEO data files into pandas DataFrames
crawldata_colnames = ['date', 'crawled_pages']
linkdata_colnames = ['links', 'date']
cd = pd.read_csv("crawl.csv", sep=r'\s', engine='python', parse_dates=['date'], index_col='date', usecols=[0, 1], names=crawldata_colnames, skiprows=1, header=None)
gad = pd.read_csv("google_analytics_data.csv", parse_dates=['ga:date'], index_col='ga:date')
ld = pd.read_csv("links.csv", parse_dates=['date'], index_col='date', names=linkdata_colnames, skiprows=1, header=None)
Count the number of earned links per day
ld = ld.groupby(['date']).count()['links']
Keep only organic search traffic from Google and count the number of active pages per day
pa = gad.loc[gad['ga:sourceMedium'] == 'google / organic'].groupby(['ga:date']).count()['ga:pagePath']
Rename the columns of pa and restore date as its index, so that it aligns with the other two sources
pa = pa.reset_index()
pa.columns = ['date', 'active_pages']
pa = pa.set_index('date')
Concatenate the three data sources into one DataFrame
df = pd.concat([cd, pa, ld], axis=1)
Fill empty values with 0
df = df.fillna(0)
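To see why all three sources need a date index before they are concatenated, here is a minimal sketch on made-up numbers. The values and the names crawl, active and links are hypothetical and not taken from the site's data. pd.concat with axis=1 aligns rows on the index and leaves NaN wherever a source has no entry for a day; fillna(0) then turns those gaps into zeros.

```python
import pandas as pd

# Hypothetical daily data: each source covers a different set of days.
crawl = pd.DataFrame(
    {'crawled_pages': [120, 95]},
    index=pd.to_datetime(['2018-01-01', '2018-01-02']),
)
active = pd.Series([30], index=pd.to_datetime(['2018-01-02']), name='active_pages')
links = pd.Series([2], index=pd.to_datetime(['2018-01-01']), name='links')

# Rows are matched by date; days missing from a source become NaN, then 0.
combined = pd.concat([crawl, active, links], axis=1).fillna(0)
print(combined)
```

A zero here means "no record for that day", which is why the monthly step later drops the rows that still contain zeros.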
Resample the daily data to monthly totals and drop the months in which any column is 0
dfm = df.resample('M').sum()
dfm = dfm[(dfm != 0).all(axis=1)]
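The monthly aggregation and the zero filter can be checked on a small made-up frame. This is only a sketch: the numbers are hypothetical, and it groups by calendar month with to_period('M') instead of calling resample, because recent pandas releases prefer the 'ME' alias over 'M' for month-end resampling. The monthly sums are the same; only the index type differs (periods instead of month-end timestamps).

```python
import pandas as pd

# Hypothetical daily counts spread over three months.
idx = pd.to_datetime(['2018-01-10', '2018-01-20', '2018-02-05', '2018-03-01'])
daily = pd.DataFrame(
    {
        'crawled_pages': [10, 5, 7, 3],
        'active_pages': [2, 1, 0, 4],
        'links': [1, 0, 0, 2],
    },
    index=idx,
)

# Sum the daily rows of each calendar month.
monthly = daily.groupby(daily.index.to_period('M')).sum()

# Keep only the months where every column is non-zero.
monthly_nonzero = monthly[(monthly != 0).all(axis=1)]
print(monthly_nonzero)
```

February is dropped in this example because it has no active pages and no links, which mirrors how incomplete months are removed from dfm.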
Standardize the data with StandardScaler
scaler = StandardScaler()
dfm[['crawled_pages', 'active_pages','links']] = scaler.fit_transform(dfm[['crawled_pages', 'active_pages','links']])
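For readers who have not used StandardScaler before: it subtracts each column's mean and divides by that column's population standard deviation, so every column ends up with mean 0 and standard deviation 1. The plain-Python sketch below reproduces that arithmetic for a single column. The function name standardize and the three values are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return the z-scores plus the mean and standard deviation used."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    return [(v - mu) / sigma for v in values], mu, sigma


monthly_active_pages = [100, 200, 300]  # hypothetical monthly counts
scaled, mu, sigma = standardize(monthly_active_pages)
print(scaled, mu, sigma)
```

Because each column has its own mean and standard deviation, the three SEO series become comparable in magnitude even though their raw counts differ widely.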
Split the SEO data into training and test sets. Our target variable is active pages; the exogenous variables are the crawled pages and the links. In the model, seasonal is set to False and stepwise to True.
dftrain = dfm.loc['2016-09-30':'2018-08-31']
dftest = dfm.loc['2018-09-30':'2018-09-30']
test_pa = dftest['active_pages']
y = np.array(dftrain['active_pages'])
exogenous = np.array(dftrain[['crawled_pages', 'links']])
pa_fit_exo = auto_arima(y=y, exogenous=exogenous, start_p=0, start_q=0, max_p=3, max_q=3,
                        start_P=0, seasonal=False, d=1, D=0, trace=True,
                        error_action='ignore',  # don't want to know if an order does not work
                        suppress_warnings=True,  # don't want convergence warnings
                        stepwise=True)  # stepwise search instead of a full grid search
Fit ARIMA: order=(0, 1, 0); AIC=88.654, BIC=93.196, Fit time=0.018 seconds
Fit ARIMA: order=(1, 1, 0); AIC=89.012, BIC=94.690, Fit time=0.100 seconds
Fit ARIMA: order=(0, 1, 1); AIC=77.372, BIC=83.050, Fit time=0.115 seconds
Fit ARIMA: order=(1, 1, 1); AIC=79.368, BIC=86.181, Fit time=0.200 seconds
Fit ARIMA: order=(0, 1, 2); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 2); AIC=81.373, BIC=89.321, Fit time=0.545 seconds
Total fit time: 0.985 seconds
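auto_arima compares the candidate orders by an information criterion, which is AIC by default, and keeps the one with the lowest value. The short sketch below replays that choice using the AIC values from the trace above. The dictionary and variable names are mine, and the real search does more than this (it also decides which orders to try next), so read it as an illustration of the selection rule only.

```python
import math

# (p, d, q) order -> AIC, copied from the trace output above.
candidates = {
    (0, 1, 0): 88.654,
    (1, 1, 0): 89.012,
    (0, 1, 1): 77.372,
    (1, 1, 1): 79.368,
    (0, 1, 2): math.nan,  # this fit failed, so it cannot be compared
    (1, 1, 2): 81.373,
}

# Ignore failed fits, then pick the order with the smallest AIC.
valid = {order: aic for order, aic in candidates.items() if not math.isnan(aic)}
best_order = min(valid, key=valid.get)
print(best_order, valid[best_order])
```

The result, order (0, 1, 1), matches the model reported in the summary.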
| Dep. Variable: | D.y | No. Observations: | 23 |
| Model: | ARIMA(0, 1, 1) | Log Likelihood | -33.686 |
| Method: | css-mle | S.D. of innovations | 0.977 |
| Date: | Mon, 22 Oct 2018 | AIC | 77.372 |
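The dependent variable in the summary is D.y rather than y because d=1 makes the model work on the month-to-month change instead of the level. Differencing also costs one data point, which is consistent with the 23 observations reported for a training window that spans the 24 months from September 2016 to August 2018. Here is a minimal sketch of first-order differencing and its inverse; the helper names and the series are hypothetical.

```python
def difference(series):
    """First-order differences: the change from each value to the next."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Rebuild the original levels from the first value and the differences."""
    levels = [first_value]
    for step in diffs:
        levels.append(levels[-1] + step)
    return levels


levels = [float(i * i) for i in range(24)]  # hypothetical 24 monthly values
diffs = difference(levels)
print(len(levels), len(diffs))
```

A forecast of the differenced series is turned back into a level the same way, by adding the predicted change to the last observed value.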
Forecast the SEO data, in our case the active pages in September 2018
pa_future_forecast = pa_fit_exo.predict(n_periods=len(dftest.index), exogenous = np.array(dftest[['crawled_pages','links']]))
Plot the observed and forecasted SEO data: active pages in September 2018
pa_df = pd.DataFrame(pa_future_forecast, index = dftest.index)
observed_v_forecasted = pd.concat([dftest.active_pages,pa_df],axis=1)
observed_v_forecasted_colnames = ['observed_active_pages','forecasted_active_pages']
observed_v_forecasted.columns = observed_v_forecasted_colnames
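observed_v_forecasted.index = observed_v_forecasted.index.strftime('%Y-%b')
observed_v_forecasted.plot(kind='bar')  # assumed chart type: one pair of bars for the single test month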
plt.ylabel('Scaled number of pages', fontsize=10)
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title('Observed v. forecasted number of monthly active pages')
Evaluate the SEO prediction model
mae = mean_absolute_error(test_pa, pa_future_forecast)
print('MAE: %f' % mae)
mse = mean_squared_error(test_pa, pa_future_forecast)
print('MSE: %f' % mse)
rmse = sqrt(mse)
print('RMSE: %f' % rmse)
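As a reminder of what these three metrics measure, here is a plain-Python sketch with made-up numbers. The function names and the values are hypothetical; the post itself uses the scikit-learn functions. One point worth noticing: our test set holds a single month (September 2018), so MAE and RMSE both reduce to the absolute value of that one error, and MSE to its square.

```python
from math import sqrt


def mae(observed, forecast):
    """Mean absolute error."""
    return sum(abs(o - f) for o, f in zip(observed, forecast)) / len(observed)


def mse(observed, forecast):
    """Mean squared error."""
    return sum((o - f) ** 2 for o, f in zip(observed, forecast)) / len(observed)


def rmse(observed, forecast):
    """Root mean squared error, in the same unit as the data."""
    return sqrt(mse(observed, forecast))


# Several test months: the metrics differ because RMSE penalises large errors more.
many_obs, many_fc = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]
print(mae(many_obs, many_fc), mse(many_obs, many_fc), rmse(many_obs, many_fc))

# A single test month: MAE and RMSE are the same number.
one_obs, one_fc = [0.8], [0.5]
print(mae(one_obs, one_fc), rmse(one_obs, one_fc))
```

With more than one test month the gap between MAE and RMSE becomes informative, since a large gap points to a few months with unusually large errors.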
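Inverse-transform the scaled active pages back to page counts
# The columns must follow the order the scaler was fitted on: crawled_pages, active_pages, links
scaled_active_pages_inverse = scaler.inverse_transform(np.column_stack((dftest['crawled_pages'], pa_future_forecast, dftest['links'])))[:, 1]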
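Column order matters in this step because the scaler stores one mean and one standard deviation per column, in the order it was fitted on (a fitted StandardScaler exposes them as mean_ and scale_). The sketch below uses hypothetical parameter values, not the ones fitted on the site's data, to show how far off the result is when a scaled forecast is inverted with another column's parameters.

```python
# Hypothetical per-column parameters, standing in for scaler.mean_ and scaler.scale_.
means = {'crawled_pages': 50000.0, 'active_pages': 1200.0, 'links': 40.0}
stds = {'crawled_pages': 8000.0, 'active_pages': 150.0, 'links': 10.0}


def inverse_scale(scaled_value, column):
    """Undo standardization for one value of the given column."""
    return scaled_value * stds[column] + means[column]


forecast_scaled = 0.5  # hypothetical scaled forecast of active pages

right = inverse_scale(forecast_scaled, 'active_pages')
wrong = inverse_scale(forecast_scaled, 'crawled_pages')
print(right, wrong)
```

Only the first result is a plausible number of active pages; the second is expressed in the scale of crawled pages.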
SEO data forecasting results
Regarding the SEO data forecast for this website:
1) In the previous blog post about SEO data analysis, we found correlations between three data sources: the number of unique pages crawled by Googlebot, the number of unique active pages receiving organic traffic from Google, and the external links by first-discovery date downloaded from Google Search Console. All three were collected as daily data and later resampled to monthly data.
2) In the previous blog post, the monthly trend showed a drop in the number of unique pages crawled by Googlebot, alongside an increase in the number of unique active pages and earned links. Seasonality was not obvious in any of the three sources.
3) Since we are not sure about seasonality in our SEO data, we set seasonal to False in our machine learning model. However, some sites may show explicit seasonality.
4) We can add more exogenous SEO variables to the model if they are correlated with the target variable, such as marketing expenses, Google Trends data for the website's important keywords, or the website's average Google ranking for those keywords.
5) We can use this model to evaluate our SEO work. For example, if the error between the forecasted and observed numbers of active pages differs markedly from the errors measured in earlier periods without SEO work, we can suppose that a lurking variable is at play, most probably the SEO work itself.
6) Just as we forecast the number of monthly unique active pages, we can forecast the number of monthly unique crawled pages on a site. This helps us spot anomalies in the number of crawled pages, such as drops or increases over time, and judge whether they are expected or unexpected.
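Points 5 and 6 can be turned into a simple check: compare each new forecast error with the errors seen in a baseline period, and flag the months that fall well outside that range. The sketch below is one possible rule and not something taken from the notebook. The function name, the threshold of two standard deviations and all the numbers are assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_unusual_months(baseline_errors, new_errors, k=2.0):
    """Return the threshold and the months whose absolute error exceeds it.

    The threshold is the baseline mean absolute error plus k standard deviations.
    """
    abs_baseline = [abs(e) for e in baseline_errors]
    threshold = mean(abs_baseline) + k * stdev(abs_baseline)
    flagged = {month: err for month, err in new_errors.items() if abs(err) > threshold}
    return threshold, flagged


# Hypothetical forecast errors (observed minus forecast, in scaled units).
baseline_errors = [0.10, -0.20, 0.15, -0.05, 0.10]  # months without SEO work
new_errors = {'2018-10': 0.05, '2018-11': 0.60}  # months to evaluate

threshold, flagged = flag_unusual_months(baseline_errors, new_errors)
print(round(threshold, 3), flagged)
```

A flagged month does not prove that SEO work caused the gap; it only marks the month as worth investigating, for active pages as well as for crawled pages.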
Thanks for taking time to read this post. I offer consulting, architecture and hands-on development services in web/digital to clients in Europe & North America. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn