SEO data distribution analysis

Before analyzing the distribution of SEO data, it helps to say briefly what SEO data is.

There are many sources of SEO data, but two are essential: data collected from website crawls, and data collected from the web server logs of websites.

In this blog post, I focus on analyzing the distribution of SEO data collected from web server logs.

Web server logs mainly provide two types of data. The first is data on search engine bot crawls of the website, which is the crawled pages data. The second is data on visits to the website by search engine users, which is the active pages data.
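To make the distinction concrete, here is a minimal sketch of how a single log line could be assigned to the crawl or the active data set. It assumes the logs use the Combined Log Format, that a bot hit can be recognised by "Googlebot" in the user agent, and that a search visit carries a referrer whose host contains "google."; the function name `classify_hit` and the sample lines are hypothetical, and this is not necessarily how the data in this post was prepared.

```python
import re
from urllib.parse import urlparse

# Combined Log Format:
# ip - - [time] "METHOD url PROTOCOL" status size "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def classify_hit(line):
    """Return (kind, url) where kind is 'crawl', 'active' or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None, None
    url = match.group('url')
    # Assumption: the user agent string alone identifies the bot.
    if 'Googlebot' in match.group('agent'):
        return 'crawl', url
    # Assumption: a search visit has a referrer host containing "google.".
    host = urlparse(match.group('referrer')).netloc
    if 'google.' in host:
        return 'active', url
    return None, url

# Hypothetical sample lines.
bot_line = (
    '66.249.66.1 - - [01/Aug/2019:10:00:00 +0000] "GET /blog/post/ HTTP/1.1" '
    '200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
visit_line = (
    '203.0.113.7 - - [01/Aug/2019:10:05:00 +0000] "GET /blog/post/ HTTP/1.1" '
    '200 1234 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0)"'
)
direct_line = (
    '203.0.113.8 - - [01/Aug/2019:10:06:00 +0000] "GET /about/ HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"'
)

bot_result = classify_hit(bot_line)
visit_result = classify_hit(visit_line)
direct_result = classify_hit(direct_line)
```

User agents can be spoofed, so a production pipeline would probably also verify bot hits, for example by reverse DNS lookup; that step is left out of this sketch.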

This blog post presents a distribution analysis of crawled and active pages data. The analysis is also available as a Jupyter notebook on GitHub: SEO Data Distribution Analysis.

Data distribution analysis can be helpful in our SEO work. It shows that the distribution of SEO data differs from year to year and from one page category to another. Observing these differences across time frames or page categories helps in segmenting the SEO data.

Distribution analysis is also useful for roughly estimating missing SEO data. For example, if we know that a certain category of pages has a right-skewed daily crawl distribution, we can estimate the missing daily crawl values of that category with the median rather than the mean, because a few very high days pull the mean upward. If instead the distribution is approximately normal, the mean is the more accurate estimate.
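The median-or-mean rule described above can be sketched as follows. This is a minimal illustration, not the method used in the notebook: the helper name `fill_missing_daily`, the skewness threshold of 0.5 and the sample crawl counts are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

def fill_missing_daily(series, skew_threshold=0.5):
    """Fill NaN days with the median if the data is skewed, else with the mean.

    skew_threshold is an arbitrary cut-off chosen for this sketch.
    """
    skewness = series.skew()  # sample skewness; NaN values are ignored
    if abs(skewness) > skew_threshold:
        fill_value = series.median()
    else:
        fill_value = series.mean()
    return series.fillna(fill_value), skewness

# Hypothetical daily crawl counts with one missing day and one spike.
days = pd.date_range('2019-01-01', periods=8, freq='D')
crawl = pd.Series([10, 12, 11, np.nan, 13, 12, 95, 11], index=days)

filled, skewness = fill_missing_daily(crawl)
```

In this sample the single spike of 95 makes the series right-skewed, so the missing day is filled with the median (12) instead of the mean (about 23), which would overstate a typical day.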

# Run in a Jupyter notebook: %matplotlib inline is an IPython magic command.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Load the crawled and active pages data, indexed by date.
df_crawl = pd.read_csv('crawl.csv', parse_dates=['date'], index_col='date')
df_active = pd.read_csv('active.csv', parse_dates=['date'], index_col='date')
# Daily totals; numeric_only=True skips text columns such as request_url.
df_sum_crawl = df_crawl.resample('D').sum(numeric_only=True)
df_sum_active = df_active.resample('D').sum(numeric_only=True)
df_sum = pd.concat([df_sum_crawl, df_sum_active], axis=1)
# Rescale each column to [0, 1] so the crawl and active series share one axis.
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

[Figure 1: daily unique crawled and active pages, all URLs, min-max scaled]
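The `MinMaxScaler` step above rescales every column to the range 0 to 1 with its default settings. The sketch below reproduces that formula in plain pandas to show what the scaling does and does not change; the function name `min_max_scale` and the sample numbers are made up for illustration.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column to [0, 1] using (x - min) / (max - min).

    A constant column would divide by zero here; this sketch ignores that case.
    """
    return (df - df.min()) / (df.max() - df.min())

# Hypothetical daily totals.
raw = pd.DataFrame({
    'crawl unique': [20, 50, 80, 120, 400],
    'active unique': [2, 4, 3, 10, 5],
})
scaled = min_max_scale(raw)

# Skewness before and after scaling, for comparison.
raw_skew = raw.skew()
scaled_skew = scaled.skew()
```

Because the transformation is linear, the shape of each histogram and its skewness stay the same; only the axis units change. The medians and means reported later in this post are therefore in scaled units, not page counts.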

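# Distribution of each scaled daily series over the whole period.
df_sum.hist(figsize=(8,6));

[Figure 2: histograms of the four daily series, whole period]

# The same histograms for three consecutive one-year windows.
df_sum.loc['2016-08-01':'2017-08-01'].hist(figsize=(8,6));

[Figure 3: histograms of the four daily series, August 2016 to August 2017]

df_sum.loc['2017-08-01':'2018-08-01'].hist(figsize=(8,6));

[Figure 4: histograms of the four daily series, August 2017 to August 2018]

df_sum.loc['2018-08-01':'2019-08-01'].hist(figsize=(8,6));

[Figure 5: histograms of the four daily series, August 2018 to August 2019]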

scaler = MinMaxScaler()
# Blog articles: URLs under /blog/, without listing, paginated and utility pages.
# str.contains treats the joined pattern as a regular expression, so the dot in .html is escaped.
include = ['/blog/']
exclude = ['/tag/','/archive/','page',r'\.html','/category/','/en/','/admin/','author','//']
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))]
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))]
# Drop the blog index page itself.
df_select_crawl = df_select_crawl[~df_select_crawl.request_url.str.endswith('/blog/')]
df_select_active = df_select_active[~df_select_active.request_url.str.endswith('/blog/')]
# Remove the excluded URLs, then sum the numeric columns per day.
df_select_crawl = df_select_crawl[~df_select_crawl.request_url.str.contains('|'.join(exclude))].resample('D').sum(numeric_only=True)
df_select_active = df_select_active[~df_select_active.request_url.str.contains('|'.join(exclude))].resample('D').sum(numeric_only=True)
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

[Figure 6: daily unique crawled and active pages for blog articles, min-max scaled]
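Because `str.contains` treats its pattern as a regular expression by default, terms with special characters, such as the dot in .html, need escaping to be matched literally. The sketch below wraps the include/exclude selection in a small helper that escapes every term with `re.escape`. The helper name `select_urls` and the sample URLs are hypothetical; only the `request_url` column name comes from the code above.

```python
import re
import pandas as pd

def select_urls(df, include, exclude=(), column='request_url'):
    """Keep rows whose URL contains any include term and no exclude term.

    Terms are matched literally: each one is escaped before being joined
    into a regular expression.
    """
    urls = df[column]
    include_pattern = '|'.join(re.escape(term) for term in include)
    mask = urls.str.contains(include_pattern)
    if exclude:
        exclude_pattern = '|'.join(re.escape(term) for term in exclude)
        mask = mask & ~urls.str.contains(exclude_pattern)
    return df[mask]

# Hypothetical URLs and hit counts.
sample = pd.DataFrame({
    'request_url': [
        '/blog/seo-logs/',
        '/blog/tag/seo/',
        '/blog/xhtml-guide/',
        '/blog/post.html',
        '/about/',
    ],
    'crawl': [5, 3, 2, 4, 1],
})

selected = select_urls(sample, include=['/blog/'], exclude=['/tag/', '.html'])
```

With an unescaped '.html' pattern, '/blog/xhtml-guide/' would also be excluded, because the dot matches any character; with escaping it is kept.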

df_sum.loc['2019-01-01':'2019-08-01'].hist(figsize=(8,6));

[Figure 7: histograms of the four daily series for blog articles, January to August 2019]

df_sum.loc['2019-01-01':'2019-08-01'].median()

crawl            0.076433
crawl unique     0.219512
active           0.423729
active unique    0.272727
dtype: float64

df_sum.loc['2019-01-01':'2019-08-01'].mean()

crawl            0.095183
crawl unique     0.251002
active           0.420625
active unique    0.282757
dtype: float64
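One quick way to read the two outputs above is to compare the mean with the median for each column. The sketch below does this with the printed blog-article values; the variable names are arbitrary. A mean clearly above the median often accompanies a right-skewed distribution, though it is a rule of thumb and not a formal test.

```python
import pandas as pd

# Values copied from the median and mean outputs for blog articles.
median = pd.Series({
    'crawl': 0.076433,
    'crawl unique': 0.219512,
    'active': 0.423729,
    'active unique': 0.272727,
})
mean = pd.Series({
    'crawl': 0.095183,
    'crawl unique': 0.251002,
    'active': 0.420625,
    'active unique': 0.282757,
})

# Relative gap between mean and median, in scaled units.
relative_gap = (mean - median) / median
# Columns whose mean exceeds the median by more than 10% (arbitrary cut-off).
likely_right_skewed = relative_gap[relative_gap > 0.10].index.tolist()
```

Here the crawl mean is about 25% above its median and the crawl unique mean about 14% above, while the active mean and median are almost equal. That suggests the median is the safer estimate for missing crawl values, while either statistic would do for the active series. The values are min-max scaled, so the size of the gap would differ in raw counts, but its sign would not.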

scaler = MinMaxScaler()
# Tag, archive, paginated and category pages.
include = ['/tag/','/archive/','page','/category/']
# Keep matching URLs, then sum the numeric columns per day.
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))].resample('D').sum(numeric_only=True)
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))].resample('D').sum(numeric_only=True)
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

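[Figure 8: daily unique crawled and active pages for tag, archive, paginated and category pages (crawl waste), min-max scaled]

df_sum.hist(figsize=(8,6));

[Figure 9: histograms of the four daily series for tag, archive, paginated and category pages, whole period]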

df_sum.loc['2019-01-01':'2019-08-01'].median()

crawl            0.018182
crawl unique     0.024390
active           0.000000
active unique    0.000000
dtype: float64

scaler = MinMaxScaler()
# Pages with .html in the URL; the dot is escaped because str.contains uses regular expressions.
include = [r'\.html']
# Keep matching URLs, then sum the numeric columns per day.
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))].resample('D').sum(numeric_only=True)
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))].resample('D').sum(numeric_only=True)
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

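[Figure 10: daily unique crawled and active pages for .html pages, min-max scaled]

df_sum.loc['2019-01-01':'2019-08-01'].hist(figsize=(8,6));

[Figure 11: histograms of the four daily series for .html pages, January to August 2019]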
