
SEO Data Distribution Analysis

Before analyzing the distribution of SEO data, it is worth talking briefly about what SEO data is.

There are many sources of SEO data, but two are essential: data collected from website crawls, and data collected from the web server logs of websites.

In this blog post, I would like to focus on analyzing the distribution of SEO data collected from the web server logs.

The web server logs, in turn, mainly provide two types of data. The first is data on search engine bot crawls of the website's pages, the crawled pages data; the second is data on search engine users' visits to the website, the active pages data.
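To make the distinction concrete, here is a minimal sketch of my own (not code from this analysis) showing how a single log record could be classified, assuming hypothetical field names for the user agent and the referrer:

import re

# Hypothetical, simplified log record; real log formats and field names vary.
record = {
    'user_agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'referer': '-',
}

# Crawled pages data: hits made by search engine bots.
is_bot_crawl = 'Googlebot' in record['user_agent'] or 'bingbot' in record['user_agent']

# Active pages data: hits made by users arriving from a search engine results page.
is_search_visit = (not is_bot_crawl
                   and re.search(r'google\.|bing\.', record['referer']) is not None)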

This blog post presents a distribution analysis of the crawled and active pages data. The analysis is also available as a Jupyter notebook on GitHub: SEO Data Distribution Analysis.

Data distribution analysis can be helpful in our SEO work. In this type of analysis we can see that the distribution of SEO data differs per year or per page category. Observing these differences across time frames or page categories can help with segmenting the SEO data. Distribution analysis can also be useful for roughly estimating missing SEO data. For example, if we know that a certain category of pages has a right-skewed daily crawl distribution, we can estimate missing daily crawl data for these pages with the median rather than the mean of that category's data. Conversely, if the data follows a normal distribution, taking the mean will give a more accurate estimate of the missing daily crawl data.
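As a rough illustration of that last point, here is a small sketch of my own (not part of the original analysis) that picks the estimator from the shape of a hypothetical daily crawl series and fills the missing days with it:

import pandas as pd

# Hypothetical daily crawl counts for one page category, with missing days as NaN.
daily_crawl = pd.Series(
    [120, 95, 110, None, 600, 105, None, 98],
    index=pd.date_range('2019-07-01', periods=8, freq='D'),
)

# A clearly right-skewed series is better summarized by its median,
# a roughly symmetric (normal-looking) one by its mean.
estimate = daily_crawl.median() if daily_crawl.skew() > 1 else daily_crawl.mean()
daily_crawl_filled = daily_crawl.fillna(estimate)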

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import MinMaxScaler

# Load the crawled pages and active pages logs, aggregate them per day,
# scale every metric to the 0-1 range and plot the daily unique page counts.
scaler = MinMaxScaler()
df_crawl = pd.read_csv('crawl.csv', parse_dates=['date'], index_col='date')
df_active = pd.read_csv('active.csv', parse_dates=['date'], index_col='date')
df_sum_crawl = df_crawl.resample('D').sum()
df_sum_active = df_active.resample('D').sum()
df_sum = pd.concat([df_sum_crawl, df_sum_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

seo-data-distribution-analysis-01

df_sum.hist(figsize=(8,6));

seo-data-distribution-analysis-02

df_sum['2016-08-01':'2017-08-01'].hist(figsize=(8,6));

seo-data-distribution-analysis-03

df_sum['2017-08-01':'2018-08-01'].hist(figsize=(8,6));

seo-data-distribution-analysis-04

df_sum['2018-08-01':'2019-08-01'].hist(figsize=(8,6));

seo-data-distribution-analysis-05

# Segment: blog article URLs only. Keep URLs containing /blog/, drop the /blog/
# index page itself, and exclude tag, archive, pagination, category and other
# non-article URLs before aggregating per day.
scaler = MinMaxScaler()
include = ['/blog/']
exclude = ['/tag/','/archive/','page','.html','/category/','/en/','/admin/','author','//']
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))]
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))]
df_select_crawl = df_select_crawl[~df_select_crawl.request_url.str.endswith('/blog/')]
df_select_active = df_select_active[~df_select_active.request_url.str.endswith('/blog/')]
df_select_crawl = df_select_crawl[~df_select_crawl.request_url.str.contains('|'.join(exclude))].resample('D').sum()
df_select_active = df_select_active[~df_select_active.request_url.str.contains('|'.join(exclude))].resample('D').sum()
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

blog articles crawl active pages data

df_sum['2019-01-01':'2019-08-01'].hist(figsize=(8,6));

blog articles crawl data distribution

df_sum['2019-01-01':'2019-08-01'].median()

crawl            0.076433
crawl unique     0.219512
active           0.423729
active unique    0.272727
dtype: float64

df_sum['2019-01-01':'2019-08-01'].mean()

crawl            0.095183
crawl unique     0.251002
active           0.420625
active unique    0.282757
dtype: float64
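Over this period the crawl mean (0.095) is noticeably higher than the crawl median (0.076), which already suggests a right-skewed daily crawl distribution for the blog articles. A quick way to quantify this, not part of the original notebook, is pandas' skew(), where values well above zero indicate a right skew:

df_sum['2019-01-01':'2019-08-01'].skew()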

# Segment: tag, archive, pagination and category pages.
scaler = MinMaxScaler()
include = ['/tag/','/archive/','page','/category/']
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))].resample('D').sum()
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))].resample('D').sum()
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

seo-data-distribution-analysis-category-crawl-waste

df_sum.hist(figsize=(8,6));

seo-data-distribution-analysis-09

df_sum['2019-01-01':'2019-08-01'].median()

crawl            0.018182
crawl unique     0.024390
active           0.000000
active unique    0.000000
dtype: float64

# Segment: .html pages.
scaler = MinMaxScaler()
include = ['.html']
df_select_crawl = df_crawl[df_crawl.request_url.str.contains('|'.join(include))].resample('D').sum()
df_select_active = df_active[df_active.request_url.str.contains('|'.join(include))].resample('D').sum()
df_sum = pd.concat([df_select_crawl, df_select_active], axis=1)
df_sum[['crawl','crawl unique','active','active unique']] = scaler.fit_transform(df_sum[['crawl','crawl unique','active','active unique']])
df_sum[['crawl unique','active unique']].plot(figsize=(15,8));

seo-data-distribution-analysis-10

df_sum['2019-01-01':'2019-08-01'].hist(figsize=(8,6));

seo-data-distribution-analysis-11

Thanks for taking the time to read this post. I offer consulting, architecture, and hands-on development services in web/digital to clients in Europe and North America. If you'd like to discuss how my offerings can help your business, please contact me via LinkedIn.
