When we search for "Common Crawl" on Google, the knowledge graph states that "Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month. Common Crawl was founded by Gil Elbaz."

Their most recent blog post, at http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/, links to a publicly available file:

cc-main-2018-aug-sep-oct-domain-ranks.txt.gz (1.89 GB), which provides the most recent harmonic centrality and pagerank values for 87 million domains.

Preview of the Common Crawl August, September, October 2018 domain ranks file

Below you can see a preview of this file. The column names are modified by dropping the leading "#"; for example, #pr_val becomes pr_val.

# harmonicc_pos harmonicc_val pr_pos pr_val host_rev n_hosts
0 1 24993276.0 2 0.012750407517909759 'com.facebook' 7348
1 2 24671056.0 1 0.01721003017179015 'com.googleapis' 1904
2 3 23453366.0 3 0.010760718435453807 'com.google' 3328
3 4 22371572.0 4 0.008252403881908257 'com.twitter' 1138
4 5 22136836.0 5 0.006785530343703675 'com.youtube' 2985
... ... ... ... ... ... ...
87,160,627 87160628 0.0 87122763 4.481314067744052e-09 'zw.org.mdc' 1
87,160,628 87160629 0.0 87122764 4.481314067744052e-09 'zw.org.partnersforlife' 1
87,160,629 87160630 0.0 87122765 4.481314067744052e-09 'zw.org.yamurai' 1
87,160,630 87160631 0.0 87122766 4.481314067744052e-09 'zw.org.yard' 1
87,160,631 87160632 0.0 87122767 4.481314067744052e-09 'zw.org.youthalive' 1
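
A minimal sketch of how such a file could be read, assuming it is tab-separated and that each header field carries a leading "#" (both are assumptions about the file layout, and the helper names parse_ranks and open_ranks are hypothetical). It streams the file line by line, so the 1.89 GB archive never has to fit in memory at once.

```python
import gzip

# Expected columns and their types; names are the header names without "#".
COLUMN_TYPES = {
    "harmonicc_pos": int,
    "harmonicc_val": float,
    "pr_pos": int,
    "pr_val": float,
    "host_rev": str,
    "n_hosts": int,
}

def parse_ranks(lines, sep="\t"):
    """Yield one dict per data line; the first line is treated as the header."""
    lines = iter(lines)
    header = [name.lstrip("#").strip() for name in next(lines).rstrip("\n").split(sep)]
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Unknown columns are kept as strings.
        yield {name: COLUMN_TYPES.get(name, str)(value)
               for name, value in zip(header, fields)}

def open_ranks(path):
    """Open the gzip-compressed ranks file in text mode."""
    return gzip.open(path, mode="rt", encoding="utf-8")

# In-memory sample built from the first two rows of the preview above.
sample = [
    "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev\t#n_hosts\n",
    "1\t24993276.0\t2\t0.012750407517909759\tcom.facebook\t7348\n",
    "2\t24671056.0\t1\t0.01721003017179015\tcom.googleapis\t1904\n",
]
rows = list(parse_ranks(sample))
print(rows[0]["host_rev"], rows[0]["pr_val"])
```

With the downloaded file, the same generator could be fed from open_ranks("cc-main-2018-aug-sep-oct-domain-ranks.txt.gz") instead of the in-memory sample.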

Statistics of the Common Crawl August, September, October 2018 domain ranks data

  • Pagerank 

mean, min, max of pr_val  = 1.1473075614418515e-08, 4.48131407e-09, 1.72100302e-02

  • Harmonic centrality

mean, min, max of harmonicc_val  = 9421776.2697027,  0. , 24993276.

  • Number of hosts (subdomains)

mean, min, max of n_hosts  = 10.363855461718083, 1.0000000e+00, 2.6061259e+07

  • Correlations

correlation(pr_val, harmonicc_val) =  0.00432823

correlation(pr_val, n_hosts)  =  0.02463352

correlation( harmonicc_val, n_hosts)  = 0.00068831
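
The statistics above can be sketched in plain Python as follows. This assumes that "correlation" means the Pearson correlation coefficient, which the post does not state explicitly; summarize and pearson are hypothetical helper names. The sample uses only the five top rows of the preview, so its results are not comparable to the full-file values.

```python
import math

def summarize(values):
    """Return (mean, min, max) of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values), min(values), max(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# pr_val and n_hosts of the five top rows shown in the preview.
pr_val = [0.012750407517909759, 0.01721003017179015, 0.010760718435453807,
          0.008252403881908257, 0.006785530343703675]
n_hosts = [7348, 1904, 3328, 1138, 2985]

print("mean, min, max of pr_val =", summarize(pr_val))
print("correlation(pr_val, n_hosts) =", pearson(pr_val, n_hosts))
```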

Data visualization 

Distribution of pagerank

The graph below presents the plot of the count of pagerank values. It shows us that the distribution of pagerank over the 87 million domains is highly right skewed, meaning that the majority of the domains have a very low pagerank.

Plot 87 million domains pagerank
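
One way to inspect such a skewed distribution without plotting is to count values per order of magnitude. The sketch below is an illustration, not the method used for the plot; log10_histogram is a hypothetical helper, and it only accepts strictly positive values such as pr_val.

```python
import math
from collections import Counter

def log10_histogram(values):
    """Count values per power-of-ten bin, keyed by floor(log10(value))."""
    counts = Counter()
    for value in values:
        if value <= 0:
            raise ValueError("log10 bins need strictly positive values")
        counts[math.floor(math.log10(value))] += 1
    return dict(sorted(counts.items()))

# pr_val of the five top rows and three of the bottom rows from the preview.
pr_sample = [
    0.012750407517909759, 0.01721003017179015, 0.010760718435453807,
    0.008252403881908257, 0.006785530343703675,
    4.481314067744052e-09, 4.481314067744052e-09, 4.481314067744052e-09,
]
print(log10_histogram(pr_sample))
```

A key of -9 collects values from 1e-09 up to (but not including) 1e-08, which is where the minimum pagerank of the file falls.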

Distribution of number of hosts

The graph below presents the plot of the count of n_hosts column values. It shows us that the distribution of the number of hosts (subdomains) over the 87 million domains is highly right skewed, meaning that the majority of the domains have a low number of subdomains.

Taking a closer look at n_hosts, limited to domains with between 100 and 2000 hosts, we observe the same type of distribution.


Distribution of harmonic centrality 

The graph below presents the plot of the count of harmonicc_val column values. It shows us that the distribution of harmonicc_val over the 87 million domains is not highly right skewed like the pagerank and number of hosts distributions. It is not a perfect Gaussian distribution, but it is closer to Gaussian than the distributions of pagerank and number of hosts. This distribution is multimodal.


Scatter plot of pagerank and harmonic centrality 

As the majority of domains have low pagerank, the scatter plot of the domains' pagerank and harmonic centrality values shows a vertical red line. However, the domains' pagerank values begin to detach from the mass when their harmonic centrality value approaches 1e7, and this detachment accelerates when it is greater than 1e7.

Pagerank V Harmonic Centrality

Scatter plot of pagerank and harmonic centrality by number of hosts

On this scatter plot of pagerank and harmonic centrality values, red points show domains with n_hosts less than 10, and green points show domains with n_hosts greater than or equal to 10.

Pagerank V Harmonic Centrality by Number of Subdomains

Querying domains

Top domains in US

amazon.com, ebay.com, reddit.com

{'harmonicc_pos': array([22]),
 'harmonicc_val': array([17583026.]),
 'pr_pos': array([37]),
 'pr_val': array([0.00084049]),
 'host_rev': array(['com.amazon'], dtype='<U83'),
 'n_hosts': array([749]),
 'index': array([21])}
{'harmonicc_pos': array([241]),
 'harmonicc_val': array([16102642.]),
 'pr_pos': array([206]),
 'pr_val': array([0.00010764]),
 'host_rev': array(['com.ebay'], dtype='<U83'),
 'n_hosts': array([936]),
 'index': array([240])}
{'harmonicc_pos': array([61]),
 'harmonicc_val': array([16686224.]),
 'pr_pos': array([105]),
 'pr_val': array([0.00028783]),
 'host_rev': array(['com.reddit'], dtype='<U83'),
 'n_hosts': array([1535]),
 'index': array([60])}
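
The lookups in this section rely on writing a domain in the reversed host_rev form. A minimal sketch, assuming the rows are available as dictionaries keyed by column name (to_host_rev and build_index are hypothetical helpers, and only some columns are kept in the sample):

```python
def to_host_rev(domain):
    """Reverse the dot-separated labels: 'amazon.co.uk' -> 'uk.co.amazon'."""
    return ".".join(reversed(domain.strip().lower().split(".")))

def build_index(rows):
    """Map host_rev to its row for constant-time lookups."""
    return {row["host_rev"]: row for row in rows}

# Rank positions and host counts taken from the query results above.
rows = [
    {"harmonicc_pos": 22, "pr_pos": 37, "host_rev": "com.amazon", "n_hosts": 749},
    {"harmonicc_pos": 241, "pr_pos": 206, "host_rev": "com.ebay", "n_hosts": 936},
    {"harmonicc_pos": 61, "pr_pos": 105, "host_rev": "com.reddit", "n_hosts": 1535},
]
index = build_index(rows)

for domain in ("amazon.com", "ebay.com", "reddit.com", "example.invalid"):
    # Domains missing from the data yield None.
    print(domain, index.get(to_host_rev(domain)))
```

A dictionary index costs memory on 87 million rows, so on the full file a single streaming pass that keeps only the wanted host_rev values may be the more practical choice.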

Top domains in UK

amazon.co.uk, ebay.co.uk, bbc.co.uk

{'harmonicc_pos': array([230]),
 'harmonicc_val': array([16126449.]),
 'pr_pos': array([201]),
 'pr_val': array([0.00011029]),
 'host_rev': array(['uk.co.amazon'], dtype='<U83'),
 'n_hosts': array([76]),
 'index': array([229])}
{'harmonicc_pos': array([1501]),
 'harmonicc_val': array([15403167.]),
 'pr_pos': array([1730]),
 'pr_val': array([1.98171142e-05]),
 'host_rev': array(['uk.co.ebay'], dtype='<U83'),
 'n_hosts': array([330]),
 'index': array([1500])}
{'harmonicc_pos': array([108]),
 'harmonicc_val': array([16438657.]),
 'pr_pos': array([169]),
 'pr_val': array([0.00014236]),
 'host_rev': array(['uk.co.bbc'], dtype='<U83'),
 'n_hosts': array([342]),
 'index': array([107])}

Top domains in France

leboncoin.fr, orange.fr, amazon.fr

{'harmonicc_pos': array([8419]),
 'harmonicc_val': array([14927038.]),
 'pr_pos': array([16310]),
 'pr_val': array([1.98940531e-06]),
 'host_rev': array(['fr.leboncoin'], dtype='<U83'),
 'n_hosts': array([45]),
 'index': array([8418])}
{'harmonicc_pos': array([14503]),
 'harmonicc_val': array([14825333.]),
 'pr_pos': array([2288]),
 'pr_val': array([1.37111161e-05]),
 'host_rev': array(['fr.orange'], dtype='<U83'),
 'n_hosts': array([2860]),
 'index': array([14502])}
{'harmonicc_pos': array([907]),
 'harmonicc_val': array([15575607.]),
 'pr_pos': array([681]),
 'pr_val': array([4.26634194e-05]),
 'host_rev': array(['fr.amazon'], dtype='<U83'),
 'n_hosts': array([40]),
 'index': array([906])}

Top domains in Turkey

sahibinden.com, hurriyet.com.tr, n11.com.tr

{'harmonicc_pos': array([20895]),
 'harmonicc_val': array([14758365.]),
 'pr_pos': array([38627]),
 'pr_val': array([8.94421713e-07]),
 'host_rev': array(['com.sahibinden'], dtype='<U83'),
 'n_hosts': array([1044]),
 'index': array([20894])}
{'harmonicc_pos': array([15263]),
 'harmonicc_val': array([14816253.]),
 'pr_pos': array([5344]),
 'pr_val': array([5.60567653e-06]),
 'host_rev': array(['tr.com.hurriyet'], dtype='<U83'),
 'n_hosts': array([144]),
 'index': array([15262])}
{'harmonicc_pos': array([8034921]),
 'harmonicc_val': array([11943149.]),
 'pr_pos': array([1077872]),
 'pr_val': array([4.68284367e-08]),
 'host_rev': array(['tr.com.n11'], dtype='<U83'),
 'n_hosts': array([5]),
 'index': array([8034920])}

My blog's domain


{'harmonicc_pos': array([17769587]),
 'harmonicc_val': array([11030533.]),
 'pr_pos': array([3314413]),
 'pr_val': array([1.94330501e-08]),
 'host_rev': array(['com.searchdatalogy'], dtype='<U83'),
 'n_hosts': array([1]),
 'index': array([17769586])}

The domain with maximum number of subdomains

I was curious which domain has the maximum n_hosts value, so I queried for it. The answer is everyone.domains.
{'harmonicc_pos': array([22768913]),
 'harmonicc_val': array([10713943.]),
 'pr_pos': array([54517530]),
 'pr_val': array([4.58480231e-09]),
 'host_rev': array(['domains.everyone'], dtype='<U83'),
 'n_hosts': array([26061259]),
 'index': array([22768912])}

List of domains with 10 K or more subdomains

List of top 20 domains having n_hosts >= 10000

array(['com.wordpress', 'com.blogspot', 'com.tumblr', 'com.yahoo',
       'com.github', 'com.gstatic', 'com.amazonaws',
       'com.googleusercontent', 'com.weebly', 'net.cloudfront',
       'io.github', 'net.doubleclick', 'com.appspot', 'com.squarespace',
       'com.deviantart', 'net.sourceforge', 'com.googlecode', 'com.wix',
       'com.live', 'com.list-manage'], dtype='<U83')
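
The two n_hosts queries above can be sketched as follows. The sketch assumes that "top 20" means the first 20 matching domains ordered by harmonic centrality position, which the post does not state; the function names and the "test.big-one" row are made up for illustration.

```python
def domain_with_most_hosts(rows):
    """Return the row with the largest n_hosts value."""
    return max(rows, key=lambda row: row["n_hosts"])

def top_domains_with_many_hosts(rows, min_hosts=10000, limit=20):
    """host_rev of the first `limit` rows with n_hosts >= min_hosts,
    ordered by harmonicc_pos (assumed ordering)."""
    selected = [row for row in rows if row["n_hosts"] >= min_hosts]
    selected.sort(key=lambda row: row["harmonicc_pos"])
    return [row["host_rev"] for row in selected[:limit]]

rows = [
    {"harmonicc_pos": 1, "host_rev": "com.facebook", "n_hosts": 7348},
    {"harmonicc_pos": 22, "host_rev": "com.amazon", "n_hosts": 749},
    {"harmonicc_pos": 22768913, "host_rev": "domains.everyone", "n_hosts": 26061259},
    # Made-up row, only here to exercise the threshold.
    {"harmonicc_pos": 500, "host_rev": "test.big-one", "n_hosts": 10000},
]

print(domain_with_most_hosts(rows)["host_rev"])
print(top_domains_with_many_hosts(rows))
```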

Majestic million data 

Majestic provides open public data on the top 1 million domains at this URL: http://downloads.majestic.com/majestic_million.csv

Preview of majestic million file

# globalrank domain tld refsubnets refips
0 1 'google.com' 'com' 481744 3048605
1 2 'facebook.com' 'com' 467244 3085825
2 3 'youtube.com' 'com' 427535 2507495
3 4 'twitter.com' 'com' 417571 2494867
4 5 'microsoft.com' 'com' 313090 1188308
... ... ... ... ... ...
999,995 999996 'bauordnungen.de' 'de' 358 491
999,996 999997 'helios.eu' 'eu' 358 491
999,997 999998 'chinabi.net' 'net' 358 491
999,998 999999 'adammilstein.org' 'org' 358 491
999,999 1000000 'beckers.se' 'se' 358 491

Statistics of majestic million data

  • Refsubnets 

mean, min, max of refsubnets  = 1068.226617, 3.58000e+02, 4.81744e+05

  • Refips

mean, min, max of refips  = 1440.785989, 3.640000e+02, 3.085825e+06

  • Correlations

correlation(refsubnets, refips) =  0.87576101

Merging with majestic million data 

After converting the domain information in the Majestic data to the host_rev format, stored as mhost_rev, I summed the refips and refsubnets values for each mhost_rev and removed the duplicates.
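
A minimal sketch of this aggregation step. The exact conversion used for mhost_rev is not shown in this post, so the default key below (lower-casing and reversing the labels) is only an assumption, and a different conversion can be passed in as to_key. The function names and the sample numbers are made up for illustration.

```python
from collections import defaultdict

def reverse_labels(domain):
    """Assumed conversion: lower-case the domain and reverse its labels."""
    return ".".join(reversed(domain.strip().lower().split(".")))

def aggregate_majestic(rows, to_key=reverse_labels):
    """Sum refips and refsubnets over all entries sharing one mhost_rev.

    `rows` holds (domain, refsubnets, refips) tuples, following the
    column order of the Majestic preview.
    """
    totals = defaultdict(lambda: {"refips_sum": 0, "refsubnets_sum": 0})
    for domain, refsubnets, refips in rows:
        key = to_key(domain)
        totals[key]["refips_sum"] += refips
        totals[key]["refsubnets_sum"] += refsubnets
    return dict(totals)

# Made-up entries: the first two collapse into the same mhost_rev.
sample = [
    ("example.com", 10, 20),
    ("Example.com", 1, 2),
    ("example.org", 5, 7),
]
totals = aggregate_majestic(sample)
print(totals)
```

Because every key appears once in the resulting dictionary, the duplicates are removed as a side effect of the summation.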

Preview of majestic million data after transformation

Below is the preview of the Majestic Million data after this transformation.

# mhost_rev refips_sum refsubnets_sum
0 'com.google' 8821359 2337390
1 'com.facebook' 3501915 651350
2 'com.youtube' 2507495 427535
3 'com.twitter' 2640942 496545
4 'com.microsoft' 1560747 495154
... ... ... ...
998,411 'de.bauordnungen' 491 358
998,412 'eu.helios' 491 358
998,413 'net.chinabi' 491 358
998,414 'org.adammilstein' 491 358
998,415 'se.beckers' 491 358

Preview of common crawl and majestic million data join

I then merged the two data sets on host_rev and mhost_rev. Below is the preview of this join; rows without a match in the Majestic data show "--".

# harmonicc_pos harmonicc_val pr_pos pr_val host_rev n_hosts mhost_rev refips_sum refsubnets_sum
0 1 24993276.0 2 0.012750407517909759 'com.facebook' 7348 'com.facebook' 3501915 651350
1 2 24671056.0 1 0.01721003017179015 'com.googleapis' 1904 'com.googleapis' 59541 35878
2 3 23453366.0 3 0.010760718435453807 'com.google' 3328 'com.google' 8821359 2337390
3 4 22371572.0 4 0.008252403881908257 'com.twitter' 1138 'com.twitter' 2640942 496545
4 5 22136836.0 5 0.006785530343703675 'com.youtube' 2985 'com.youtube' 2507495 427535
... ... ... ... ... ... ... ... ... ...
87,160,627 87160628 0.0 87122763 4.481314067744052e-09 'zw.org.mdc' 1 -- -- --
87,160,628 87160629 0.0 87122764 4.481314067744052e-09 'zw.org.partnersforlife' 1 -- -- --
87,160,629 87160630 0.0 87122765 4.481314067744052e-09 'zw.org.yamurai' 1 -- -- --
87,160,630 87160631 0.0 87122766 4.481314067744052e-09 'zw.org.yard' 1 -- -- --
87,160,631 87160632 0.0 87122767 4.481314067744052e-09 'zw.org.youthalive' 1 -- -- --

Preview of final dataset

After dropping the rows that contain missing values, 953,972 domains are left. Below is a preview of this final dataset.

# harmonicc_pos harmonicc_val pr_pos pr_val host_rev n_hosts mhost_rev refips_sum refsubnets_sum
0 1 24993276.0 2 0.012750407517909759 'com.facebook' 7348 'com.facebook' 3501915 651350
1 2 24671056.0 1 0.01721003017179015 'com.googleapis' 1904 'com.googleapis' 59541 35878
2 3 23453366.0 3 0.010760718435453807 'com.google' 3328 'com.google' 8821359 2337390
3 4 22371572.0 4 0.008252403881908257 'com.twitter' 1138 'com.twitter' 2640942 496545
4 5 22136836.0 5 0.006785530343703675 'com.youtube' 2985 'com.youtube' 2507495 427535
... ... ... ... ... ... ... ... ... ...
953,967 87153912 0.0 87116072 4.481314067744052e-09 'za.co.landmate' 1 'za.co.landmate' 711 475
953,968 87153921 0.0 87116081 4.481314067744052e-09 'za.co.langebaancpf' 1 'za.co.langebaancpf' 856 595
953,969 87153940 0.0 87116100 4.481314067744052e-09 'za.co.lasercorp' 1 'za.co.lasercorp' 915 737
953,970 87154893 0.0 87117051 4.481314067744052e-09 'za.co.misternat' 1 'za.co.misternat' 441 369
953,971 87160530 0.0 87122665 4.481314067744052e-09 'zw.co.helpstarsmedicaltrust' 1 'zw.co.helpstarsmedicaltrust' 1119 925
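
The join and the removal of incomplete rows can be sketched as a left join followed by a filter. This is an illustration in plain Python rather than the code behind the tables; left_join and drop_missing are hypothetical names, and None stands in for the "--" placeholders.

```python
def left_join(cc_rows, majestic):
    """Attach Majestic sums to each Common Crawl row; None marks a missing match.

    `majestic` maps mhost_rev to {"refips_sum": ..., "refsubnets_sum": ...}.
    """
    missing = {"mhost_rev": None, "refips_sum": None, "refsubnets_sum": None}
    for row in cc_rows:
        match = majestic.get(row["host_rev"])
        if match is None:
            yield {**row, **missing}
        else:
            yield {**row, "mhost_rev": row["host_rev"], **match}

def drop_missing(rows):
    """Keep only the rows in which every column has a value."""
    return [row for row in rows if all(value is not None for value in row.values())]

# One matched and one unmatched row, with values from the previews above.
cc_rows = [
    {"harmonicc_pos": 1, "pr_pos": 2, "host_rev": "com.facebook", "n_hosts": 7348},
    {"harmonicc_pos": 87160628, "pr_pos": 87122763, "host_rev": "zw.org.mdc", "n_hosts": 1},
]
majestic = {"com.facebook": {"refips_sum": 3501915, "refsubnets_sum": 651350}}

joined = list(left_join(cc_rows, majestic))
final = drop_missing(joined)
print(len(joined), len(final))
```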

Statistics on the final 954 K domains

  • The correlations below show us that both refips and refsubnets are correlated with pagerank values, refips (0.61) more strongly than refsubnets (0.52), while the number of hosts, as seen at the beginning, is not directly correlated with pagerank.

correlation("pr_val", "refips_sum"),correlation("pr_val", "refsubnets_sum"),correlation("pr_val", "n_hosts")

0.60769659, 0.5162285, 0.06663268

  • The following correlations show us that the refsubnets, refips and number of hosts of domains are not strongly correlated with harmonic centrality values.

correlation("harmonicc_val", "refips_sum"),correlation("harmonicc_val", "refsubnets_sum"),correlation("harmonicc_val", "n_hosts")

0.06194632, 0.11714723, 0.01232035


We can add more data to this dataset, such as the geographical location of the domains' hosting or their web performance. We can create ML classification or prediction models with the final data. A detailed data analysis at the TLD level could also reveal surprising insights.


