Data Science 101: Examining the Requests Made by the Top 100 Sites

For our latest installment of the insideBIGDATA Data Science 101 series, I thought I’d do something a bit different. Here is a sample analysis by data scientist and blogger Dan Goldin, who published some nice results using R to assess the web requests made by the top 100 Internet sites. It goes to show that if you’re innovative in your use of publicly available data sets (in this case Alexa), you can use the principles of data science to gain insight into all sorts of things. Data tell no lies!

You can check out the full analysis HERE, including a link to the R code on GitHub. Dan wrote a script to load each of the top 100 Alexa sites and capture every linked resource along with its type. The resulting data set contained the time each page took to load in full, as well as the content type of each linked file. He imported the data into R and computed site performance metrics and visualizations such as:

  • Average page load time
  • Load time boxplot
  • Number of requests
  • Number of requests vs. load time (including a linear fit model)
  • File type frequency
  • File types by URL
  • File type correlation
  • Multiple linear regression
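
To make a few of these metrics concrete, here is a minimal sketch in Python. It is not Dan’s R code, and the site names, load times and content types are invented purely for illustration (they are not taken from the Alexa data). It assumes each page has been reduced to a total load time plus a list of content types, one per request, and computes the average load time, the number of requests per page, the file type frequency, and a simple least-squares fit of load time against request count.

```python
from collections import Counter
from statistics import mean

# Hypothetical data, invented for illustration only (NOT the Alexa results).
# Each page has a total load time in seconds and one content type per request.
pages = {
    "a.example": {"load_time": 1.0,
                  "types": ["text/html", "image/png"]},
    "b.example": {"load_time": 2.0,
                  "types": ["text/html", "image/png", "image/png", "text/css"]},
    "c.example": {"load_time": 3.0,
                  "types": ["text/html", "image/png", "image/png", "text/css",
                            "application/javascript", "image/jpeg"]},
}


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Average page load time across all pages.
load_times = [page["load_time"] for page in pages.values()]
average_load_time = mean(load_times)

# Number of requests per page (one request per recorded content type).
request_counts = [len(page["types"]) for page in pages.values()]

# File type frequency across every request on every page.
type_frequency = Counter(
    content_type for page in pages.values() for content_type in page["types"]
)

# Number of requests vs. load time, with a linear fit.
slope, intercept = linear_fit(request_counts, load_times)

print("average load time:", average_load_time)
print("requests per page:", request_counts)
print("most common file types:", type_frequency.most_common(3))
print("fitted seconds per extra request:", slope)
```

With real data the same structure would simply hold one entry per site; the fitted slope then estimates how much each additional request adds to the load time, under the (strong) assumption that the relationship is linear.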

File Types Correlation Plot
