I am now brushing up on my Python programming for my upcoming TA position. I’ve been playing with some interesting external packages such as numpy, scipy, networkx, and matplotlib.
Here is the first product of a few days of studying Python visualization.
The dataset used is called “MSNBC.com Anonymous Web Data Data Set”, which is a collection of web browsing logs from the MSNBC.com website in 1999. The pages within the site are classified into 17 different categories: frontpage, news, tech, local, opinion, on-air, misc, weather, msn-news, health, living, business, msn-sports, sports, summary, bbs, travel.
The dataset holds about one million records, each tracking one user. Each row is the sequence of page categories that user visited, and looks like this.
1 1
2
3 2 2 4 2 2 2 3 3
5
1
6
1 1
6
6 7 7 7 6 6 8 8 8 8
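Rows like these can be turned into page-to-page transition counts with a few lines of Python. The following is only a minimal sketch, not the code behind the visualization: it assumes that the number k in a row stands for the k-th category in the list above, and the helper names `parse_row` and `count_transitions` are made up for illustration. A real data file may also carry header lines to skip, which this sketch does not handle.

```python
from collections import Counter

# Category names in the order the post lists them. Assumption: the number k
# in a row stands for the k-th name in this list.
CATEGORIES = [
    "frontpage", "news", "tech", "local", "opinion", "on-air", "misc",
    "weather", "msn-news", "health", "living", "business", "msn-sports",
    "sports", "summary", "bbs", "travel",
]


def parse_row(line):
    """Turn one row such as '3 2 2 4' into a list of category names."""
    return [CATEGORIES[int(token) - 1] for token in line.split()]


def count_transitions(rows):
    """Count consecutive page-to-page moves over all rows."""
    counts = Counter()
    for row in rows:
        pages = parse_row(row)
        # zip pairs each page with the one visited right after it.
        for src, dst in zip(pages, pages[1:]):
            counts[(src, dst)] += 1
    return counts


# The sample rows shown above.
sample = ["1 1", "2", "3 2 2 4 2 2 2 3 3", "5", "1", "6", "1 1", "6",
          "6 7 7 7 6 6 8 8 8 8"]
transitions = count_transitions(sample)
print(transitions.most_common(3))
```

Pairs such as `("news", "news")` are self-loops, meaning the user clicked from one page to another within the same category.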
The movie above is based on moving averages calculated as the program reads through the data file. Below is the visualization based on the entire dataset.
Quick-and-dirty findings are:
- Frontpage rocks. (Of course it should.)
- Basically, the whole thing looks like a hub structure: strong links exist between frontpage and each of news, business, on-air, sports, local, and misc.
- There is a mid-strength link between sports and msn-sports, but virtually no link between news and msn-news.
- Weather is isolated from the other pages, but it generates a fair number of clicks within itself.
- Etc. Etc….
For those who cannot see the legend in the movie, I’ll repeat it here: (edge width, node size, node color) represent (# of users who passed along the link, # of users who visited the page, self-loop ratio), respectively.
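To make the legend concrete, here is a rough sketch of how those three quantities could be computed. It is not the original code, and it rests on several assumptions: links are treated as undirected, each user counts at most once per link and per page, and the self-loop ratio is taken to be the share of moves out of a page that land on the same page. The function name `graph_metrics` and the three tiny rows are invented for illustration.

```python
from collections import Counter


def graph_metrics(rows):
    """Return (edge_users, node_users, self_loop_ratio) for parsed rows.

    Each row is a list of page categories in visit order, one row per user.
    """
    edge_users = Counter()   # users who moved between a pair of pages
    node_users = Counter()   # users who visited a page at least once
    self_moves = Counter()   # moves that stay within the same category
    all_moves = Counter()    # all moves leaving a category

    for pages in rows:
        moves = list(zip(pages, pages[1:]))
        for page in set(pages):
            node_users[page] += 1
        # Undirected links between different pages, counted once per user.
        for edge in {frozenset(m) for m in moves if m[0] != m[1]}:
            edge_users[edge] += 1
        for src, dst in moves:
            all_moves[src] += 1
            if src == dst:
                self_moves[src] += 1

    # Assumed definition: share of moves out of a page that return to it.
    self_loop_ratio = {p: self_moves[p] / all_moves[p] for p in all_moves}
    return edge_users, node_users, self_loop_ratio


# Made-up rows, only to exercise the function.
rows = [
    ["frontpage", "news", "news", "frontpage"],
    ["frontpage", "sports"],
    ["weather", "weather", "weather"],
]
edges, nodes, ratios = graph_metrics(rows)
print(edges[frozenset({"frontpage", "news"})])
print(nodes["frontpage"])
print(ratios["weather"])
```

The three results could then be handed to a drawing library as edge widths, node sizes, and node colors. Calling the function on only the most recent rows, rather than the whole file, would be one way to get frame-by-frame values like those in the movie, though that is a guess at the approach.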
The code is too messy for me to want to upload it here, but if you would like to look at it, please let me know by leaving a comment. Thanks.