Article ID: Q20051226-01
Q: How does Google Analytics compare to log file analysis tools, such as Stone Steps Webalizer?
A: Website usage reports generated by various analysis tools differ so much from one another that it only makes sense to compare the methods of collecting website usage data, not the actual reports. Let's first review how usage data is collected in each case.
All servers, including web servers, produce log files, which are usually used for security and traffic analysis purposes. A typical sequence of requests and responses resulting in a log entry being added to the log file is shown in the picture below. At some point, the log file is processed with a log file analysis tool, such as Stone Steps Webalizer, to generate website traffic analysis reports.
The log file format differs greatly from server to server, but in general, most log files contain information that identifies the visitor (e.g. IP address, user name, cookies), the requested resource (e.g. the request URL, including query strings), the type of the user agent (e.g. browser or spider type), the referring page (e.g. the URL of a page containing a link to the requested page), and some other data, such as the request method, request processing time, response size, etc.
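As a concrete illustration, here is a minimal Python sketch that parses one line in the common "combined" log format. This is just one widely used layout; actual formats vary from server to server, and the field names below are our own labels, not part of any standard:

```python
import re

# One common layout: the Apache/NCSA "combined" log format.
# host ident user [time] "method url protocol" status size "referrer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# A made-up sample entry in the combined format.
sample = ('203.0.113.7 - - [26/Dec/2005:10:15:32 -0500] '
          '"GET /page.html?id=3 HTTP/1.1" 200 5124 '
          '"http://example.com/index.html" "Mozilla/5.0"')
hit = parse_line(sample)
```

Note how the visitor (host), the requested resource including its query string (url), the referring page, and the user agent are all recoverable from a single line, which is exactly the data set described above.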
Google Analytics, by contrast, collects usage data on the client side: a script embedded in each page reports page activity by requesting a small image from Google's servers. Because this image URL points to www.google-analytics.com, it is easy to stay invisible to Google Analytics by blocking requests for this file or to this domain.
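The "fake image" delivery mechanism can be sketched in Python. Everything here is illustrative: the beacon path /__a.gif is borrowed from the example later in this article, and the parameter names are assumptions, not Google Analytics' actual ones.

```python
from urllib.parse import urlencode

# Illustrative sketch of how a client-side script delivers usage data:
# it encodes page details as query parameters of a tiny image request.
# The beacon path and parameter names are hypothetical, not Google's.
def beacon_url(page, language):
    params = urlencode({"url": page, "lang": language})
    return "http://www.google-analytics.com/__a.gif?" + params
```

Blocking this one host (or this one image) therefore silences the entire data stream, because every report travels through requests of this shape.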
Website traffic analysis can only be as good as the logged data. The table below lists, side by side, the data items that can be collected by each of the described methods. A blue checkmark indicates that the data item may be collected by the respective method. A gray checkmark indicates that while it is technically possible to collect the data item, some additional work is required to do so. A missing checkmark indicates that the data item cannot be collected by the respective method.
In general, web server log files contain information about the actual traffic served by the web server, while the method based on client-side scripting collects data about the actual traffic generated by the client. These two types of traffic are not the same because of the various caching devices and/or components located between the client and the server (e.g. if the requested page is served from the browser cache, it will not be logged, but the script on the page will still be executed).
Stone Steps Webalizer does not provide any support for processing the data items marked with gray checkmarks beyond its standard analysis functionality. That is, if a page (e.g. /page.html) contains a usage analysis script that sends a special request to the website monitored with Stone Steps Webalizer (e.g. /__a.gif?url=/page.html&lang=ja), this script-generated URL will be logged by the web server and processed by Stone Steps Webalizer as a standard URL; no additional correlation (i.e. that /page.html was requested by a Japanese visitor) will be made between the original page and the client-script-provided information about it.
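For illustration only, the missing correlation step could look something like the Python sketch below, which folds logged hits on the beacon URL back onto the page named in its query string. The /__a.gif path follows the article's own example; this is not Webalizer functionality:

```python
from urllib.parse import urlsplit, parse_qs
from collections import Counter

# Sketch of the correlation Webalizer does NOT perform: attribute each
# beacon hit (e.g. /__a.gif?url=/page.html&lang=ja) to the original page
# and the language the client-side script reported for it.
def language_by_page(logged_urls):
    counts = Counter()
    for url in logged_urls:
        parts = urlsplit(url)
        if parts.path == "/__a.gif":
            qs = parse_qs(parts.query)
            page = qs.get("url", ["?"])[0]
            lang = qs.get("lang", ["?"])[0]
            counts[(page, lang)] += 1
    return counts

# Made-up sample of logged request URLs.
hits = ["/__a.gif?url=/page.html&lang=ja",
        "/__a.gif?url=/page.html&lang=en",
        "/__a.gif?url=/page.html&lang=ja",
        "/other.html"]
```

Without a step like this, the beacon URL is counted as an ordinary page in its own right, and the language information it carries is simply part of an opaque query string.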
One important quality of server-side logging is that all website activity is always recorded. Of course, it is possible for a visitor to disguise information that may help the website administrator identify the visitor's whereabouts or browser configuration (e.g. the IP address, browser type, the referring page, etc.), but the administrator will still always know whether somebody is trying to break into the forum administrative interface, hot-link website images, or scrape some pages, and can take preventive measures.
Client-side scripting, on the other hand, does not have this quality. In fact, one does not even need any special software to stay completely invisible to Google Analytics and other client-side scripting-based analysis engines that use fake images to deliver usage data - simply checking the Load images ... for the originating website only checkbox in the Firefox configuration (Tools > Options > Content) does the trick.
With this in mind, keep website logging turned on at all times, even if the logs are not being analyzed on a day-to-day basis.
Any client-side technology should be considered unreliable by definition, regardless of whether it concerns form validation or website usage analysis. Any client-side technology that is not backed by some form of server-side support (e.g. server-generated digital signatures) should be trusted even less and used for entertainment purposes only. Granted, some client-side scripting website usage analysis tools, such as Google Analytics, may provide quite interesting information about website visitors; however, such tools should be considered a useful addition to everyday log-based analysis, not a replacement for it.