Measuring the Internet
Properly analyzing good data can help you improve your Web site
by Philip Stein
Virtually any activity can be measured, and most activities have many parameters worth measuring. When we decide to accomplish something, especially in business, we like to know whether we accomplished it and how well we succeeded. Getting the measurement right can be tricky because it usually takes a lot of technical knowledge and a good understanding of the product being measured. This is especially true when measuring the Internet.
Measuring the Internet has been a subject of great interest to me recently because I'm trying to improve my own Web site, and I'd like to know if my changes are having any effect. There are other things related to the Web that can be measured. For example, I don't have a high speed connection to the Net, but even so, sometimes it seems extra slow. I'd love to know whether in fact there is a problem and if there's anything I can do about it from my end.
Two things to measure
Measuring the effectiveness of a Web site is one of a whole class of measurements I will call "producer's data" because this class is mostly of interest to content producers. Measuring download speed (network performance) or accuracy (relevancy) of a search could be named "consumer's data" because it measures the quality of the user's experience.
Deciding how to make either kind of measurement requires knowing something about how the Internet, especially the World Wide Web, works. If you understand a few of the details, some measurements and processes that seem to make no sense will become quite clear.
How the Web works
Communications on the Internet are made up of messages. Your e-mail message to a friend or your purchase request to a commercial site is broken up into short packets, sent over the wires (or fibers or other means) and reassembled in the correct order at the receiving end. Most of the time, this process is invisible to the user and to the service provider--each end just sends and receives messages.
This process is called "transmission control protocol/Internet protocol" or "TCP/IP" for short. TCP/IP is the underlying message transfer system for all traffic on the Internet; e-mail, file transfer and Web surfing are just samples of what you can find there.
The most important thing you need to know to understand Web site performance measurements (but not necessarily other kinds of traffic) is that each Web message is self-contained and is not associated with other messages you might send, even to the same server one minute later. This is the foundation of the World Wide Web and the genius of the hypertext transfer protocol (HTTP).
My server doesn't need to know who or where you are or when you will be back; it just responds to each request from anywhere as it happens. TCP/IP can support ongoing sessions consisting of multiple messages, but HTTP--which uses TCP/IP for transmission (as do other processes such as e-mail)--does not have sessions.
If you are surfing a Web site like mine, for example, you may send a request to view a page, and my server will return that page to your browser. As soon as that page is sent, my server completely forgets it is dealing with you. If you want to follow one of my site's links and see another page, you must send me another message.
This makes measuring producer data difficult. Suppose I simply count page requests. I can't tell whether you went to the site, discovered it wasn't what you were looking for and never returned. I would like to differentiate that kind of hit-and-run activity from the kind where you linger, check out a few pages, perhaps revisit some you've seen before and then e-mail me with a question.
The trouble is that each time you ask for a page, the Internet doesn't distinguish your request from anyone else's. There is no inherent session that can be counted as a single larger entity, an aggregation over time of several page requests. Web sites that need to follow a series of requests through a session will place a cookie on your browser so the server can recognize you on each later request. Cookies are small digital ID tags placed on client machines by servers and are used by almost all e-business sites.
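As a rough illustration of how a server might use a cookie to recognize a returning browser, here is a minimal Python sketch. The cookie name visitor_id and the function identify_visitor are invented for this example; a real e-business site would add expiry dates, security settings and server-side record keeping on top of this idea.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name, chosen for this sketch


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the raw value of the request's Cookie header ("" if absent).
    set_cookie_value is None for a returning visitor, or the value to send in
    a Set-Cookie response header when a new ID had to be issued.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        # Returning visitor: the browser sent back the ID issued earlier.
        return jar[COOKIE_NAME].value, None

    # First request from this browser: issue a fresh random ID.
    visitor_id = uuid.uuid4().hex
    reply = SimpleCookie()
    reply[COOKIE_NAME] = visitor_id
    reply[COOKIE_NAME]["path"] = "/"
    return visitor_id, reply[COOKIE_NAME].OutputString()
```

On the first request the function hands back a new ID plus the header text to send with the page. On every later request the browser returns that ID, which is what lets otherwise unrelated page requests be tied together.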
To add confusion, every element on the page actually generates its own request, so a dozen or two messages (hits) are reported by the server from the loading of one page. All the eye candy surrounding the message, such as graphics and pictures, is made up of separate little image fragments. If I try to count raw hits, I may count 20 or 30 for each page sent.
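One simple way to separate page requests from raw hits is to ignore requests for embedded elements. The sketch below assumes that a file extension is enough to tell the two apart; the extension list and the function name count_hits_and_pages are illustrative choices, not a standard.

```python
from urllib.parse import urlsplit

# Assumption: these extensions mark embedded page elements rather than pages.
ASSET_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def count_hits_and_pages(request_paths):
    """Return (raw_hits, page_views) for an iterable of requested URL paths."""
    hits = 0
    pages = 0
    for raw in request_paths:
        hits += 1
        path = urlsplit(raw).path.lower()  # drop any query string
        if not path.endswith(ASSET_EXTENSIONS):
            pages += 1
    return hits, pages
```

For a list of five requests made up of two HTML pages, two images and a style sheet, this returns five hits but only two page views.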
On top of that, browsers have caches--little pieces of memory used to store images and other fragments in your own computer so they don't have to be transmitted again. Sometimes a page request will generate no hits at all but will simply reload from the cache.
This Web architecture means that, in order to measure user behavior, such as how long visitors spend on your site, you need some statistical interpretation software such as Accrue's Hitlist, www.accrue.com, or Media House's Live St@ts, www.mediahouse.com, on your server.
Statistical models define a session, for example, as all HTTP requests from a single source with gaps of less than 15 minutes between successive requests. Although this is an arbitrary definition of a session, it is an accepted one. Interpretation also yields your entry and exit pages, so producers can learn, in aggregate, where users enter and leave the site.
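The 15-minute rule is easy to express in code. The following Python sketch groups requests into sessions and reports each session's entry and exit pages. The input format, a list of (source, timestamp, page) tuples, and the function names are assumptions made for illustration; commercial packages surely do more than this.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=15)  # the cutoff described above


def build_sessions(requests):
    """Group (source, timestamp, page) tuples into sessions.

    A new session starts whenever a source has been silent for
    SESSION_GAP or longer. Returns a list of dicts, one per session.
    """
    by_source = defaultdict(list)
    for source, when, page in requests:
        by_source[source].append((when, page))

    sessions = []
    for source, visits in by_source.items():
        visits.sort()  # oldest request first
        current = None
        last_seen = None
        for when, page in visits:
            if current is None or when - last_seen >= SESSION_GAP:
                current = {"source": source, "start": when, "pages": []}
                sessions.append(current)
            current["pages"].append(page)
            current["end"] = when
            last_seen = when
    return sessions


def entry_and_exit(session):
    """Return the first and last page requested in a session."""
    return session["pages"][0], session["pages"][-1]
```

The difference between a session's "end" and "start" gives a rough time on site, although it cannot capture how long the visitor spent reading the last page.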
Theoretically, each session can be monitored. In practice, Webmasters are more interested in tracking users who arrive at the site via specific links, banners or affiliates.
For consumer's data, TCP/IP has a number of capabilities not often used by the casual Web surfer but convenient for measurement. A ping command followed by a Web site's host name (the server portion of its uniform resource locator, or URL) will return a measurement of the round-trip time to that address and back. If nothing comes back, the address is incorrect, or the machine is there but not responding.
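A script can take a similar measurement. Ping itself uses a separate protocol (ICMP) that ordinary programs often are not permitted to send, so the sketch below substitutes a different technique: it times how long it takes to open a TCP connection, which costs about one network round trip. The function name tcp_round_trip_ms and its defaults are assumptions for this example.

```python
import socket
import time


def tcp_round_trip_ms(host, port=80, timeout=3.0):
    """Return the time in milliseconds to open a TCP connection, or None.

    This is not ICMP ping. It times the TCP handshake (plus the name
    lookup, if host is a name), which takes about one round trip.
    None means the host did not answer in time, refused the connection,
    or the name could not be resolved.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
    except OSError:
        return None
    return elapsed * 1000.0
```

Calling tcp_round_trip_ms("www.measurement.com") several times over a day and comparing the results is one way to see whether a connection really is slower than usual.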
Programs like Net Medic, www.netmedic.com, use ping and other invisible tools to diagnose your entire network while it's running. Data from programs like this will tell you how much bandwidth you're getting from your connection, how fast the server with which you're communicating is responding and where there are blockages. Most of the time there's little you can do about these problems, but detailed information like this may assist your Internet service provider (ISP) when it is attempting repairs.
Another helpful tool
The most important producer's tool is a statistics server. A sophisticated one will tell you how many sessions (not just hits) your site has received and will also tabulate them according to some interesting parameters: How long did a user stay on? Which pages of your site did he or she visit? Which buttons were clicked? These are just a few.
The best statistics servers will even attempt to identify the path by which each user arrived at your site. For example, this month, 26 users went to www.yahoo.com and clicked "measurement" to find you. This can be a tremendous help in determining where your best leads come from and can enable you to point up certain keywords so they will attract the most attention.
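Path information of this kind typically comes from the referring address that browsers send along with a request. A minimal sketch of the tabulation follows. The query parameter names in SEARCH_PARAMS are guesses, since each search site uses its own, and summarize_referrers is a hypothetical helper rather than a feature of any particular product.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumption: parameter names search sites use to carry the search terms.
SEARCH_PARAMS = ("q", "p", "query")


def summarize_referrers(referrers, own_host):
    """Tally outside referring hosts and any search terms in referrer URLs."""
    hosts = Counter()
    terms = Counter()
    for ref in referrers:
        parts = urlsplit(ref)
        host = parts.netloc.lower()
        if not host or host == own_host:
            continue  # skip blank referrers and clicks within our own site
        hosts[host] += 1
        query = parse_qs(parts.query)
        for name in SEARCH_PARAMS:
            for value in query.get(name, []):
                terms[value.lower()] += 1
    return hosts, terms
```

The two counters answer slightly different questions: the first shows which outside sites send visitors, and the second shows which search words brought them.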
Many of the jumps to your pages will be from other pages on your own site, as users click around. Some of these data need to be interpreted carefully because some ISPs attempt to hide the identity of a surfer. For example, I was curious as to why I had such a large population of users from the state of Virginia until it was pointed out that America Online (AOL) is located there. Any hit from an AOL user is simply shown by my statistics server as coming from AOL. Still, these data contain a wealth of information.
One of these days, I'm going to control chart these data to see if changes in listing tactics or Web site content make statistically significant changes in viewer patterns. If so, I'll have a truly powerful tool for continual improvement.
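One plausible way to do this is with an individuals (XmR) chart of sessions per day, sketched below in Python. The choice of chart type is an assumption on my part, as are the function names; the constant 2.66 is the usual factor (3 divided by 1.128) for converting an average moving range into three-sigma limits.

```python
def individuals_chart_limits(counts):
    """Return (center, lower, upper) limits for an individuals (XmR) chart."""
    if len(counts) < 2:
        raise ValueError("need at least two observations")
    center = sum(counts) / len(counts)
    moving_ranges = [abs(b - a) for a, b in zip(counts, counts[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 = 3 / 1.128, the standard constant for moving ranges of size 2.
    spread = 2.66 * mr_bar
    # Counts cannot be negative, so the lower limit is floored at zero.
    return center, max(0.0, center - spread), center + spread


def out_of_control(counts, baseline):
    """Indexes of points in counts that fall outside limits set by baseline."""
    _, lower, upper = individuals_chart_limits(baseline)
    return [i for i, x in enumerate(counts) if x < lower or x > upper]
```

The baseline would be daily session counts from before a change to the site, and counts would be the days after it. Points outside the limits suggest the change made a real difference rather than ordinary day-to-day variation.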
Actually, each request that arrives over TCP/IP gives the server many clues about the identity of the user making it, such as IP address, browser ID, operating system in use by the client, reverse domain name server information and more. These clues are recorded in the server logs along with date and time and can be used to isolate sessions.
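To make this concrete, here is a short Python sketch that pulls those clues out of one log entry. It assumes the server writes the widely used NCSA Combined Log Format; other servers and configurations record different fields, so the pattern would need adjusting. The function name parse_log_line is made up for this example.

```python
import re
from datetime import datetime

# Assumed layout (NCSA Combined Log Format):
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    # The request looks like 'GET /index.html HTTP/1.0'; keep the path part.
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    return fields
```

Treating the combination of host and agent as "a single source" is one reasonable way to isolate sessions, though it will still lump together users who share an address and browser type.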
Several software products specialize in analyzing server logs. One of the leading suppliers is WebTrends, www.webtrends.com. This company's product allows you to do the analysis on or off the server and presents the results as Web pages, graphing trends and other information.
Keywords provide useful information
The most useful pieces of information in these data sets are the keywords used. Your goal when designing and implementing a Web site is to attract lots of attention from your target market, without a great deal of traffic from surfers who aren't interested. If you can identify which words are succeeding in getting users to you and relate these words to how much interest users show (for example, by looking at many pages beyond your home page), you can get a real measurement of which search terms are working best.
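A simple version of that measurement is the average number of pages viewed per visit, broken down by the keyword that brought the visitor in. The sketch below assumes each session has already been reduced to a (keyword, pages viewed) pair; the function name is hypothetical.

```python
from collections import defaultdict


def pages_per_visit_by_keyword(visits):
    """visits: iterable of (keyword, pages_viewed) pairs, one per session.

    Returns {keyword: (visit_count, average_pages_viewed)}.
    """
    totals = defaultdict(lambda: [0, 0])  # keyword -> [visits, pages]
    for keyword, pages_viewed in visits:
        totals[keyword][0] += 1
        totals[keyword][1] += pages_viewed
    return {k: (n, total / n) for k, (n, total) in totals.items()}
```

A keyword that draws many visits but averages close to one page per visit is attracting surfers who leave at once, while a keyword with fewer visits and a higher average may be reaching the target market.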
Once you have this information, you can investigate how these keywords lead searchers to you rather than to someone else. First, you can type a term into a search engine to see whether that engine lists you at all.
For example, I type "link:http://www.measurement.com" (don't type the quotation marks) into a search engine, and the engine returns the pages it has listed that link to that address. Armed with this information, you can tune your site by adding text references on the home page and other pages that use the words you are trying to emphasize.
Search engines use spiders, which are programs that crawl through your Web site looking for keywords. Within limits, the more uses a spider finds, the higher it will score you for that term and the more likely you will be at the top of the charts when someone performs a search on that term.
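Before a spider visits, you can make the same kind of count yourself. The sketch below simply counts whole-word occurrences of chosen terms in a page's text. It is only a rough self-check under that assumption, not a description of how any real search engine scores pages.

```python
import re
from collections import Counter


def count_terms(page_text, terms):
    """Count whole-word, case-insensitive occurrences of each term."""
    words = Counter(re.findall(r"[a-z0-9]+", page_text.lower()))
    return {term: words[term.lower()] for term in terms}
```

Note that this treats "measurement" and "measurements" as different words, so it is worth checking plural and singular forms separately.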
It's also instructive to type your favorite word into a search engine, see what other sites show up and visit those sites. Looking at the top results this way can give you tips as to what those sites did to get the user's attention.
There's an even easier way to rise to the top for a search term: Pay for it. Search engines such as GoTo, www.goto.com, allow listed sites to bid for their choice of keywords. You pay nothing until the site delivers you a reference, and even though it's organized by highest bidder, the priciest search terms bring only 20 cents or so per hit. That's cheap, considering such a user is already pretty well qualified because he or she is searching for your top keyword.
Measurements are the basis for process improvement. By taking good data and analyzing them properly, we have the information we need to improve, and we can demonstrate that our improvement activities, on the Internet or anywhere else, are having the desired effect.
Many thanks are due to Jonah Stein, email@example.com, for his great advice on search engine marketing and for feedback on the preparation of this article.
PHILIP STEIN is a metrology and quality consultant in private practice in Pennington, NJ. He holds a master's degree in measurement science from the George Washington University in Washington, DC, and is an ASQ Fellow.