As a web analytics solutions provider we often find ourselves in situations with customer websites, intranets, or web applications that make it difficult and time-consuming to implement effective web analytics.

This article describes the features of a website that will make it easier and less expensive to implement web analytics. It should be read by anyone involved in the process of building a website.

Terms

Within this document we will refer to a ‘website’. However, in the day of ‘web 2.0’ and the development of Rich Internet Applications (RIAs), the distinction between a website and a web application is becoming increasingly blurred. The web is simply a means of distribution, and whenever I use the term website here it may be taken to mean a traditional website, an intranet or a web application.

Search Engine Optimisation (SEO) is the process by which a website is made more visible in search engines by a variety of methods, so that the content of the website is ranked well by the search engines for relevant search terms. While SEO practices aren’t discussed in detail in this document, we will see how a web analytics friendly website should also be SEO friendly.

Features of bad websites

The following features all conspire to make it more difficult to implement web analytics for a website.

  • Horrible URLs
  • Inability to modify headers/footers on a site-wide basis
  • Incoherent database
  • Heavy use of Flash

Horrible URLs

When I first wrote this in 2008, I saw URLs like:

http://www.broadvision.com/bvsn/bvcom/ep/home.do?tabId=1&BV_SessionID=NNNN1162508521.1196775838NNNN&BV_EngineID=cccdaddlkehkmekcefecefedgfhdfnf.0

I don’t want to be unfair towards BroadVision, I’m sure their product is otherwise very nice, but it seems that anything implemented using the BroadVision engine ends up with URLs looking like that. (In case you didn’t know, one of these is the “About BroadVision” page on their main .com website and the other is the home page; I’ll leave you to figure out which URL corresponds to which resource.) As of 2010 their site looks much nicer.

Surely that can’t be good, can it?

There are so many reasons why this looks wrong. Firstly we have /bvsn/bvcom: assuming that we can infer from the request for www.broadvision.com that this is content for their .com website, why do we need /bvsn/bvcom as well?

/ep, home.do and tabId must be very important – but their functions aren’t terribly clear.

The long BV_SessionID and BV_EngineID parameters are perhaps the worst elements. There is every chance that BV_EngineID will vary depending on what the BroadVision platform is up to and BV_SessionID will vary between different visitors. This will mean that the same resource can be requested with many different URLs, so an analytics report may split the traffic for one page across many entries.
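One way an analytics process might cope with URLs like this is to strip the session-specific parameters before reporting. The following is a minimal sketch in Python, not a description of how any particular analytics product does it; the function name and the choice of which parameters to remove are my own assumptions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these are the only session-specific parameters on the site.
SESSION_PARAMS = {"BV_SessionID", "BV_EngineID"}


def strip_session_params(url):
    """Return the URL with session-specific query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter that is not session-specific, in order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

With this in place, two requests for the same page made by different visitors reduce to the same URL, so they can be counted as one resource.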

Inability to modify page headers/footers on a site-wide basis

Most websites have clear templates, with site-wide headers and footers that can be modified simply to ensure that content is placed on every page. Normally this is used to include global navigation, copyright notices, logos – things that appear everywhere.

This is rarely a problem any more, but some sites still require changes to be made on a per-page basis in order to include something on every page.

Incoherent database

Even with difficult-to-understand URLs it is possible, via the Content Management System (CMS), to look up the URL and determine what the page is actually about. Of course, if the database behind the CMS is undocumented and impossible to understand then this is either no longer an option or becomes more time-consuming and expensive than it should be.

Heavy use of Flash

If a website is implemented using Flash, a Java applet or some other technology where the user navigates content or features without the URL in the browser changing, it may be difficult to understand how much of the content within the Flash has been seen. It is possible and common to instrument websites like this, but it is time-consuming. Furthermore, on the whole, search engines are unable to index this content and won’t be able to direct visitors directly to it.

Avoiding bad features

If you steer development away from the above then you will be left with a website that has the potential to be both SEO and web analytics friendly. The things to aim for are:

  • Human (and search engine) readable URLs
  • Hierarchy (semantic URLs / clear ontology)
  • Single URL for single resource
  • Searches made using HTTP GET requests – the results page for different terms having different URLs
  • Ability to add page tags, possibly via a template
  • Sensible database to look-up further information
  • Lightweight, accessible content where user activity is recordable

Good features

Firstly I’d like you to go away and read the following page, written by Tim Berners-Lee in 1998. It won’t take you long and this will still be here when you’re done:

‘Cool URIs don’t change’: http://www.w3.org/Provider/Style/URI

If you’d like something less authoritative that reinforces the above, then please try the following, written by Jakob Nielsen in 1999:

‘URL as UI’: http://www.useit.com/alertbox/990321.html

Human readable URLs

You know that http://www.example.com/contact-us is probably a page showing contact details, perhaps with a form to allow you to contact the company directly. Search engines will figure this out as well. In the same way, /news and /2008/news or /news/2008 make sense.

There is an SEO benefit to having relevant keywords in the URL. Without making it too long and spammy, you should be able to make it readable and relevant.

The benefit that this has to web analytics is that a report should be available that will show the requests for each of these resources. If you can read the URL in the browser then you will be able to read it in the web analytics reports.

Hierarchy

Take the following (deliberately ambiguous) URL: http://www.example.com/polish. It could be a product page about wax polish or it could be pages for Poland. Whereas […]/products/polish or, better still, […]/products/wax-polish mean one thing and […]/languages/polish means something entirely different.

Establish a meaningful hierarchy for URLs: group products and pages together in a way that makes sense to at least part of your audience.

It may be that some pages naturally fit in two places. If this is the case you could consider having two URLs for the same resource and redirecting one to the other.

Single URL for single resource

Each unique resource should have one and only one canonical URL. Any other URLs that may represent this resource should redirect to the primary version.

What’s the difference between http://www.example.com/ and http://www.example.com/index.html or […]/index.htm or […]/default.htm or […]/home.do or […]/page.asp?id=0? It may be that they all host the same content. Without the ‘one URL’ strategy this has a few implications:

  • Web analytics products may report more than one resource, with the traffic divided amongst the different versions.
  • Visitors may get confused
  • SEO activity may diffuse the impact of efforts over many pages
  • Search engines may de-list one or more of the duplicates
  • Even the small matter of a trailing slash (“/”) on a URL may cause problems where the website will return a resource with or without the slash present. […]/my-products/ and […]/my-products are different URLs, possibly for the same resource.
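As an illustration of the idea, here is a minimal Python sketch that maps several variants of a URL onto one canonical form. The list of default document names and the decision to drop the trailing slash (rather than to add it) are assumptions of mine; what matters is that one form is picked and every other form redirects to it.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these file names all serve the same content as their directory.
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "default.htm"}


def canonical_url(url):
    """Map equivalent URL variants onto a single canonical URL."""
    parts = urlsplit(url)
    path = parts.path
    head, _, last = path.rpartition("/")
    # /products/index.html is treated as the same resource as /products/
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    # Drop the trailing slash everywhere except on the site root.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    if not path:
        path = "/"
    # Host names are case-insensitive; fragments are never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
```

A web server or application could compare the requested URL with the result of a function like this and issue a redirect whenever the two differ.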

One resource. One URL.

Simple on-site search

This is a simple one. If a simple keyword-based search is performed on the site itself, make sure the search is submitted with a GET rather than a POST request. To test if this is happening, make a simple search and look at the URL in the browser.

Note the lack of a query string in the following example of a POST-based search (which we’re suggesting isn’t great):

http://www.example.com/searchresults.aspx

Whereas the following is more useful:

http://www.mysite.com/searchresults.aspx?q=my%20search%20term

You can see there is a ‘q’ parameter, presumably standing for ‘query’. The term that is being searched for is ‘my%20search%20term’. The spaces have been URL-encoded as ‘%20’ (this is okay), which means the actual search term is ‘my search term’. This URL encoding is trivially decoded by analytics products and isn’t anything to worry about.
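To show how little work the decoding is, here is a short Python sketch using the standard library. The function name and the default parameter name ‘q’ are assumptions for illustration; a real site may use a different parameter.

```python
from urllib.parse import urlsplit, parse_qs


def search_term(url, param="q"):
    """Return the decoded search term from a GET search URL, or None."""
    query = urlsplit(url).query
    # parse_qs decodes percent-encoding such as %20 for us.
    values = parse_qs(query).get(param)
    return values[0] if values else None
```

For the GET example above this returns ‘my search term’, whereas for the POST example there is no query string to read and it returns None.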

The usability bonus from a GET-submitted search is that a user can press ‘back’ in the browser and not be asked if they want to re-submit the request.

Where there’s a ‘filter’ search feature this gets more complex. There are likely to be all sorts of parameters introduced over POST or GET requests, any and all of which are going to be of use from an analytics perspective.

Adding Page tags

A page tag, in this case, is a small section of code that should appear on every page of a website and that references a file (either on the site itself or hosted elsewhere). Page tags can do all sorts of wonderful things, but first you need to make sure that this code is present on every page.

In order to get this onto every page, perhaps the same for every page, perhaps changing on a per-page basis on some or all pages, you need to have some form of general ‘footer’ that it can be added to. This may be in the form of templates (so long as there aren’t too many) or a single footer. If you are relying on a common footer on all pages, make sure that there aren’t any special pages that don’t use it (even if they look like they do). The home page, search pages or any transactional page are likely candidates to watch for here.

In the case of product catalogues, shopping carts or any other instances of more complex functionality on a site that you’re interested in measuring, it is possible that they will not share the same footers/templates, so the issue of instrumenting these areas of a site should be considered.

Sensible database to look-up further information

It may not always be practical to record all information regarding a request, or the person that makes it, at the time the request is made. The information may be too complex or too large, making it available may impact the performance of the application/site itself, or the information may be too sensitive to make visible to the customer (for example the profit that an organization makes from a given item or transaction).

When this is the case, looking this information up after the event has occurred may be necessary or desirable. There are three methods to make this information available to the analytics environment: enriching the information before the analytics tool has loaded it, having the tool look the information up, or integrating the information from the analytics tool with some other source after the tool has loaded it.

In any of the above cases it may be necessary to take some element of information from the customer-facing application, such as the URL, product SKU, username, or transaction ID and use it to determine something from the back-end environment, such as a page name, product value, user address or profit margin for a particular transaction.
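A minimal sketch of the first method (enriching a record before the analytics tool loads it) might look like the following in Python. The record layout, the ‘sku’ key and the lookup table contents are all hypothetical; in practice the lookup would come from the back-end database.

```python
# Hypothetical back-end data keyed by product SKU. None of this is visible
# to the customer; it is only joined in after the request has been recorded.
PRODUCT_DATA = {
    "SKU-001": {"product_name": "Wax polish", "profit_margin": 0.35},
}


def enrich(record, lookup):
    """Return a copy of the record with any back-end fields for its SKU added."""
    extra = lookup.get(record.get("sku"), {})
    # Records with an unknown or missing SKU pass through unchanged.
    return {**record, **extra}
```

The same pattern applies to the other keys mentioned above, such as a URL, username or transaction ID.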

Tracking Marketing activity

It is typical to add parameters to the URLs of landing pages to indicate where the traffic originated and which marketing effort was responsible for the traffic (and resultant activity). One URL that’s appeared in the press recently is: http://www.example.com/article.do?catId=1&contentId=7091&wacam=ImportantCampaign&wasrc=Google&wagrp=Non_Branded_General&waadv=General&wapkw=key_word_here (obscured, of course). Now, this is a horrible URL to start with, but in order to extract information regarding the activity that caused the request, the wacam, wasrc, wagrp, waadv and wapkw parameters have been added. This is great from a measurement perspective as it allows us to attribute the activity, but at the cost of breaking the ‘One resource. One URL’ rule. This may not matter too much from an analytics perspective as most tools have some means of coping, but there will be a few other side-effects.

  • This URL, when passed around, will result in activity being recorded against one campaign that didn’t result from that campaign (one reason why I’ve obscured the link).
  • If this URL is copied and re-posted (as it was) there may be some dilution of the SEO benefits of these links.
  • It looks messy and has the potential to confuse.
  • It exposes (very visibly) your internal campaign structure to the outside world.

Solution to URLs cluttered with campaign tracking codes

One solution that I’ve seen to this issue is to send people initially to the URL with the tracking parameters in place. Record the parameters (using, for example, a session cookie), redirect the visitor (HTTP 301 or 302; make sure you’re happy with the benefits of each) to a clean, canonical URL, and then either present the parameters (extracted from the session cookie) to the page tag (if in use) or log them to the webserver logs. The web analytics solution should then be able to read the values and present the same results, but without most of the negative side-effects of leaving the tracking in the visible URL.

There are two negative implications of this solution, other than the effort required. Firstly, there is the overhead of the additional redirect. This has the potential to extend the time taken for the visitor to see the first page by a small amount, with the obvious consequences. The more serious issue is the requirement for a means of passing the campaign tracking codes over the redirect, something that begs for an HTTP cookie, but the recent anti-cookie legislation makes this more problematic.
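The approach described above can be sketched roughly as follows in Python. This is not the implementation I saw, just an illustration under some assumptions: the tracking parameter names are taken from the example URL earlier, while the cookie name ‘wa_campaign’, the function names and the choice of a 302 are mine.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, quote

# Tracking parameters from the example URL earlier in this article.
TRACKING_PARAMS = {"wacam", "wasrc", "wagrp", "waadv", "wapkw"}


def split_tracking(url):
    """Return (clean_url, tracking) where tracking holds the campaign values."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    tracking = {k: v for k, v in pairs if k in TRACKING_PARAMS}
    kept = [(k, v) for k, v in pairs if k not in TRACKING_PARAMS]
    clean = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
    return clean, tracking


def redirect_response(url):
    """Return (status, headers) for the redirect, or None if none is needed."""
    clean, tracking = split_tracking(url)
    if not tracking:
        return None
    # No Expires or Max-Age attribute, so this is a session cookie.
    # The cookie name 'wa_campaign' is hypothetical.
    cookie = "wa_campaign=" + quote(urlencode(tracking), safe="") + "; Path=/"
    return 302, [("Location", clean), ("Set-Cookie", cookie)]
```

The page tag or the server logging on the clean URL would then read the cookie back to recover the campaign values.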

Conclusion

Crikey, I haven’t written a ‘conclusion’ for over a decade. How frightening is that? Anyway, the above are some things that can make a web analytics deployment more or less difficult. All of the problems can be overcome, sometimes with more £££, sometimes just with hard work or ingenuity either from us or the customer.

And there it ends for now. Where should I take this next? What did I miss, and what haven’t I corrected in light of all the changes in the previous four years?

I first started writing this article in 2008 – the ink of the reminder to write it has long since dried on the whiteboard by my desk. It was originally requested by Kassie Siwo-Gasa when she was working for her previous employer – so, even though it’s no longer needed I really ought to dedicate it to her and promise not to take as long next time.