How website tracking should work

Week 98 was posted by Charanjit Chana on 2019-09-09.

I've been thinking about website tracking for a while now and it's something that I have yet to find a suitable solution for here at 1 Thing A Week. Our digital privacy is so important and yet it's traded openly and freely. We are the commodity and we're all too willing to give it all up for free and unlimited access.

I remember a time when Google's AdSense platform scanned your content so that they could serve relevant ads, and I found this to be a pretty unobtrusive and "fair" way to monetise my content. Over the years, and thanks to a number of acquisitions, Google started displaying targeted ads instead, which now means you are the target rather than my content. They're not alone and it's only getting worse. The same is probably true of Gmail (assuming Google weren't actually evil from the very beginning), where emails were scanned to deliver relevant ads. Now they also scan emails to present a TL;DR summary for things like online orders, so the line is blurred even more. It's a great feature, but at what cost?

The solution

I really think there's a solution to handling tracking better, if privacy is taken seriously enough. Now, one sticking point is where the data should be held, but perhaps that's where someone like the Mozilla Foundation comes in. We trust Google (and others) with our tracking data and, unfortunately, that trust is an aspect of wanting to understand visitor numbers that cannot be avoided.

So assuming that we have a trusted third party, let's call them 3P, here's how website tracking should work:

Browsers should anonymously share the data with 3P, where website owners can verify that they own a domain and retrieve the figures. Verification could be handled in the same way that Google and others already do it, either with a special domain record or a particular file with a unique name and verifiable content uploaded to the site.
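As a minimal sketch of that verification step, 3P could derive a token per domain and owner, ask the owner to publish it in a DNS record or an uploaded file, and then compare what it finds. The secret, the "owner_id" value and both function names are assumptions for illustration, and fetching the record or file is left out.

```python
import hashlib
import hmac

# Hypothetical secret held only by the trusted third party (3P).
THREEP_SECRET = b"example-3p-secret"


def verification_token(domain: str, owner_id: str) -> str:
    """Derive the token 3P would ask the owner to publish for this domain."""
    message = f"{domain.lower()}|{owner_id}".encode("utf-8")
    return hmac.new(THREEP_SECRET, message, hashlib.sha256).hexdigest()


def is_verified(domain: str, owner_id: str, published_value: str) -> bool:
    """Compare the value found in the DNS record or file with the expected token."""
    expected = verification_token(domain, owner_id)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, published_value.strip())
```

Because the token depends on both the domain and the owner, a token copied from one site would not verify another.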

Apple, Mozilla, Google, Microsoft, Opera and anyone else that builds a web browser should hit an endpoint exposed by 3P on every page load with the following:

Current page's URL + title
The URL of the page that the user navigated from
Browser name + version
Location (if user has opted in)

Now, this would give you enough info to store page views, but not a unique visitor count. To counter that, an encrypted ID could be assigned to each browser, something that users could refresh at any time or that is regenerated on launch. The ID could be generated using the following data:

Browser name + version
Unique ID generated by or assigned to the browser and saved until the user refreshes it manually, re-launches the browser or at a certain interval (hourly, daily, weekly)
An encryption key or by salting the ID
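The refresh rules for the unique ID could look something like the sketch below. The class name, the 16-byte random value and the daily default are assumptions; regenerating on launch would simply mean creating a new object when the browser starts.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BrowserId:
    """A random ID that is replaced once it is older than max_age_seconds."""

    max_age_seconds: float = 24 * 60 * 60  # assumed default: rotate daily
    value: str = field(default_factory=lambda: secrets.token_hex(16))
    created_at: float = field(default_factory=time.time)

    def refresh(self, now: Optional[float] = None) -> str:
        """Replace the ID immediately, as a manual 'reset' option might."""
        self.value = secrets.token_hex(16)
        self.created_at = time.time() if now is None else now
        return self.value

    def current(self, now: Optional[float] = None) -> str:
        """Return the ID, regenerating it first if the interval has elapsed."""
        now = time.time() if now is None else now
        if now - self.created_at >= self.max_age_seconds:
            self.refresh(now)
        return self.value
```

Setting max_age_seconds to 3600 or 604800 would give the hourly or weekly intervals instead.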

Here's a crude example of what would be created if the following bit of text (browser name and version number (build in brackets), followed by a unique ID) was encrypted, using the phrase "100 Things A Week" as the key:

Safari 12.1.2 (14607.3.9) 8237598237423

This would be converted from the above text into a unique identifier like so:

Oqh9Jfw9FHLczdi+7wUIq2G4eg8jvRiwRjDb2BhxjTcS3HjJWluL8MJjz1rc8+Ja
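As a rough sketch of the same idea, and not the encryption that produced the string above, a keyed hash could turn the same text and key into an opaque identifier. HMAC-SHA256 is swapped in here on the assumption that the identifier never needs to be decrypted, so the output will not match the example string. The function name is hypothetical.

```python
import base64
import hashlib
import hmac


def visitor_identifier(browser: str, unique_id: str, key: str) -> str:
    """Combine browser details and the unique ID into one opaque identifier."""
    message = f"{browser} {unique_id}".encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), message, hashlib.sha256).digest()
    # Base64 keeps the identifier short and safe to send as text.
    return base64.b64encode(digest).decode("ascii")


identifier = visitor_identifier(
    "Safari 12.1.2 (14607.3.9)", "8237598237423", "100 Things A Week"
)
```

The same inputs always give the same identifier, while a refreshed unique ID or a different key gives an unrelated one.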

So every page view I registered while my browser's personal ID was valid would be registered against this identifier. This is probably not far off what Google do already, but relying on JavaScript to handle the data isn't perfect. I've seen huge, false spikes in traffic thanks to bots that have somehow triggered the tracking code. This is avoided when the browser is in control, rather than each site having to decide what works best for them.
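On 3P's side, the difference between page views and unique visitors could be worked out roughly like this. The event shape, including an "id" field carrying the browser's identifier, is an assumption for illustration.

```python
from collections import defaultdict


def summarise(events):
    """Return {url: (page_views, unique_visitors)} for a list of event dicts."""
    views = defaultdict(int)
    visitors = defaultdict(set)
    for event in events:
        views[event["url"]] += 1
        # A set ignores repeat views from the same identifier.
        visitors[event["url"]].add(event["id"])
    return {url: (views[url], len(visitors[url])) for url in views}
```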

What about AJAX and interactions?

Just another bit of data that needs to be captured: is this a new page or a partial load of content? As for interactions, it sounds to me like a new attribute is required for HTML that will track clicks on any element it is assigned to:

<button track-click="true" track-title="my interaction">
Track this button
</button>
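To illustrate how little is needed to find those elements, here is a small sketch that reads the proposed attributes out of a piece of markup using Python's standard HTML parser. The class name is hypothetical and a real browser would react to the click itself rather than scan the markup.

```python
from html.parser import HTMLParser


class TrackedElementFinder(HTMLParser):
    """Collect (tag, title) pairs for elements that opt in to click tracking."""

    def __init__(self):
        super().__init__()
        self.tracked = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("track-click") == "true":
            self.tracked.append((tag, attributes.get("track-title", "")))


finder = TrackedElementFinder()
finder.feed(
    '<button track-click="true" track-title="my interaction">'
    "Track this button</button><button>Not tracked</button>"
)
```

After feeding the markup, finder.tracked holds one entry for the opted-in button and nothing for the plain one.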

Any clicks on this button would be tracked, which now gives us three possibilities for what is passed to 3P:

1. Natural navigation

{
    "type": "natural",
    "url": "https://www.1thingaweek.com/market",
    "title": "Market",
    "referer": "https://www.1thingaweek.com",
    "browser": "Safari 12.1.2 (14607.3.9)",
    "location": "London, UK"
}

2. AJAX call

{
    "type": "ajax",
    "url": "https://www.1thingaweek.com/metadata.json",
    "referer": "https://www.1thingaweek.com",
    "browser": "Safari 12.1.2 (14607.3.9)",
    "location": "London, UK"
}

3. Interaction

{
    "type": "interaction",
    "title": "interaction name",
    "referer": "https://www.1thingaweek.com",
    "browser": "Safari 12.1.2 (14607.3.9)",
    "location": "London, UK"
}
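The three payload shapes above could be produced by a single helper along these lines. This is only a sketch: the function name and its validation rules are assumptions, based on which fields appear in each example.

```python
import json
from typing import Optional

BROWSER = "Safari 12.1.2 (14607.3.9)"  # example value from the payloads above


def build_payload(kind: str, referer: str, url: Optional[str] = None,
                  title: Optional[str] = None,
                  location: Optional[str] = None) -> dict:
    """Build a natural, ajax or interaction payload as a plain dict."""
    if kind not in ("natural", "ajax", "interaction"):
        raise ValueError(f"unknown payload type: {kind}")
    if kind in ("natural", "ajax") and url is None:
        raise ValueError(f"{kind} payloads need a url")
    if kind in ("natural", "interaction") and title is None:
        raise ValueError(f"{kind} payloads need a title")
    payload = {"type": kind}
    if url is not None:
        payload["url"] = url
    if title is not None:
        payload["title"] = title
    payload["referer"] = referer
    payload["browser"] = BROWSER
    if location is not None:  # only present when the user has opted in
        payload["location"] = location
    return payload


body = json.dumps(build_payload(
    "interaction", "https://www.1thingaweek.com", title="interaction name"))
```

Leaving the location out entirely when the user has not opted in keeps the payload honest about what was shared.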

For the majority of website owners and administrators this is more than enough information and could help rid the web of unnecessary JavaScript file inclusions and processing. I really don't need all the data that Google captures about you.

And what about legacy browsers?

Personally, it's a hit I'd be willing to take.

According to StatCounter, legacy browsers don't even register. For users' own safety online they shouldn't be using them anyway.

Right now, it's a dream but it doesn't feel like something that is out of reach.

Data consumption

I would hope an approach like this would cause less data to be sent and received, but if this was all in a browser's control they could offer an option to cache the data in the app and only send it when you're connected to Wi-Fi. Real-time stats would suffer but, with a big enough grace period, it wouldn't matter.
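A minimal sketch of that caching option is below. The class name, the "on_wifi" flag and the "send" callable standing in for the real network call are all assumptions.

```python
import json


class PendingEvents:
    """Hold tracking events locally and send them in one batch when allowed."""

    def __init__(self, send):
        self._send = send  # hypothetical transport: a callable taking a JSON string
        self._queue = []

    def record(self, event: dict, on_wifi: bool) -> None:
        """Store the event and flush straight away if the connection allows it."""
        self._queue.append(event)
        if on_wifi:
            self.flush()

    def flush(self) -> int:
        """Send everything queued so far and return how many events were sent."""
        if not self._queue:
            return 0
        self._send(json.dumps(self._queue))  # if this raises, the queue is kept
        sent = len(self._queue)
        self._queue.clear()
        return sent
```

Batching like this means one request carries many page views, which is where the saving in data would come from.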

Alternatives

I had been trialing Plausible, which I really liked. I was a beta tester, but it was missing one feature I needed because of the way I treat weekly posts. I don't show an aggregate of content on the homepage; instead of https://www.1thingaweek.com, the address is technically https://www.1thingaweek.com/week/98/how-website-tracking-should-work, and in Google Analytics I have control over this. Plausible doesn't offer that yet, and without it it's almost impossible to gauge interest in a topic while it's the current week's article, but I would say for most people it would be perfect.

There are self-hosted tools, but they've not been accurate at all, either in the number of visits or visitors.

So right now, I'm still looking for the right tool.

Tags: development, privacy, tracking