Introduction: Analytics on internal systems, and why it’s important!

The modern web is all about numbers, statistics, getting feedback fast and adapting to that feedback – making the experience our users have when visiting our sites better, and having them come back more often, which translates into sales and conversions. We advocate to our clients the crucial need to have a good analytics toolkit installed on their system: something that measures page views, events, technology, referrals, goal funnels and many more metrics. Each of these metrics is a necessary and very useful part of the “big picture” we build of our site, how people are using it, and what to do (and not to do) in order to guarantee success.

Here’s the key question, though: Why do so many companies forget that their internal systems also have users?

By users, I mean mere mortal human beings who have opinions about the software, get into bad (and good) habits, prefer certain technologies and use our systems in their own way, as opposed to the way we thought or intended them to be used.

At Afrihost we have internal systems just like any other organisation. We are in the lucky position that our internal systems are all web-based (well, most of them anyway). Having web-based internal systems means, quite frankly, that putting something like Google Analytics, Piwik, Clicky or Open Web Analytics – to name but a few – on your system is very easy and an absolute must.
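
How easy? For Google Analytics it boils down to a handful of lines of JavaScript in your page template – roughly the standard async snippet below (the exact snippet depends on which tool and which version you install, and UA-XXXXXX-Y is a placeholder for your own property ID):

    <script>
      // Roughly Google's standard async analytics.js snippet; Piwik, Clicky and
      // friends have equally small equivalents.
      window.ga = window.ga || function () { (ga.q = ga.q || []).push(arguments); };
      ga.l = +new Date();
      ga('create', 'UA-XXXXXX-Y', 'auto'); // your own property ID goes here
      ga('send', 'pageview');
    </script>
    <script async src="https://www.google-analytics.com/analytics.js"></script>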

Why?

Answering the “why” is very difficult, and different for every implementation. It’s different because each implementation of analytics tracking tries to answer a specific question. Once the question is answered, something is done about the answer (we either change something, or we pat ourselves on the back for a job well done – the former is more often the case). Then we move on to ask the next question. Often, however, we don’t know the question until we see a problem staring us in the face – and we can only see the problem if we track it somehow. Let me explain a few of these “whys”, “problems” and “answers” by way of example:

Know what technology people use without asking!

When we initially started our “intranet” or “internal system”, we went through the same pains most web software development groups go through: Why isn’t this working in IE? Why does it look this way in Firefox but all broken in Chrome? Why do we even bother with Safari? Wouldn’t it be cool if we could just stick with one platform? We then realised that we actually had to decide on a single web platform to use: it was going to be less costly to have everybody standardise on one platform (at the time, Firefox) than to spend hundreds of hours in software development planning and developing for multiple browsers. (Remember, this was the time of IE 6 and 7, which were crap to say the least – and that’s actually giving them a compliment.)

Lucky for us, we had put Google Analytics tracking into our internal system a few weeks before that realisation – not because we thought we needed Google Analytics, but purely because one of the developers thought it was really cool to see all the pretty graphs and stuff (true story!).

We went into Google Analytics and saw, to our surprise, that the majority of users actually did use Firefox… and the ones who used IE were on the newer IE7, which was less of a tyrant than IE6. This made the decision to standardise everybody on Firefox much easier, because we could take the stats to our bosses and say: “See, most of the company is on Firefox anyway, why don’t we just standardise on that?”. There was the usual pushback from the IE die-hards, but each time we brought out the stats and the pretty pie charts (updated, of course, after each convert) they buckled under the pressure of pure statistical awesomeness and went with the group to Firefox.

This saved us hundreds if not thousands of hours and a lot of money in compatibility development. Sure, you can’t do this as aggressively and suddenly with external systems as we did here, but we’re not talking about external systems – we’re talking about internal systems that only internal staff see and only internal staff use. Nobody’s going to say “I’m going to stop buying your stuff because you don’t support IE version 0.9”.

Another example is when we needed more screen real estate. At the time, believe it or not, we were designing and working hard to push everything into 800×600. The perception among the team was that 800 pixels wide was the standard we should go for… here and there we were “brave” and went with 1024 wide – but we believed 800 wide was set in stone because of a simple fact of human nature: an adorable and widely loved older lady who worked in one of our sections, who wasn’t extremely tech savvy but commanded the respect and love of all the developers, used 800×600. Her eyes weren’t so great, so she went with 800×600 because she could see the stuff on the screen better. She was also very, very vocal about anything where the font was too small or she had to scroll – so human nature led us to believe, in error, that a lot of people in the company used 800×600 and that this was a really important line not to cross (because nobody wanted to make her unhappy).

We went into GA and pulled the stats on how many people used 800×600. Out of a company of about 80 staff members using the software, there was exactly one machine on 800×600. About 20% were on 1024×768… and the rest were all 1280 wide or greater. Boy, were we wrong!!

We stopped spending so much time trying to rework screen layouts to fit everything into 800×600, spent the money instead and bought the lovable dame a really, really big screen that made the fonts look bigger even at 1280 wide, and decided to standardise on 1280 wide as a minimum resolution. It saved a lot of money!

Know how people use your software without asking!

Soon the system grew, and the company grew in staff (and the staff turnover was faster than usual). We got to the point where you could no longer have a good understanding of everybody’s behaviour in the system. Heck, we didn’t even have the time to go around talking to the staff / users on a regular enough basis to understand their usage patterns.

By now we had realised that Google Analytics had answered questions for us before, so we went into GA and started looking at the top page views. Our system is very ajax-driven, so we simulated page views by hooking into jQuery’s ajax system and making an extra call to GA to record a “virtual” page view, with whatever criteria we decided on to identify the different modules in our system. A side-effect of a rapidly growing internal system is a lack of cohesion. That’s unfortunate, but I’ve come to accept it over the years of developing these systems. A problem arises, however, when you have two systems doing the same thing with two different sets of code – one easy to maintain and the other hard to maintain.
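
Incidentally, the hook itself is tiny. A minimal sketch of the idea, assuming jQuery’s global ajax events and the analytics.js ga() function (how you map an ajax URL onto a “module” name is entirely up to you):

    // Report every successful ajax call as a virtual page view, so that each
    // module shows up as its own "page" in the analytics reports.
    $(document).ajaxSuccess(function (event, xhr, settings) {
      var virtualPage = '/ajax/' + settings.url.split('?')[0].replace(/^\//, '');
      ga('send', 'pageview', virtualPage);
    });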

There were two ways of passing a credit note in our system. One was hooked into all sorts of weird places in the code and was very complicated to maintain; the other was simple, elegant and, needless to say, extremely easy to maintain. To our surprise, the simple, easy-to-maintain code was the one used more often to do the same job. That led us to extract the bits we needed from the complex module, keep their design simple, remove the duplicated functionality, deprecate the visual element of the old one, and notify people that from then on there was only one way to do that task – which only affected two people, who hadn’t realised the other way existed and were quite happy to change their ways.

Again… lots of hours saved that would otherwise have gone into maintaining a system that wasn’t being used!

Know how long certain systems take to run without asking / waiting for somebody to complain!

By now we were looking at Google Analytics regularly – at least once every 2 or 3 weeks – definitely once a month! Then Google launched something that helped us tremendously – “Page Load Time”.

We were constantly seeing high load on either our database server or the web server running the system, but with a system that does 400+ queries a second on average (sure, there are peaks during billing runs, so let’s call it 50 or so per second), it was extremely difficult to figure out what was going on. We looked at (and fixed) a few slow queries, but the majority of the slow queries were running for, say, 2 seconds – and that’s nothing, right!?!

When Page Load Time had been live for about a week, the reason for our problem was as clear as day! We had a view that showed a client’s financial record with us. For the most part there are only about 20 or so lines to show, as we show only the last 3 months – if you want, you can change the dates and view the rest. However, here and there you’d find a client who has not just one or two products with us, but hundreds. For these clients there were hundreds of financial transactions created by the monthly billing scripts alone, not even counting the ad-hoc purchases made via our online gateways. Page Load Time showed clearly that the average load time of that view was less than 5 seconds, but every now and then it spiked to hundreds of seconds. I remember one entry recording 1200+ seconds (yes, that’s 20 minutes… and yes, our Apache allowed that… and yes, that’s stupid… but… anyways).

Realising that this view was causing problems didn’t show us exactly what the problem was, but it did show us where to look. On further inspection it turned out that this view was doing an SQL query for every line it showed… remember those queries that I said only ran for 2 seconds? Multiply that by hundreds for a single view.
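
In other words, the classic “N+1 query” pattern. A purely hypothetical sketch of the shape of the problem (our real code was server-side, and fetchTransactions / fetchLines stand in for whatever database helpers you use):

    // One query for the list, then one extra query per row – fine for the usual
    // 20 lines, catastrophic for a client with hundreds of transactions.
    async function showFinancialRecord(clientId) {
      var transactions = await fetchTransactions(clientId);            // 1 query
      for (var i = 0; i < transactions.length; i++) {
        transactions[i].lines = await fetchLines(transactions[i].id);  // +1 per row
      }
      return transactions;
    }
    // The fix: pull all the lines in one set-based query (a JOIN or an IN (...)
    // lookup) instead of one query per transaction.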

We fixed the problem in a few days, and our database administrators are to this day still buying us beers as thanks (well, not really, but I can dream!).

The list of problems we picked up using this method is endless; I could spend the whole day showing how things were before we had a look at Page Load Time and how they’ve changed since.

Find weird errors you can’t replicate without sitting with the user for infinity!

This is a tough one, because most errors you can’t replicate are, well, just that… extremely tough to duplicate or find in the first place. But if you use a tool – any tool – and find even one of these without having to sit with the user watching them work for infinity, hoping you’ve captured the right metrics when the error occurs to know what happened, then it’s time and money well spent.

Case in point: we had a problem where sometimes our system would return a blank page. It wasn’t on the same ajax action every time, it wasn’t at the same time of day, and there seemed to be nothing in common from what we could see by just looking at the odd email we got about it every week or week and a half. This is where human nature started to play a role. It turns out this error happened a lot more often than we thought, but because the humans using the system had got so used to it and were so busy working with clients, instead of typing us an email explaining exactly what they did, they’d rather just click again – and then it worked (most of the time… sometimes it took two or three clicks, but never more than, say, five).

We added another callback to check for an empty result when doing an ajax call, and each time that happened we sent a distinct page view to Google Analytics – something like “weird_empty_result_occurred.php”, with a bit of extra information we thought we might need appended to the end of the simulated page view.
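
A rough sketch of what that callback can look like, under the same assumptions as before (jQuery’s global ajax events and the analytics.js ga() function – the “page” name and the extra information are whatever you think you’ll need):

    // Record every empty ajax response as its own virtual page view, so the
    // blank-page errors show up in the analytics without anybody emailing us.
    $(document).ajaxSuccess(function (event, xhr, settings) {
      if (xhr.responseText === '') {
        ga('send', 'pageview',
          '/weird_empty_result_occurred.php?url=' + encodeURIComponent(settings.url));
      }
    });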

Within mere days we saw a pattern. The errors occurred on the hour, every hour, from around 08:00 in the morning up to around 16:00 in the afternoon – and of course on and off during the night. This pointed us to our background scripts, and we realised there was one script running every hour that touches the accounting / financial tables in the database. Looking at the Google Analytics data, it dawned on us that the error occurred only when trying to insert or update something on the financial side of the system. The penny dropped: our script was locking up the tables. The fix was rather easy, and it was a problem we would otherwise have been hunting for weeks on end, spending hundreds or thousands of hours on it. We didn’t. We saved time. We saved money.

Identifying reasons for call centre floods without asking!

Any company, especially a public-facing, service-oriented company (like an ISP), gets days where all hell breaks loose and nobody knows exactly why. A nifty side-effect of having internal analytics (especially long-term telemetry) is that you can spot which sections of the system your support staff are suddenly using more of. Take, for example, a mess-up where the wrong invoices are sent out and people think they’re being over-billed.

When you have sent out the wrong invoices via email, you can bet your bottom dollar the financial section of the system is going to get hit hard – both in your public-facing online client zone and in your internal client management system. Imagine the scenario where everything was calm up to 10:00 in the morning, and suddenly the phones start ringing off the hook, with 10 people waiting in the queue to talk to somebody for every support agent you have working in your company. Nobody has the time to stop, stand up and figure out what the heck is going on.

By looking at the real-time analytics, you see that there is normal usage of the network testing systems, normal usage of the new-client fraud check systems, normal usage of just about everything – except for the financial system, and specifically the part dealing with invoices. It doesn’t take a genius to figure out that something is wrong, directly or indirectly, with invoices and people’s money. You then remember that the invoice email script kicks off at 10:00 each morning, put two and two together, and after checking one or two invoices realise that you multiplied every cent by, say, 1000 instead of 100 – oops.

Quick, send out an apology email, put something on the hold-system for the phones, send text messages out, anything to calm the mob!

A variation of this happened to us (luckily not something stupid like the wrong multiplier, but something equally small yet massive in impact). Twitter and Facebook were on fire; we were being attacked on all fronts. When those first few hundred corrective emails and text messages hit, our clients (the mob) started relaying the same message on Twitter, Facebook and other mediums, meaning some people heard or read about the problem – and that we were fixing it – even before they received their text message. What would have been an absolute PR nightmare turned into something positive. Messages of thanks for the communication started to stream in. A bunch of “wow, you guys really responded fast to the problem, you da best” messages flew around. We even had a few clients stand up for us and get really upset with the usual suspects who just hated us for the sake of hating (we all know those clients).

There’s no way of counting how much money having the analytics saved us that day – even if it only saved us a few minutes in figuring out what exactly was going on – but you can count on it being a lot of money! (Especially in terms of future sales, lost sales and cancellations.)

Identifying misuse of your precious system without prowling around with hawk eyes!

We once had a heavy report that was made available only to top management (because it was so heavy on the DB). We discussed making some updates to the database design – creating some redundancy in the information we save – in order to make the report run faster. But because this report was only supposed to be run once or twice a week and the amount of data wasn’t that much, we put off the (costly) changes. Sound familiar? Yeah, we all do these things. A year and a half later, a campaign was run to get more sales in a certain area of the business – an area where this report is crucial to showing whether the campaign is successful or not.

A junior developer did the small project – it took him two days – and it was launched. Then, suddenly, servers caught fire and satellites fell out of orbit onto our servers. (Sure, it wasn’t that bad, but that’s how it felt when we didn’t know WTF was going on.) There were multiple projects running at the same time, rollouts on a daily basis, and we didn’t know which section of code caused the problems. Turns out, none of the new code caused the problem:

We soon realised (again, you predicted it) by looking at Google Analytics that the report I mentioned above – the one that runs large SQL queries and does a whole bunch of complex calculations – was running often. In fact, it was running every 60 seconds!!! It turns out a bunch of people with a vested interest in the sales campaign had been given access to the report a few months back, and that day they were refreshing it every now and then – “not that often, every 10 or so minutes” was one of their responses. Problem is, there was a whole team of people all doing the same thing.

A quick “stop it!” was all that was needed to get rid of the high load on the server. Sure, we made some tweaks, put in some restrictions on how frequently the report can be run, cached the results, and some more cool stuff. The fact is, once again, having analytics on the internal system saved us from buying a bigger server – or going to whatever other lengths we would have gone to – because users who didn’t know any better (and that’s fine!) were abusing / misusing a system that wasn’t intended to be hit so hard.
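
For what it’s worth, the caching part of those tweaks is only a handful of lines. A purely hypothetical sketch of the idea (runExpensiveReportQuery stands in for the heavy SQL and calculations; our real implementation lived server-side):

    // Serve a cached copy of the heavy report for a few minutes, so a whole team
    // refreshing it only costs one expensive run.
    var cachedReport = null;
    var cachedAt = 0;
    var TTL_MS = 10 * 60 * 1000; // re-run the report at most every 10 minutes

    async function getSalesReport() {
      if (cachedReport && Date.now() - cachedAt < TTL_MS) {
        return cachedReport; // cheap: no database work at all
      }
      cachedReport = await runExpensiveReportQuery();
      cachedAt = Date.now();
      return cachedReport;
    }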

Security. No really… analytics = information, and information = power in terms of security.

Lastly, a fun one. We’re based in South Africa, and as with all internal systems one needs to make sure that security around the system is rock solid. We had locked the system down to a VPN (and before that, to the IP address of the office). When we installed Google Analytics for the first time, we played around with it and set up triggers to email us about geography, page views and whatnot.

Years later (about 4 years later, actually), we were tight on support staff. We decided to outsource overflow work to a reputable company in India. I was unaware of this at the time, as we were busy with other projects and I hadn’t been briefed.

A VPN was set up for them, and they started to provide basic support while they were trained on our products and so on.

Next thing you know, I got the fright of my life… “Oh no! We’re compromised! Our internal system is being accessed from India!!”. I frantically locked down the system – clearly overreacting. Minutes later, after the Indian company complained that they couldn’t get into the system, I was briefed on the situation.

It was funny, and I actually caused more hassle than good – but what if this had been a real attack on our system? If it had been real, I would have saved the day… no… wait… having analytics would have saved the day.

Conclusion

Use analytics. It’s great. True story!

Seriously, there are so many examples – things that we take for granted today – that show how having some kind of analytics system helps you make your internal systems better, helps you protect them, and generally just helps you figure out how people are using (and misusing) them.

You might have noticed that I put “without asking” at the end of most of the headings. Yes, there’s no reason why you shouldn’t be able to figure out most of these things in a company of 20 people. Heck, even in a company of 100 people it’s conceivably possible to do without these analytics. But soon you’ll reach a point where either geography becomes a problem, or the size of the organisation becomes a problem. When that day comes, you’ll thank your younger (and more intelligent) self for installing those three lines of JavaScript that put the analytics into your system.