Just a bit of a mind-dump that I thought of in the car this morning.
Let’s go back 15+ years. Imagine your spouse leaves for work, taking the same highway as you do. Ten minutes later you also leave for work… next thing you know you’re stuck in a huge traffic jam on the highway and, unbeknown to you, your spouse is a few km / miles in front of you, stuck in the same traffic. Way back then, this was not ideal, but acceptable, because we didn’t know any better.
Then cell phones became widely available. Same scenario: when your spouse gets stuck in the traffic, he/she can phone you and say “Hey, don’t take the highway, take the back roads.” That was awesome; now you can miss the traffic and at least one of the two people can get to work on time (or at the very least close to on time). The question is, what was the problem with this? Well, for starters: what if I didn’t know the back roads that well? That made the world come up with relatively cheap GPS devices that you can put in your car and navigate by when you don’t know the roads. The second problem was that the bandwidth of communication is extremely limited. It requires you, first off, to know somebody in the current traffic problem. Then it requires that person to know you well enough and know what your plans are ahead of time, and to take the time to phone / text you about the traffic problem. How did we fix this?
Phone-home-GPS. TomTom, I believe, was one of the first affordable GPS devices that had the phone-home capability, whereby the device uploaded route and speed information to a centralised server. Then everybody who sets a route intersecting your route is notified that this route is not ideal because “somebody else” drove it 15 minutes ago and got stuck in a traffic jam. This was awesome. It solved the bandwidth problem to an extent, because now everybody who has that same phone-home-GPS device gives you information that could save you time, without them knowing or even caring. It also solved the problem of not knowing the back roads, because, well, it’s a GPS device. Great! So what was wrong with that picture? The biggest problem I can see is still the bandwidth issue. The people I know, even today, with those types of GPS devices are few and far between. Sure, about 1 in every 20 people who own a GPS device has one of those, but of all the people I know only about 1 in every 5 actually has a GPS device. That means that only about 1 in every 100 people (1/5 × 1/20) using a specific road can actually help you. So, still a bandwidth problem.
Then came cell-phone-GPS. This truly fixed – almost – the bandwidth problem. Out of all the people I know, every single one has a cell phone capable of running a GPS app in the background, and let’s face it… all the free apps now upload traffic conditions to their servers (Google Maps and Waze being the most popular in our area). I said “almost” because there’s still a bandwidth problem even with everybody having some kind of GPS app on their phone. Google Maps doesn’t share its data with Waze or TomTom. And Waze and TomTom don’t share their info with Google Maps or each other. So, as you can see… if I’m using Waze, I’m limited to road users who are on Waze. It’s thus in my interest to choose the service that’s used the most by people in my area. It’s probably Waze or Google Maps (seeing as the latter comes pre-installed on all Android phones).
So, how do we fix the current bandwidth problem? (In the back of my mind I can see one of those “first world problems” memes about this).
I’m not entirely sure how we can fix the current bandwidth problem. One idea might be to get the companies to play nice and share information. Another might be to create an independent company that provides this centralised server to store traffic information on. Not sure all of them would want to play nice though. But, it’s not up to them, is it? It’s up to what the consumer wants – and face it, over the past few decades we’ve seen the consumer always gets what he wants (except not paying for stuff… oh… wait, these things are free, strike that!).
One thing I know for certain is that tomorrow is most certainly going to be better than today – at the very least when it comes to technology. The pressures on companies today to react fast and give astonishingly great products and services to consumers are immense. It’s a much higher pressure than a decade ago, or even a year ago. The reason for this is that we now have the technology to voice our dislike or frustration with a service or product on a global and massive scale. Twitter. Facebook. Tumblr. Name but a social network and it’s a medium for people to voice their opinions – and the companies that are going to be standing in a decade’s time are most certainly listening.
So, is your company listening? If not and you own the company, start listening. If you don’t own the company, have a chat with the owner and try to convince them to start to listen to the masses.
See you tomorrow – with a little bit of luck and a lot of confidence we’ll have even better tech and gadgets to talk about!
Introduction: Analytics on internal systems, and why it’s important!
The modern web is all about numbers, statistics, getting feedback fast and adapting to the feedback to make the experience that our users have when visiting our sites better, and having them come back to the site more often – which translates into sales / conversions. We advocate to our clients the crucial need to have a good analytics toolkit installed on their system. Something that measures page views, events, technology, referrals, goal funnels, and many more metrics. Each one of these metrics is a necessary and very useful part of the “big picture” that we build of our site: how people are using it and what to do (and not to do) in order to guarantee success.
Here’s the key question, though: Why do so many companies forget that their internal systems also have users?
By users, I mean mere mortal human beings who have opinions about the software, get into bad (and good) habits, prefer certain technologies and use our systems in their own way, as opposed to the way that we thought or intended them to use it.
At Afrihost we have internal systems just like any other organisation. We are in the lucky position that our internal systems are all web-based (well, most of them anyways). Having web-based internal systems means quite frankly that putting something like Google Analytics, Piwik, Clicky, Open Web Analytics – to name but a few – on your system is very easy and an absolute must.
To answer the “why” is very difficult and different for every implementation. It’s different because each implementation of analytics tracking tries to answer a specific question. Once the question is answered, something is done about the answer (we either change something or we pat ourselves on the back for a job well done – the former is more often the case). Then we move on to ask the next question. Often, however, we don’t know the question until we see a problem staring us in the face. We can only see the problem if we track it somehow. Let me explain a few “whys” or “problems” or “answers” by way of example:
Know what technology people use without asking!
Initially when we started our “intranet” or “internal system”, we went through the same pains that most Web Software Development groups go through: Why isn’t this working in IE? Why does it look this way in Firefox but in Chrome it’s all broken? Why do we even bother with Safari? Wouldn’t it be cool if we could just stick with one platform? We then realised that we actually had to decide on a single web platform to use: it was going to be less costly to have everybody standardise on one platform (at the time Firefox) than to spend hundreds of hours in software development to plan and develop for multiple browsers. (Remember, this was the time of IE 6 and 7, which were crap to say the least – and that’s actually giving them a compliment.)
Lucky for us, we put Google Analytics tracking into our internal system a few weeks before that realisation – not because we thought we needed Google Analytics, but purely because one of the developers thought it was really cool to see all the pretty graphs and stuff (true story!).
We went into Google Analytics and saw to our surprise that the majority of users actually did use Firefox… and the ones who used IE were on the newer IE7, which was less of a tyrant than IE6. This made the decision to standardise everybody on Firefox much easier because we could take the stats to our bosses and say: “See, most of the company is on FF anyways, why don’t we just standardise on that?” There was the usual pushback from the IE die-hards, but each time we brought out the stats and the pretty pie charts (updated, of course, after each convert) they buckled under the pressure of pure statistical awesomeness and went with the group to Firefox.
This saved us hundreds if not thousands of hours and money in compatibility development. Sure, with external systems you can’t do this as aggressively and suddenly as we did, but we’re not talking about externals here – we’re talking about internal systems that only internal staff are seeing and only internal staff are using. Nobody’s going to say “I’m going to stop buying your stuff because you don’t support IE version 0.9”.
Another example is when we needed more screen real-estate. At the time we were, believe it or not, designing and working hard to push everything into 800×600. The perception among the team was that 800 pixels wide was the standard we should go for… here & there we were “brave” and went with 1024 wide – but we believed 800 wide was set in stone because of a simple fact of human nature: an adorable and widely loved older lady who worked in one of our sections, who wasn’t extremely tech savvy but commanded the respect and love of all developers, used 800×600. Her eyes weren’t so great, so she went with 800×600 because she could see the stuff on the screen better. She was also very, very vocal about anything that had too small a font or made her scroll – so human nature misled us into thinking that a lot of people in the company used 800×600 and that this was a really important line not to cross (because nobody wanted to make her unhappy).
We went into GA and pulled the stats of how many people used 800×600. Out of a company of about 80 staff members using the software there was exactly 1 machine using 800×600. About 20% were using 1024×768… and the rest were all 1280 wide and greater. Boy, were we wrong!!
We stopped spending so much time trying to re-work screen layouts to fit everything into 800×600, and rather just spent the money: we bought the loveable dame a really, really big screen that made the font look bigger even at 1280 wide, and decided to standardise on 1280 wide as a minimum resolution from then on. It saved a lot of money!
Know how people use your software without asking!
Soon the system grew and the company grew in staff (and the staff turnover was faster than usual). We got to the point where we could no longer have a good understanding of everybody’s behaviour in the system. Heck, we didn’t have the time to go around talking to the staff / users on a regular enough basis to understand their usage patterns.
By now we had realised that Google Analytics had answered our questions before, so we went to GA and started to look at the top page views. Our system is very Ajax driven, so we simulated page views by hooking jQuery’s Ajax system to also make a call to GA to “simulate a page view”, with whatever criteria we decided on to identify the different modules in our system. A side-effect of a rapidly growing internal system is the lack of cohesion. This is unfortunate, but I’ve come to accept this over the years of developing these systems. A problem arises, however, when you have two systems doing the same thing, both using different sets of code – one being easy to maintain and the other being hard to maintain.
There were two ways of passing a credit note in our system. One was hooked into all sorts of weird places in the code and was very complicated to maintain; the other was simple and elegant and, needless to say, extremely easy to maintain. To our surprise, the simple, easy-to-maintain code was the one used more often to do the same job. That led us to extract just the bits that we needed from the complex module, make them simple in design, remove the duplicated functionality and deprecate the visual element of the complex one. We notified people that from now on there was only one way to do that task – which only affected 2 people, who hadn’t realised that the other way existed and were quite alright with changing their ways.
Again… lots of hours saved in maintaining a system that wasn’t being used!
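For what it’s worth, the comparison itself is nothing fancy. Here is a rough Python sketch of counting simulated page views per module. The paths and the `usage_by_module` helper are made up for illustration; they are not our real module names or the GA API.

```python
from collections import Counter

# Hypothetical virtual page paths, one entry per simulated page view.
page_views = [
    "/credit_note/simple",
    "/credit_note/simple",
    "/credit_note/legacy",
    "/credit_note/simple",
    "/invoice/view",
    "/credit_note/simple",
]


def usage_by_module(paths, prefix):
    """Count page views for every path that starts with the given prefix."""
    return Counter(p for p in paths if p.startswith(prefix))


# Compare the two ways of doing the same job.
credit_note_usage = usage_by_module(page_views, "/credit_note/")
most_used, hits = credit_note_usage.most_common(1)[0]
```

With numbers like these in hand, the decision about which of the two code paths to keep pretty much makes itself.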
Know how long certain systems run without asking / waiting for somebody to complain!
By now we were looking at Google Analytics regularly – at least once every 2 or 3 weeks – definitely once a month! Then Google launched something that helped us tremendously – “Page Load Time”.
We were constantly seeing high load on either our database server or our web server running the system, but with a system that does on average 400+ queries a second (sure, there are peaks during billing runs, so let’s call it 50 or so per second), it was extremely difficult for us to figure out what was going on. We looked at (and fixed) a few slow queries, but then the majority of the slow queries were running for, say, 2 seconds – that’s nothing, right!?!
When Page Load Time had been live for about a week, the reason for our problem was as clear as day! We had a view that showed you a client’s financial record with us. For the most part there are only about 20 or so lines to show, as we show only the last 3 months, and if you want you can change the dates and view the rest. However, here & there you’d find a client who has not only one or two products with us, but hundreds. With these clients there were hundreds of financial transactions created just by the monthly billing scripts, not even talking about the ad-hoc purchases done by them via our online gateways. Page Load Time showed clearly that on average the load time of that view was less than 5 seconds, but every now & then it spiked to hundreds of seconds. I remember one entry recorded 1200+ seconds (yes, that’s 20 minutes… and yes, our Apache allowed that… and yes, that’s stupid… but… anyways).
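The lesson here is that a typical value can look perfectly healthy while the worst case is horrible. A minimal Python sketch of that idea follows. The sample numbers and the "10 times the median" cutoff are assumptions for illustration, not our real data or anything GA does for you.

```python
from statistics import median

# Hypothetical load times in seconds for one view: mostly fast, one very slow.
load_times = [2.1, 3.4, 1.8, 2.9, 4.2, 1200.0, 2.5, 3.1]


def summarise(samples):
    """Return the typical (median) and worst-case load time."""
    return {"median": median(samples), "max": max(samples)}


def outliers(samples, factor=10):
    """Return the samples that are more than `factor` times the median."""
    cutoff = factor * median(samples)
    return [s for s in samples if s > cutoff]


summary = summarise(load_times)
slow_ones = outliers(load_times)
```

Looking only at the typical value would have told us nothing; it is the gap between typical and worst case that points at the view worth investigating.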
This realisation that this view was causing problems didn’t show us exactly what the problem was, but instead showed us where to look for the problem. Upon further inspection it appeared that this view was doing an SQL query for every line that it showed… remember those queries that I said only ran for 2 seconds? Multiply that by hundreds for the same view.
We fixed the problem in a few days and our database administrators are still to this day buying us beers as thanks (well, not really, but I can dream!).
The list of problems we picked up using this method is endless. I could spend the whole day showing how things were before we had a look at Page Load Time and how they’ve changed since.
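To make the "one query per line" problem concrete, here is a small self-contained Python sketch using an in-memory SQLite database. The tables, columns and data are hypothetical, and fetching everything with a single JOIN is just one common way to fix this pattern; it is not necessarily what we did in our system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY, client_id INTEGER,
        product_id INTEGER, amount_cents INTEGER
    );
    INSERT INTO products VALUES (1, 'ADSL'), (2, 'Hosting');
    INSERT INTO transactions VALUES
        (1, 7, 1, 19900), (2, 7, 2, 9900), (3, 7, 1, 19900);
""")


def lines_one_query_per_row(client_id):
    """The slow pattern: one extra query for every line that is shown."""
    rows = conn.execute(
        "SELECT id, product_id, amount_cents FROM transactions "
        "WHERE client_id = ? ORDER BY id",
        (client_id,),
    ).fetchall()
    queries = 1
    result = []
    for tx_id, product_id, amount in rows:
        name = conn.execute(
            "SELECT name FROM products WHERE id = ?", (product_id,)
        ).fetchone()[0]
        queries += 1
        result.append((tx_id, name, amount))
    return result, queries


def lines_single_join(client_id):
    """One joined query returns the same lines in a single round trip."""
    rows = conn.execute(
        "SELECT t.id, p.name, t.amount_cents "
        "FROM transactions t JOIN products p ON p.id = t.product_id "
        "WHERE t.client_id = ? ORDER BY t.id",
        (client_id,),
    ).fetchall()
    return rows, 1


slow_lines, slow_queries = lines_one_query_per_row(7)
fast_lines, fast_queries = lines_single_join(7)
```

Three lines cost four queries the slow way and one query the joined way. With hundreds of lines and 2 seconds per query, that difference is exactly the kind of spike we were seeing.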
Find weird errors you can’t replicate without sitting with the user for infinity!
This is a tough one, because most errors that you can’t replicate are, well, just that… extremely tough to duplicate / find in the first place. But if you use a tool, any tool, and find at least one of these without having to sit with the user, watching them work for infinity and hoping you have the right metrics when the error occurs to know what happened, then it’s time/money well spent.
Case in point: we had a problem where sometimes our system would return a blank page. It wasn’t on the same Ajax action every time, it wasn’t at the same time, and there seemed to be nothing in common from what we could see by just looking at the odd email we got about it every week or week and a half. This is where human nature started to play a role. It turns out this error happened a lot more often than we thought, but the humans using the system had got so used to it, and were so busy working with clients, that instead of typing us an email explaining exactly what they did, they just clicked again and then it worked (most of the time… sometimes it took two clicks or three, but never more than say 5 clicks).
We added another callback to check when an Ajax call returned an empty result, and each time that happened we simulated a distinct page view to Google Analytics, something like “weird_empty_result_occurred.php”, with a bit of extra information we thought we might need appended to the end of the simulated page view.
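Our real hook lived in the browser, inside jQuery’s Ajax handling, but the idea is language-neutral. Here is a rough Python sketch of the same logic. The `handle_response` function, the `action` parameter and the in-memory `recorded` list are stand-ins I made up for illustration; a real setup would send the path to the analytics tool instead.

```python
from urllib.parse import quote

# Stand-in for the analytics backend: every recorded virtual page view.
recorded = []


def virtual_path_for(action, body):
    """Return a distinct virtual page path for an empty response, else None."""
    if body is None or body.strip() == "":
        # Extra detail goes on the end so it shows up next to the page view.
        return "/weird_empty_result_occurred.php?action=" + quote(action)
    return None


def handle_response(action, body, record=recorded.append):
    """Record a virtual page view for empty responses; pass the body through."""
    path = virtual_path_for(action, body)
    if path is not None:
        record(path)
    return body


handle_response("save_invoice", "")
handle_response("view_client", "<div>ok</div>")
```

Only the empty response leaves a trace, so the report for that one virtual page becomes a log of every time the weird error happened, whether or not anybody emailed us about it.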
Within mere days we saw a pattern. The errors occurred on the hour, every hour, from around 08:00 in the morning up to around 16:00 in the afternoon – and of course on and off during the night. This pointed us to our background scripts, and we then realised there’s one script that runs each hour which touches the accounting / financial tables in the database. It dawned on us, by looking at Google Analytics, that the error occurred only when trying to insert / update something in the financial side of the system. The penny dropped and we realised our script was locking up the tables. The fix was rather easy, and it was a problem we would otherwise have spent weeks on end looking for, burning hundreds or thousands of hours. We didn’t. We saved time. We saved money.
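We spotted the "on the hour" pattern by eye in the GA graphs, but the same check is easy to sketch in Python: bucket the error timestamps by minute of the hour and see whether they pile up near minute zero. The timestamps and the two-minute window below are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps of recorded empty-result events.
events = [
    datetime(2012, 5, 14, 8, 0, 12),
    datetime(2012, 5, 14, 9, 1, 3),
    datetime(2012, 5, 14, 9, 37, 40),
    datetime(2012, 5, 14, 10, 0, 55),
    datetime(2012, 5, 14, 11, 0, 8),
]


def events_by_minute_of_hour(timestamps):
    """Count events per minute of the hour (0 to 59), ignoring the hour itself."""
    return Counter(ts.minute for ts in timestamps)


def share_near_top_of_hour(timestamps, window=2):
    """Return the fraction of events in the first `window` minutes of any hour."""
    near = sum(1 for ts in timestamps if ts.minute < window)
    return near / len(timestamps)


by_minute = events_by_minute_of_hour(events)
share = share_near_top_of_hour(events)
```

If errors were random, only a small share would land in the first two minutes of an hour. A large share is a strong hint that something scheduled is involved.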
Identifying reasons for call centre floods without asking!
Any company, especially public-facing service-oriented companies (like ISPs), gets days where all hell breaks loose and nobody knows exactly why. A nifty side-effect of having internal analytics (especially long-term telemetry) is that you can spot which sections of the system your support staff are suddenly using more of. Take for example a mess-up where the wrong invoices are sent out and people think they’re being over-billed.
When you have sent out the wrong invoices via email, you bet your bottom dollar the financial section of the system is going to get hit hard – both in your public facing online client zone, as well as in your internal client management system. Imagine the scenario where everything was calm up to 10:00 in the morning, suddenly the phones start to ring off the hook having 10 people waiting in the queue to talk to somebody for every support agent you have working in your company. Nobody has the time to stop, stand up & figure out what the heck is going on.
By looking at the real time analytics, one sees that there is normal usage of the network testing systems, normal usage of the new client fraud check systems, normal usage of just about any system – except for the financial system, and specifically the system looking at invoices. It doesn’t take a genius to figure out that “something is wrong directly or indirectly with invoices and people’s money”. You then realise that the invoice email script kicks off at 10:00 each morning, put two and two together and after checking one or two invoices realise that we multiplied every cent by say 1000 and not 100 – oops.
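The "everything is normal except invoices" check can be sketched in a few lines of Python: compare the current hits per section against what is usual for that time of day and flag anything far above it. The section names, the numbers and the factor of 3 are assumptions for illustration.

```python
def spiking_sections(baseline, current, factor=3.0):
    """Return (section, ratio) pairs where current hits exceed factor x baseline.

    Sections with no baseline are compared against 1 hit to avoid dividing by zero.
    The busiest ratio comes first.
    """
    flagged = []
    for section, hits in current.items():
        usual = max(baseline.get(section, 1), 1)
        ratio = hits / usual
        if ratio > factor:
            flagged.append((section, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical page views per section for the same hour on a normal day and today.
baseline = {"network_tests": 120, "fraud_checks": 40, "invoices": 60}
current = {"network_tests": 130, "fraud_checks": 38, "invoices": 540}

spikes = spiking_sections(baseline, current)
```

A single flagged section is usually all the hint you need to go and look at what changed there in the last hour.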
Quick, send out an apology email, put something on the hold-system for the phones, send text messages out, anything to calm the mob!
A variation of this happened to us (luckily not something stupid like the wrong multiplier, but something equally small yet massive in impact). Twitter and Facebook were on fire; we were being attacked on all fronts. When those first few hundred corrective emails and text messages hit, our clients (the mob) started to relay the same message on Twitter, Facebook and other mediums, meaning some people heard / read about the problem, and that we were fixing it, even before they received their text message. What would’ve been an absolute PR nightmare turned into something positive. Messages of thanks for the communications started to stream in. A bunch of “wow, you guys really responded fast to the problem, ‘you da best’” messages flew around. And we even had a few clients stand up for us and get really upset with the usual suspects who just hated us for the sake of hating (we all know those clients).
There’s no way of counting how much money having the analytics saved us that day… even if it saved us only a few minutes in time to figure out what exactly was going on… but you can count on it that it was a lot of money! (Especially in future sales / loss of sales / cancellations).
Identifying misuse of your precious system without prowling around with hawk eyes!
We once had a heavy report that was made available only to top management (because it was so heavy on the DB). We discussed doing some updates to the database design, creating some redundancy in the information we save, in order to make the report run faster. But because this report was only supposed to be run once or twice a week and the amount of data wasn’t that much, we put off the (costly) changes. Sound familiar? Yeah, we all do these things. A year and a half later, a campaign was run to get more sales in a certain area of the business. An area where this report is crucial to showing whether the campaign is successful or not.
A junior developer did the small project, which took him two days, and it was launched. Then, suddenly, servers caught fire and satellites fell out of orbit onto our servers. (Sure, it wasn’t that bad, but that’s how it felt when we didn’t know WTF.) There were multiple projects running at the same time, with rollouts on a daily basis, and we didn’t know which section of code caused the problems. Turns out, none of the new code caused the problem:
We soon realised (again, you predicted it) by looking at Google Analytics that the report I mentioned above which runs large SQL queries and does a whole bunch of complex calculations was running often. In fact, the report was running every 60 seconds!!! It turns out, a bunch of people who had a vested interest in the sales campaign got access to the report a few months back, and today they were refreshing the report every now & then – “not that often, every 10 or so minutes” was one of their responses. Problem is, there was a team of people doing the same.
A quick “stop it!” was all that was needed to get rid of the high load on the server. Sure, we made some tweaks, put in some restrictions on how frequently the report can be run, cached the results and did some more cool stuff. Fact is, once again, having analytics on the internal system saved us from buying a bigger server, or whatever lengths we would’ve gone to, because users who didn’t know any better (and that’s fine!) were abusing / misusing a system that wasn’t intended to be hit so hard.
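The "restrict the frequency and cache the results" part can be as simple as a time-based cache in front of the expensive report. Below is a minimal Python sketch. The `CachedReport` class, the 10 minute maximum age and the injectable clock are my own illustration, not the code we actually shipped.

```python
import time


class CachedReport:
    """Rebuild an expensive report at most once per `max_age_seconds`."""

    def __init__(self, build_report, max_age_seconds=600, clock=time.monotonic):
        self._build = build_report
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._cached = None
        self._built_at = None

    def get(self):
        """Return the cached report, rebuilding it only when it is too old."""
        now = self._clock()
        if self._built_at is None or now - self._built_at >= self._max_age:
            self._cached = self._build()
            self._built_at = now
        return self._cached


# Usage with a fake clock: many refreshes, but only two real report runs.
runs = []
fake_now = [0.0]


def build():
    runs.append(1)
    return "report #%d" % len(runs)


report = CachedReport(build, max_age_seconds=600, clock=lambda: fake_now[0])
first = report.get()
fake_now[0] = 60.0
second = report.get()   # served from cache, no new run
fake_now[0] = 601.0
third = report.get()    # cache expired, report is rebuilt
```

However many people hit refresh, the database only sees one heavy run per ten minutes.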
Security, no really… analytics = information, information = power i.t.o. security.
Lastly, a fun one. We’re based in South Africa and as with all internal systems one needs to make sure that security is rock-solid around the system. We were locking the system down to a VPN (and before that the IP address of the office). When we installed Google Analytics for the first time, we played around with it and set up triggers to email us things about geography, page views, and what not.
Years later (about 4 years later actually), we were tight on support staff. We decided to outsource overflow work to a reputable company in India. This, I was unaware of at the time as we were busy with other projects and I was not briefed.
A VPN was set up for them, and they started to provide basic support while they were trained on our products and so on.
Next thing you know, I get the fright of my life… “oh no! we’re compromised! our internal system is being accessed from India!!”. I frantically locked down the system – clearly overreacting. Minutes later, after the Indian company complained that they couldn’t get into the system, I was briefed about the situation.
It was funny, and I actually caused more hassle than good, but what if this was a real attack on our system? If it was real, I would’ve saved the day… no… wait… having analytics would’ve saved the day.
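The trigger that scared me boils down to a very small check: flag any traffic from a country you do not expect. A minimal Python sketch follows, assuming the analytics data is available as hits per two-letter country code. The `EXPECTED_COUNTRIES` set and the sample numbers are hypothetical.

```python
# Assumption: an internal system that should only be used from South Africa.
EXPECTED_COUNTRIES = {"ZA"}


def unexpected_countries(hits_by_country, expected=EXPECTED_COUNTRIES):
    """Return the countries (and hit counts) that are not on the expected list."""
    return {
        country: hits
        for country, hits in hits_by_country.items()
        if country not in expected and hits > 0
    }


# Hypothetical daily hits per country code.
alerts = unexpected_countries({"ZA": 950, "IN": 42})
```

Had I been briefed, adding "IN" to the expected set would have been the whole change, and nobody would have been locked out.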
Use analytics. It’s great. True story!
Seriously, there are so many examples, things that we take for granted today, that show how having some kind of analytics system helps you make your internal systems better, helps you protect them and helps you generally just figure out how people are using (and misusing) them.
On Wednesday we joined Gian Visser for the MyBroadband2012 event in the Vodacom World center. One of the many awards that they hand out each year, among great ones like the honorary award to Pieter Uys for his role in the innovation of the broadband landscape, is the ISP of the Year Award, which went to Afrihost this year. This is the second year in a row that Afrihost won the award!
Ever since Afrihost changed the face of broadband in South Africa in 2009 by charging R29 per GB (when the going rate for a gig was no less than R70) it’s been quite a roller coaster ride – and I love roller coasters! Funnily enough, Axxess (whom we acquired in 2011) won the award in 2009, so kudos to them for that one!
Then in 2011 Afrihost won the award for the first time!
Naturally, winning these awards is a collective effort of the entire organisation… from the CEO, Directors, GM, Management Team, Branding, Support, Accounting, Sales, Operations, Dev… any department all the way through to the cleaning staff is equally responsible for making us achieve the award and goals we set out to achieve in 2012. I would like to, however, single my team out in this post…
I’d like to commend the Dev Team at Afrihost for the exceptional performance over the past year in helping Afrihost to achieve this goal. You guys have really made Afrihost proud and are exceeding the mantra “Wouldn’t it be (incredible|breathtaking|amazing|awe-inspiring|extraordinary)+ if…”. It is indeed amazing to be part of a team where the systems have been written well enough and fast enough to achieve an award like this for a second year in a row.
I’d like to single out the developers and engineers who were part of my little world leading up to this:
- Sacheen Dhanjie – it’s always important to have a guy in the team who’s passionate about good architecture, proper use of design patterns and reusability of code – these developers are the cornerstone of code that will live 10 years from now and still perform optimally and be robust.
- Dale Attree – organisations like Afrihost cannot do without having what I call a gunslinger – these developers are people who know where the fine line lies between working fast and efficiently and being seconds too late/slow; they make snap decisions on what action to take to minimize damage to systems and data, and understand that sometimes you have to do something fast and fix data later – because it will make the organisation tons of money being fast-to-market vs seconds/minutes/days/weeks/months too late.
- Warren Clifton – rarely do you find a developer who understands people like this guy – the so-called “soft skills” that many people refer to lately are absolutely key to developers in the modern era. Gone are the days where you have a guy in a white coat with thick glasses sitting in a dungeon doing machine code that changes the world; the era where developers and top management have to have a close working relationship is here – and this guy understands that probably better than most of us. Not only does he understand it upward, but when I need something communicated interdepartmentally (for instance from Dev to Support/Sales/Accounting/etc) there is no better person than Warren. On a side-note, Warren is also one of the developers who’s grown his knowledge and expertise the fastest.
- Andrew McGill – not officially being part of the dev team has not stopped Andrew from being one of the most valuable and exceptional R&D specialists and technologists, contributing tons to our automated and intelligent processes; more often than not all of us find ourselves wanting when it comes to understanding the underlying technologies – and Andrew is exceptionally good at not only understanding the technologies that we use, but also utilising them to save thousands of man-hours!