On the wrong end of a network storm.

This happened over fifteen years ago. The company is no more, but I will make up names to protect the innocent. It is a story of a Middle Eastern telecommunications operator, a bewildered and naive engineer (me), stolen phones and the nasty network storm that followed.

I used to have a job deploying a software solution in what was then a new industry, termed MDM (Mobile Device Management). My job was to go on site somewhere and set up a server. It was a Solaris Unix machine (SPARC processors), running a very early version of JBoss. I enjoyed it at the time. I was in my twenties and had never traveled much before, and here was my company paying to fly me to new places and put me in a hotel. I could eat McDonald's and the company would pay for it! It was all pretty wow at the time.

My job was to deploy the company's software and connect it to other systems within the operator's network, mostly systems with names such as "SMSC" (short message service centre) and various 3G network elements such as "GPRS gateways". The solution I looked after would provision new phones 'over the air'. Essentially, it would send a specially crafted SMS with a payload that, upon opening, would populate connection profiles, such as internet or MMS (Multimedia Messaging Service). You might be wondering why I was not writing code. Well, it was not so easy to enter directly into a development role back then, so a common route was to start as a sysadmin.

Someone had the grand idea of running a database coupled with a network sniffer that could snoop connections on an old network called SS7 (a complete security nightmare with zero auth, but that's a story for a different time). The system would sniff for something called 'triplets'. A triplet is the set of three identifiers required for a phone to interact with a radio network.

  • IMSI: International Mobile Subscriber Identity (a unique identifier on the SIM card)
  • IMEI: International Mobile Equipment Identity (the phone's serial number)
  • MSISDN: Mobile Station International Subscriber Directory Number (the phone number, I know, right?)

The three elements of each triplet would be logged and stashed in a DB. The schema was very simple, with the IMSI as the primary key, since users could have more than one SIM with the same phone number.

If an existing MSISDN/IMSI was found in the DB, but with a different IMEI, you knew a 'SIM swap' had occurred, i.e. someone had got a new phone. A new phone would most likely need its connection settings sent via SMS. As these were metered services, a phone that could not connect was a loss of revenue. This got the sales people excited! If you can show a solution that can not only tap new revenue faster, but also prevent a loss of revenue, you're generally onto a winner!
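
To make the detection rule concrete, here is a minimal Python sketch of it. This is not the original system (which was most likely Perl in front of a Postgres DB); the class and method names are made up for illustration, the database is replaced by an in-memory dict keyed by IMSI, and staying quiet on a first sighting is my assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Triplet:
    imsi: str    # SIM identity, used as the primary key
    imei: str    # handset serial number
    msisdn: str  # phone number


class TripletStore:
    """Hypothetical in-memory stand-in for the triplet database."""

    def __init__(self) -> None:
        self._by_imsi: Dict[str, Triplet] = {}

    def observe(self, seen: Triplet) -> Optional[str]:
        """Record a sniffed triplet; return the MSISDN to provision, if any."""
        previous = self._by_imsi.get(seen.imsi)
        self._by_imsi[seen.imsi] = seen
        if previous is None:
            return None  # assumption: a first sighting triggers nothing here
        if previous.imei != seen.imei:
            return seen.msisdn  # known SIM, different handset: send settings
        return None  # same SIM in the same handset: nothing to do
```

Every value returned by `observe` would become an outgoing provisioning SMS, which is worth keeping in mind for what follows.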

What could go wrong?

I fired up a serial connection to the headless machine in the data centre, set an IP and retreated to a quieter office, where I forwarded an X session, started X Windows, and installed and deployed the server. This involved connecting a plugin that interfaced with the database / smart monitoring system that would sniff out events (most likely a cron job with a Perl script in front of a Postgres DB).

I started the server and felt like nothing was going to go wrong and everything was going really well.

Well, as it happens, it did not go well and plenty went wrong. Quite badly wrong, in fact.

We deployed direct to production. Yes, you heard that right. Back then we were not afforded the many protections that the current breed of engineers rely on. There was no decent coverage of unit and integration tests scrutinising every pull request before graduating it to staging, and only then to production. Well, at least there was not at the company that hired me.

At first the system started to whizz away, sniffing new entries. I sat near another engineer who worked for the telecommunications company. He was monitoring the SMS messaging server and we could see messages going out to new phones. It was all very exciting!

I left site a few hours later, feeling very proud of myself. I went and had some food and headed back to my hotel to drink some beers (beer drinking was limited to hotel rooms, it being an Islamic country).

What must have been six hours later I got a call that jolted me awake, right at dawn.

"Hey Luke, sorry to wake you. Yeah, this is not good, not good at all. The whole thing has crashed. I don't know. It's just... I can't even get a decent shell session on the machine (sound of ctrl-c being mashed), it's burning up and running at a crawling pace. You have to come in, my boss is escalating to senior management."

So I had no choice but to throw some clothes on, jump in a taxi and drive through the hot desert night back to the data centre. The whole time I was harboring a big sense of impending doom about what might be going on and, more importantly, whether it was my fault.

What occurred next was a blur of events, with a huge dash of imposter syndrome garnished on top, while I tried to figure out what the hell had gone wrong.

For some reason the system was seeing freakish events, such as the same phone going through around 50 SIM card changes within the space of ten minutes. I had no idea what caused this. It could not be the network lying, surely not? I never had time to play the blame game. In the end one of the folks from the customer team decided to call up one of the numbers from the frantic swap events:

(translated from Arabic)

"Hello, why do you keep swapping your phone!?.... Ah, OK. Sorry Jadda (grandmother), we are having some problems, but nothing to worry about, it's OK Jadda, inshallah we will have this fixed shortly".

So there seemed to be nothing nefarious going on. The pain continued for another few hours....

After sitting around feeling completely powerless to do anything, but acting like I was in a deep debugging session, I got a call from my company's HQ. Another team member had seen this sort of shenanigans before. I was told a patch was incoming.

Oh boy, what a relief it was to hear that. I told the customer everything was going to be OK. They looked deeply sceptical.

The Cause

At the time phone theft in Europe was humongous.

Device security was far weaker back then, on both the software and the hardware side. Every time a device was reported stolen, its serial number (the IMEI) would be recorded on a deny list. After that it could not get on the network and was basically a brick.

To get around this, stolen phones would have their firmware reflashed with a known good IMEI. The problem, though, was that the thieves would not switch things up. Instead they would use the same IMEI, over and over again. A lot of these phones would then be shipped off to the Middle East and Africa to be sold, where the fraud detection systems were less than stellar.

So, if you're not quite getting it yet: every time one of these phones was turned on, it would read as being with a new SIM card. It looked like the same small group of phones was being shared between thousands of people. This is like flashing the same MAC address on all of your network switches.

A few hours later we had written a patch to look at past associations and apply some logic. I applied this, restarted and phew! Things were running smoothly again.
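
I no longer remember the details of the patch beyond "look at past associations", so treat the following Python sketch as one hypothetical way to do it, not as what we shipped. It assumes the detector tracks the last SIM seen behind each IMEI, remembers every SIM that IMEI has ever been paired with, and goes quiet once the count passes a threshold. The names and the threshold value are invented for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Optional, Set

MAX_SIMS_PER_IMEI = 3  # hypothetical threshold, not from the original system


class CloneAwareDetector:
    """Suppress provisioning for IMEIs shared by suspiciously many SIMs."""

    def __init__(self, max_sims_per_imei: int = MAX_SIMS_PER_IMEI) -> None:
        self._last_imsi_by_imei: Dict[str, str] = {}
        self._imsis_by_imei: DefaultDict[str, Set[str]] = defaultdict(set)
        self._max = max_sims_per_imei

    def observe(self, imsi: str, imei: str, msisdn: str) -> Optional[str]:
        """Return the MSISDN to provision, or None to stay quiet."""
        last_imsi = self._last_imsi_by_imei.get(imei)
        self._last_imsi_by_imei[imei] = imsi
        self._imsis_by_imei[imei].add(imsi)  # the "past associations"

        if last_imsi is None or last_imsi == imsi:
            return None  # nothing changed for this handset
        if len(self._imsis_by_imei[imei]) > self._max:
            return None  # too many SIMs behind one IMEI: likely a clone
        return msisdn  # looks like a genuine change: send settings
```

Without the threshold check, a few thousand cloned handsets taking turns on the network would each look like a fresh swap and trigger an SMS every time, which is roughly the storm we were living through.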

I headed back to my hotel, feeling pretty good about myself again.

When I got back home, I started to apply for software engineering roles. I figured it's better to write the bug than be the poor sucker who has to deal with it blowing up in production.

This is why I believe that, in an ideal world (which is far from existing), engineers should spend a bit of time on the front line of the code they write. It certainly gives you a different perspective on how powerful your work can be, for both good and bad.