How small errors and quick fixes in “little data” can lead to major problems in big data
In the mid-1990s, a new employee of Sun Microsystems in California kept disappearing from their database. Every time his details were entered, the system seemed to eat him whole; he would disappear without a trace. No one in HR could work out why poor Steve Null was database kryptonite. The staff in HR were entering the surname as “Null,” but they were blissfully unaware that, in a database, NULL represents a lack of data, so Steve became a non-entry. To computers, his name was Steve Zero or Steve McDoesNotExist. Apparently, it took a while to work out what was going on, as HR would happily reenter his details each time the issue was raised, never stopping to consider why the database was routinely removing him.
Since the 1990s, databases have become more sophisticated, but the problem persists. Null is still a legitimate surname and computer code still uses NULL to mean a lack of data. A modern variation on the problem is that a company database will accept an employee with the name Null, but then there is no way to search for them. If you look for people with the name Null, it claims there are, well, null of them. Because computers use NULL to represent a lack of data, you’ll occasionally see it appear when a computer system somewhere has made a mistake and not retrieved the data it needs. I searched my inbox and found a few emails addressed to null and one asking about my TomTom $NULL$ device. But the holy grail is the “hot singles near null” pop-up ads. I can see how this happens. Checking if a data entry is equal to NULL is a handy step in programming.
I wrote a program to maintain a spreadsheet of all my YouTube videos. Whenever it has to enter new videos, it needs to find out where the next blank row is. So the program initially sets active_ row = 1 to start at the top row and runs this piece of code. (I’ve tidied it up slightly to make it more human-readable.)
Before you can crunch your big data, you need to collect and store it in the first place. I call this “little data”: Looking at data one piece at a time.
In a lot of computer languages, != means “not equal to.” So, for each row, it checks if the data in the first cell is not equal to null. If it’s not equal to null, it adds one to the row and keeps going until the first blank row. If my spreadsheet had rows starting with people’s surnames, then Steve Null could have broken my code (depends how clever the programming language is). Modern employee databases can go wrong during searches because they check search_term != NULL before proceeding. This is to catch all the times people hit Search without entering anything to search for. But it also stopped any searches for people named Null.
Other legitimate names can also be filtered out by well-meaning database rules. A friend of mine worked on a database for a large financial company in the U.K., and it would allow only names with three or more letters in order to filter out incomplete entries. Which was fine, until the company expanded and started employing people from other countries, including China, where two-character names are perfectly normal. The solution was to assign such employees longer, anglicized names that fit the database criteria — which feels far from satisfactory.
Big data is all very exciting, and there are all sorts of amazing breakthroughs and findings coming out to aid in analyzing massive datasets, as well as a whole new field of mathematical mistakes (which we will deal with later). But before you can crunch your big data, you need to collect and store it in the first place. I call this “little data”: Looking at data one piece at a time. As Steve Null and his relatives show us, recording data is not as easy as we’d hoped. Carrying on in the same vein as Steve Null, I’d like you to meet Brian Test, Avery Blank, and Jeff Sample.
The Null problem can be fixed by encoding names in a format for only character data, so that it doesn’t get confused with the data value of NULL. But Avery Blank has a bigger problem: humans. When Avery Blank was in law school, she had difficulty getting an internship because her applications were not taken seriously. People would see “Blank” in the surname field and assume it was an incomplete application. She always had to get in touch and convince the selection committee that she was a real human.
Brian Test and Jeff Sample fell afoul of the same problem, but for slightly different reasons. When you set up a new database, or a way to input data, it’s good practice to test it and make sure it’s all working. So you feed through some dummy data to check the pipeline. I run a lot of projects with schools, and they often sign up online. I’ve just opened my most recent such database and scrolled to the top. The first entry is from a Ms. Teacher who works at Test High School on Test Road in the county of Fakenham. She’s probably a relation of Mr. Teacher from St. Fakington’s Grammar School, who seems to sign up for everything I do. To avoid being deleted as unwanted test data, when Brian Test started a new job, he brought in a cake for all his new colleagues to enjoy. Printed on the cake was a picture of his face, with the following words written in icing: “I’m Brian Test and I’m real.” Like a lot of office problems, the issue was solved with free cake, and he was not deleted again.
It’s not just humans who are deleting people like Brian Test — often it is automated systems. People enter fake data into databases all the time, so database administrators set up automated systems to try to weed them out. An email address like null@a‑null-based-url (similar to what belongs to real human Christopher Null) is often autoblocked to cut back on spam. Recently, a friend of mine could not sign an online petition because his email address has a + sign in it: a valid character, but one often used to mass-generate email addresses and to spam online polls. So he was locked out.
If you inherit a database-killing last name, you can either wear it as a badge of honor or take some deed-poll action. But if you are a parent, please don’t give your child a first name that will set them up for a lifetime of battling computers. And given that over three hundred children in the USA since 1990 have been named Abcde, it’s worth spelling this out: Don’t name your child anything like Fake, Null, or DECLARE @T varchar(255), @C varchar(255); DECLARE Table_Cursor CURSOR FOR SELECT a.name, b.name FROM sysobjects a, syscolumns b WHERE a.id = b.id AND a.xtype =’u’ AND (b.xtype = 99 OR b.xtype = 35 OR b.xtype = 231 OR b.xtype = 167); OPEN Table_Cursor; FETCH NEXT FROM Table_Cursor INTO @T, @C; WHILE (@@FETCH_ STATUS = 0) BEGIN EXEC(‘update [‘ + @T + ‘] set [‘ + @C + ‘] = rtrim(convert(varchar,[‘ + @C + ‘]))+ ‘’”); FETCH NEXT FROM Table_Cursor INTO @T, @C; END; CLOSE Table_ Cursor; DEALLOCATE Table_Cursor;.
That last one is not even a joke. It looks like I fell asleep on my keyboard but it is actually a fully functional computer program that will scan through a database without needing to know how it is arranged. It will simply hunt down all the entries in the database and make them available to whoever managed to sneak that code into the database in the first place. It’s yet another example of online humans being jerks. Typing it in as someone’s name is not a joke either. This is known as an SQL injection attack (named after the popular database system SQL; sometimes pronounced like sequel). It involves entering malicious code via the URL of an online form and hoping whoever is in charge of the database has not put enough precautions in place. It’s a way to hack and steal someone else’s data. But it relies on the database running the code you’ve managed to sneak in. It may seem ridiculous that a database would process incoming malicious code, but without the ability to run code a modern database would lose its functionality. It’s a balancing act to keep a database secure but able to support advanced features that require running code.
Just to be completely clear: That is real code in my example. Do not type that into a database. It will mess things up. That very code was used in 2008 to attack the UK government and the United Nations — except some of it had been converted into hexadecimal values to slip by security systems looking for incoming code. Once in the database, it would unzip back into computer code, find the database entries then phone home to download additional malicious programs. This is it when it was camouflaged:
script.asp?var=random’;DECLARE%20@S%20 NVARCHAR(4000);SET%20@S=CAST(0x440045 0043004C004100520045002000400054002000760061 007200630068006100720028 . . . [another 1,920 digits] 004F00430041005400450020005400610062006 C0065005F0043007500720073006F007200%20AS% 20NVARCHAR(4000));EXEC(@S); —
Sneaky, huh? From unfortunate names to malicious attacks, running a database is difficult. And that’s even before you have to deal with any legitimate data-entry mistakes.
In Los Angeles, there is a block of land on the corner of West First Street and South Spring Street that houses the offices of the Los Angeles Times. It is just down the street from City Hall and directly across from the L.A. Police Department. There may be some rough areas of L.A. best avoided by tourists, but this is certainly not one of them. The area looks as safe as safe can be… until you check the LAPD’s online map of reported crime locations.
Between October 2008 and March 2009, there were 1,380 crimes on that block. That’s around 4% of all crimes marked on the map, in a city of more than 500 square miles. When the Times editors noticed this, they politely asked the LAPD what was going on. The culprit was the way data was encoded before going into the mapping database. All reported crimes have a location recorded, often handwritten, and this is automatically geocoded by computer to latitude and longitude. If the computer is unable to work out the location, it simply logs the default location for Los Angeles: the front doorstep of the LAPD headquarters. The LAPD fixed this with a time-honored method for taking care of a sudden glut of criminals: It transported them to a distant island. Null Island.
Null Island is a small but proud island nation off the west coast of Africa. It’s located about 370 miles south of Ghana, and you can find it by putting its latitude and longitude into any mapping software of your choice: 0,0. Fun fact: Its coordinates look like the facial expression of anyone deported there. For, you see, outside of databases, Null Island does not exist. It really does live up to its slogan: “Like No Place on Earth!”
Bad data is the scourge of databases, in particular when the data has been originally written down by fallible humans in their imprecise handwriting. Add to this the ambiguity of place names (for example, I had an office on Borough Road, and there are forty-two Borough Roads in the U.K. alone, not to mention two Borough Road Easts), and you have a roadmap to disaster. Whenever a computer cannot decipher a location, it still has to fill something in, and so 0,0 became the default location. The island where bad data goes to die. Except cartographers took this seriously. Cartography was a rather antiquated and fusty old discipline until it was swept up by the modern tech revolution. Now, they have an audience for their own brand of humor. For generations, cartographers have been sneaking fictitious places into real maps (often as a way to expose people plagiarizing their work), and it was inevitable that Null Island would take on a life of its own. So, they literally put it on the map.
If you believe their marketing material, Null Island has a thriving population, a flag, a department of tourism, and the world’s highest per capita Segway ownership. Even when the data has made it into a database, it is not safe.
All Rights Reserved for Matt Parker