Data Cleansing: An Unglamorous Necessity

By Mike Varney

“Garbage in, garbage out” – a common phrase in the data science world, and for good reason. Whilst we may believe they are, or wish them to be, computers are not magic. They’re superfast at what they do, but until technology takes a significant (and potentially scary) leap forward, you’ll only get from them exactly as much as you put in – and dealing with datasets is no different.

In the eight years as EasySAM’s senior data analyst, I’ve seen datasets on both ends of the quality spectrum – and everything in between. Comprehensive. Precise. Detailed. Lacking. Flawed. No matter how you might describe what is in front of you, one thing never changes: to accurately reflect reality, every data source (regardless of where it originated from or its assumed accuracy) must be reviewed and cleansed before it can be used for its intended purpose.

And today, we’re going to look through the cycle involved in getting that data squeaky clean.

What are we looking for?

With so many resources available for data collection on a seemingly infinite number of subjects, and the variance in quality across these sources so unpredictable, it is all too easy for issues to appear. For our SAM engagements, we’ll typically use 3 or 4 different sources from the dozen we frequently use – from baselining (Active Directory) to specific SAM tools (Flexera, SNOW, Xensam) and other useful applications (JAMF, RVTools, Smarter365).

Whatever you’re using though, don’t get too comfortable. The capture of your data may be automated; where quality checks can be missing, leading to consistency issues. If you’re working with multiple datasets that have some crossover (as EasySAM do), there will be duplicates to contend with. And some data simply comes from unreliable sources, where what you see on screen just can’t be trusted.

Input Errors:

Data entry is often a manual process completed by a human, and even the most thorough of us are susceptible to entering the “wrong” information. A simple mistype here, or a misunderstanding there and you’re already on the back foot. Do this ten times in ten different ways, and your data starts becoming very messy, very quickly. Open-ended fields (such as free text) can be the biggest culprits of this, where two values (for example, “No” and “N”) may be identical to an end user, but a computer won’t interpret them this way.

Errant Capture:

Errors are not exclusive to humans, with machines capable of malfunctioning at any time too; and producing values outside of an accepted range. A computer won’t have 64 processors in it. Its last usage date won’t have been 01/01/1900. Humans should easily catch these errors, but computers will no doubt miss them if they fit the mould of the required data type (unless a permitted range has been coded to check for such errors).

Duplicates:

To ensure a comprehensive dataset, it’s always preferable to gather as much data as possible; but that leads to a high probability of duplication across the collected data. Often, this is not a simple issue to resolve. If there is even just one character difference in a device name, a computer (and possibly a human too) won’t recognise this as a duplicate, even if it turns out they are one and the same.

Lack of Standardisation:

When using multiple data sources, finding a lack of standardisation between them is common. To achieve a satisfactory result, all data that is similar needs to be streamlined for ease of use in whatever its intended purpose is. An easy observation to make, but not always an easy solution to implement.

Standardisation issues aren’t just limited to when using multiple sources, they can even occur within a single source – particularly with open-ended fields that utilise human input. There may be language/spelling differences or use of multiple terms across teams for the same topic or subject matter.

Spotting those problems

Now that we know some of the potential causes of unclean data, the next step is determining if the data in front of us has any issues.

Range Checking:

One of the easiest things to check, and understand, when dealing with numerical values is whether a datapoint is sitting within an expected range. If your data says a device has zero software installs on it, this will get the alarm bells ringing. The same would apply if the reading was 650 installs – a minute possibility.

Regardless of what application is being used to review the data, there is likely to be an in-built function or filter you can utilise to check very quickly the minimum and maximum within a specific field. Alternatively, you can plot the data onto a graph to look for any abnormal outliers.

Categorical Checking:

Values for some fields may only be applicable if they come from a specific set of options. For example, a device can only be physical or virtual. And it can only have been produced by a single manufacturer, which must also be a valid manufacturer (you’d be very surprised to see the data telling you a device has been manufactured by Poundland or McDonalds – or the more common, but equally invalid, “To be filled by O.E.M”).

In a similar vein to range checking, most data handling applications will have a search/filter/counting function you can utilise to examine all the values within a selected field. If there are just five possible valid categories for that field, any search/count of the values in that field should produce five (or fewer) results.

Consistency:

If validation has not been put in place at the data entry stage, you may end up with a myriad of values which will technically have an identical value but will not be interpreted that way by a computer. “Dell”, “Dell Inc” and “Dell, Inc.” would be a simple example of that.

Where the list of values is relatively small, a manual review of these can be performed through either a filter or duplicate removing function to examine the unique list. If consistency issues are discovered, the incorrect values can then be replaced with the correct value.

Missing Values:

Probably the most common type of data issue is when data simply isn’t there. This is often a result of combining datasets from multiple sources but may also be because of data inadvertently being omitted during the entry process.

Just one or two missing values probably isn’t an issue, but if there is a significantly high frequency of values missing, you need to ask yourself two things: what’s the likely cause, and how critical is this data to what you are trying to convey?

Missing data can lead to incorrect assumptions being made based on the data that is available, so don’t just ignore it there appears to be an inherent problem.

Found it; now let’s clean it

We’ve identified the problem areas within our data, but how do we go about the clean-up?

Manual Correction:

If you have identified one or more consistency issues, this is a quick and easy solution to implement. In smaller categorical sets, a simple find and replace or if statement(s) can bring the desired uniformity when dealing with text-based data.

Numerical consistency errors, such as order of magnitude mismatches, are simple to fix by multiplication or division. Binary consistency issues can be corrected if you can accurately assign the non-binary input to one of the binary categories.

Data-Driven Completion:

This technique is most closely associated with filling in missing values, but it can also be used for incorrect values. To boil it down, this is a fancy way to say “guess” – but this will be a guess driven entirely by the data we have at our disposal, and not a random stab in the dark.

An example of this would be a computer where the model is listed as an “OptiPlex 7090”, but the manufacturer is blank. Using what we know about the model, we can ascertain through a quick search that the manufacturer should be “Dell” (or for consistency’s sake, however “Dell” is listed in our dataset – it may be “Dell, Inc.” for example).

The same technique can also be used if the manufacturer was not left blank, but instead showed “Apple Inc.”. We may not immediately notice this error, but another search would reveal that either the manufacturer or model field is incorrect (i.e. Apple do not sell an OptiPlex model of their devices). Here we’d require a secondary piece of data to confirm which element is correct, and which needs adjusting (the operating system would be a good indicator in this case).

Flagging:

Sometimes, it is simply not possible to eradicate all the holes in your data. At the same time, these gaps cannot be left unexplained.

This can be done with the addition of an extra column to your dataset and highlighting any missing values there. Alternatively, a dedicated page/section of your report listing any applicable areas where shortcomings are prevalent.

Removal:

Not a recommended solution, and a last resort one at that. It should always be a priority to investigate any instances of missing values, but if a certain column is less than 25% complete and is not deemed critical to the overall objective, it may be preferable to remove that column.

Additionally, it is not a good idea to remove outliers simply because they do not fit, as this can be interpreted as results doctoring.

The only widely acceptable removal would be for de-duplication purposes when working with datasets from multiple sources.

Conclusion

Data cleansing is not a glamorous process and can be the most time-consuming aspect of a data scientists’ job. However, to produce both factual and useful insights, it is vital to have a rigorous cleansing process in place ahead of any presentation of the data.

If your organisation is struggling with any data related to your software asset management practices, please do not hesitate to get in touch with us at info@easysam.co.uk and we would be delighted to discuss how we may be able to assist you.

Share on:

Tagged under: