Updated: May 27, 2019
So to recap, if you know data structure, you understand how data is stored, and you can leave yourself clues to do things faster next time.
Now the other part of the equation is knowing if the data you are using is the right data. Finding data quickly doesn’t do you any good if you bring back the wrong data.
So, how do you know if the data you are using is the right data to be using? I can’t count the number of times I have asked myself that question. With just about every new analysis, project, piece of research, or whatever it is you are using data for, you have to ask that question at some point.
Even data that you have used a hundred times and that comes from a highly trusted source needs to be scrutinized.
Now if you work with data every day in a familiar format, from the same source, and with no changes to the data gathering and storage process, you don’t have to spend much time validating it. Usually you will spot problems when something just doesn’t look right while you are doing the analysis.
On the other hand, things get a whole lot trickier when you are using data from a source you don’t use often, when something has changed in the way the data is populated, or when it’s the first time you are using the data.
When this happens, I have a few suggestions on how to validate the data.
First off, pull the data, do your analysis and draw some conclusions. If it passes the eye test and it feels ok to you, then your job is just to validate it.
One simple way to do this is to pull the data again the exact same way to make sure you get the exact same data. Or change one parameter, like the dates used in the query, and see if that significantly alters the way the data looks and feels.
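Here is a rough sketch of what those two checks could look like in Python. Everything in it is made up for illustration: the orders table, the pull, fingerprint and summarize helpers, and the 25% threshold are all assumptions, not a standard. The first check runs the same query twice and compares a hash of the results. The second shifts the date window by one day and looks at whether the average daily total moves much.

```python
import hashlib
import sqlite3


def pull(conn, start, end):
    """Hypothetical pull: daily order totals between two ISO dates (inclusive)."""
    return conn.execute(
        "SELECT order_date, SUM(amount) FROM orders "
        "WHERE order_date BETWEEN ? AND ? "
        "GROUP BY order_date ORDER BY order_date",
        (start, end),
    ).fetchall()


def fingerprint(rows):
    """Hash the rows so two pulls can be compared cheaply."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


def summarize(rows):
    """Row count and average daily total: a crude 'look and feel' of a pull."""
    totals = [total for _, total in rows]
    mean = sum(totals) / len(totals) if totals else 0.0
    return {"rows": len(rows), "mean": mean}


# Tiny in-memory table standing in for the real source (made-up numbers).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("2019-05-01", 100.0), ("2019-05-01", 50.0),
        ("2019-05-02", 120.0), ("2019-05-03", 130.0),
        ("2019-05-04", 110.0), ("2019-05-05", 140.0),
    ],
)

# Check 1: pull the data again the exact same way and compare.
first = pull(conn, "2019-05-01", "2019-05-04")
second = pull(conn, "2019-05-01", "2019-05-04")
same_data = fingerprint(first) == fingerprint(second)

# Check 2: change one parameter (the dates) and see how much the picture moves.
shifted = pull(conn, "2019-05-02", "2019-05-05")
base_stats = summarize(first)
shifted_stats = summarize(shifted)
relative_change = abs(shifted_stats["mean"] - base_stats["mean"]) / base_stats["mean"]
looks_stable = relative_change < 0.25  # arbitrary threshold, pick your own

print("Same data on repeat pull:", same_data)
print("Relative change after shifting dates:", round(relative_change, 3))
```

A big swing after a small change in the dates doesn’t prove the data is wrong, but it is a good reason to take a closer look before you publish anything.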
Another option is to have someone else do the same thing independently. See if they get the same results you do. You can also find someone who knows the data to look over your work to see if it makes sense to them.
Whatever you do, the best way to prevent publishing or using bad data is to involve someone else. Not always possible, I know, but it’s the best way to go.
Another suggestion is to (1) get the data, (2) do some analysis, and then (3) step away for a while. Come back to it with fresh eyes. Don’t let your mind play tricks on you by making you see what you want to see and not what is really there.
I have seen several articles citing research showing that most of the time in data analysis is actually spent cleaning data. In a lot of businesses, the data lake has become a data swamp, clogged with bad or unusable data.
As the percentage of unstructured data increases daily, it’s easy to see how data swamps have become the norm. Even the most robust data collection and mining can go awry if the data is not trustworthy.
I can’t stress this enough. No matter how good you are at analysis, or what tool you are using to do the analysis, if you don’t understand what happens to the data before it gets to you, then you are probably not drinking from a clean lake.