The meaning of data quality for augmented AI initiatives

Disclaimer: no AI was used in the writing of this piece.
One of the principles in data management is called GIGO: Garbage In – Garbage Out. It is not a new principle, but it is not going away either. The idea in GIGO is that if you let low-quality data into processing, the outcome will also be of low quality.
Just to ensure that we are discussing the same thing, a couple of things about data quality first. The commonly encountered data quality characteristics are:
- Accuracy – Is the data correct?
- Completeness – Is the data comprehensive? Is anything missing?
- Consistency – Are the data, structures and values the same within the context? Is there any conflicting data?
- Relevance – Does the data relate to the matter at hand?
- Reliability – Can you trust the data? Is it manipulated? Synthetic?
- Timeliness – Is anything out of date included?
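Some of these characteristics can be turned into simple numbers. The following is a minimal sketch in Python, not a definitive implementation: the sample records, the field names (`email`, `country`, `updated`), the allowed country codes and the cutoff date are all made up for illustration. It scores completeness, consistency and timeliness as the share of records that pass each check.

```python
from datetime import date

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "country": "FI", "email": "a@example.com", "updated": date(2024, 5, 1)},
    {"id": 2, "country": "Finland", "email": None, "updated": date(2019, 1, 15)},
    {"id": 3, "country": "FI", "email": "c@example.com", "updated": date(2024, 6, 10)},
]

def completeness(rows, field):
    """Share of rows where the field has a non-empty value."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def consistency(rows, field, allowed):
    """Share of rows whose value belongs to the agreed set of codes."""
    return sum(1 for r in rows if r.get(field) in allowed) / len(rows)

def timeliness(rows, field, cutoff):
    """Share of rows updated on or after the cutoff date."""
    return sum(1 for r in rows if r[field] >= cutoff) / len(rows)

# Assumed rules: two-letter country codes and a cutoff of 1 January 2023.
report = {
    "completeness(email)": completeness(records, "email"),
    "consistency(country)": consistency(records, "country", {"FI", "SE", "NO"}),
    "timeliness(updated)": timeliness(records, "updated", date(2023, 1, 1)),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Accuracy, relevance and reliability are harder to score this way, since they usually need a trusted reference or a human judgement to compare against.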
In order to meet these data quality characteristics, one should start looking into methodologies and tools for:
- Data profiling – assessing the data sources and looking into the level of garbageness in each source
- Data cleansing and preprocessing – based on the profile, one can start looking into what should be done to cleanse the data before letting it go further in the process
- Dataset building and selection – which data meets the relevance and the timeliness requirements
- Data integration – how to send the data downstream securely – into the RAG pipeline, analysis tools etc.
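As a rough illustration of the cleansing and selection steps above, here is a small Python sketch. It assumes documents arrive as dictionaries with hypothetical `text`, `topic` and `updated` fields; the function names `cleanse` and `select` and the cutoff date are invented for this example and do not come from any particular tool.

```python
from datetime import date

def cleanse(rows):
    """Normalise whitespace, drop rows without text, drop case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        text = " ".join((row.get("text") or "").split())
        if not text:
            continue  # nothing usable in this row
        key = text.lower()
        if key in seen:
            continue  # same content already kept
        seen.add(key)
        cleaned.append({**row, "text": text})
    return cleaned

def select(rows, topic, cutoff):
    """Keep rows that are relevant (topic matches) and timely (not older than cutoff)."""
    return [r for r in rows if r["topic"] == topic and r["updated"] >= cutoff]

# Hypothetical raw documents headed for a RAG pipeline.
raw = [
    {"text": "  Return policy: 30 days. ", "topic": "returns", "updated": date(2024, 3, 1)},
    {"text": "return policy: 30 days.", "topic": "returns", "updated": date(2024, 3, 1)},
    {"text": "", "topic": "returns", "updated": date(2024, 4, 1)},
    {"text": "Return policy: 14 days.", "topic": "returns", "updated": date(2018, 1, 1)},
    {"text": "Office opening hours", "topic": "contact", "updated": date(2024, 2, 1)},
]

dataset = select(cleanse(raw), topic="returns", cutoff=date(2023, 1, 1))
for doc in dataset:
    print(doc["text"])
```

In this toy example, the outdated 14-day policy is exactly the kind of garbage that would otherwise contradict the current policy downstream.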
Sounds laborious. Why can’t we just use the data as is, since we have it at hand?
Well.
It really depends on what you want out of your augmented AI initiative. Taking care of data quality, and of how the data is processed and governed, has upsides:
- Improved accuracy of the AI outcome. You might actually get what you hope for.
- Improved reliability. You might actually dare to take actions based on the outcome. For example, synthetic data can degrade the quality of the model over time if it is a bit skewed from the get-go.
- Fewer hallucinations and reduced bias. Incomplete data might obfuscate some patterns.
We are certainly going through very interesting times in the field of data. New innovations and tools emerge as you read this piece. But one thing won’t change. If you put Garbage In, you get Garbage Out.
Simple as that.