The AI data dilemma every CIO must address

Getting data right for AI is essential for CIOs to deliver successful outcomes from AI initiatives. That part is clear. What’s less clear is what that process entails given the nature of AI data use — and how to pay for the foundational work necessary to ensure the organization has data that’s “good” for AI.

At issue is the fact that AI makes use of data that many traditional applications don’t — and that the data best-suited for AI workflows isn’t always of the highest quality. Instead, what makes AI data “good” is that it fits the specifics of the business use cases and algorithms that use it. Consequently, it might be perfectly fine to use data that is incomplete or not “squeaky clean” — as long as it fits the use case.

Should CIOs care about this data quandary? Yes, for two reasons:

First, IT data analysts must be reoriented to produce the “right” data for AI, even if that data seems “wrong” by traditional standards. This will require revisions to data management work practices and some reorientation for data analysts tasked with working on AI.

Second, any data work, whether for traditional apps or AI, takes time and resources. It is also infrastructure-level work that no one “on the outside” — that is, the CEO and the rest of the C-suite — sees tangible value in. So how do CIOs explain the necessity of this new data prep groundwork for AI in order to obtain the budget to do it?

Accept the AI data quality paradox

“In production AI, clean data is rare — but valuable data is everywhere,” says Isha Khatana, a machine learning engineer and data analyst. Khatana goes on to say that building smarter AI systems is about embracing the reality of “logs full of typos, sensor readings that randomly freeze, category names that change monthly and values that are ‘manually adjusted’ by someone from a different team.”

This instability flies in the face of data management and governance practices for traditional IT systems — and can be hair-raising for CIOs. Nevertheless, inconsistent and fluctuating data are realities for AI if the AI is to pull data from a variety of sources that may or may not be adequately curated. As Khatana observes from her own experience, “Real data is messy. Real impact comes from making sense of it anyway.”
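The kinds of messiness Khatana describes, such as frozen sensor readings and category names that drift month to month, are often handled with small, targeted checks rather than wholesale cleansing. The following is a minimal Python sketch of two such checks; the function names, the canonical-name mapping, and the run-length threshold are illustrative assumptions, not any particular team’s pipeline:

```python
def flag_frozen_readings(values, min_run=5):
    """Flag indices belonging to runs of identical sensor values
    (a common symptom of a frozen sensor) of length >= min_run."""
    flagged = set()
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the list or when the value changes
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= min_run:
                flagged.update(range(run_start, i))
            run_start = i
    return flagged

# Hypothetical mapping of category labels that "change monthly"
CANONICAL = {
    "acct": "account", "accounts": "account",
    "cust": "customer", "customers": "customer",
}

def normalize_category(raw):
    """Map drifting category labels onto one canonical name."""
    key = raw.strip().lower()
    return CANONICAL.get(key, key)
```

The point of sketches like these is that the data isn’t discarded for failing a traditional quality bar; it is flagged or normalized just enough to be valuable to the model that consumes it.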

So how do CIOs make sense of incomplete or garbled data? First, by explaining to AI stakeholders and management that the data AI uses is by no means “normal” in terms of the data quality standards that IT traditionally sets. Using less-than-perfect data is a necessity because AI must be informed by whatever relevant data is “out there” if it is to have a full grasp of its subject domain.

This explanation about how AI uses non-standard data is important because working with non-standard data is going to require a different set of data management practices and skills from data analysts who prepare data for AI. Consequently, the CEO and other business stakeholders will see new data preparation tasks pop up in AI projects, and these new tasks will consume time, resources, and dollars. Because most of these stakeholders see data preparation as non-value-added grunt work, they won’t like what they see.

It will be up to the CIO to explain to stakeholders why AI requires working with different types of data that must be prepared differently. One way to impress the necessity of this data preparation “grunt work” is to point to the risks to the company if an AI system delivers faulty results because of an imperfect algorithm or data that wasn’t properly prepared.

Define data preparation schemes tailored to each AI project

Each AI project is unique when it comes to data preparation — but there are some overall guidelines that can be applied.

First comes the acknowledgment that, because of AI’s variegated data sources, some data incoming to AI will be less than perfect. An automated machine learning function that relies directly on the data it ingests, without necessarily screening that data for accuracy, is one example. Another example is an AI system that relies on sensor-generated data. In some cases, that data will contain jitter, which will need to be filtered out. In other cases, such as the modeling of a molecule for developing a vaccine, the incoming body of data from worldwide research might be so large that the pipeline for collecting that data must be purposefully narrowed to research that mentions the molecule being studied by name.
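The narrowing step described above amounts to a keyword gate at the front of the ingestion pipeline. A minimal Python sketch of that idea follows; the molecule names, document format, and function names are hypothetical, chosen only to illustrate the technique:

```python
import re

def mentions_molecule(text, molecule_names):
    """True if the document mentions any target molecule by name
    (whole-word, case-insensitive match)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in molecule_names) + r")\b",
        re.IGNORECASE,
    )
    return bool(pattern.search(text))

def narrow_pipeline(documents, molecule_names):
    """Keep only documents relevant to the molecules under study."""
    return [d for d in documents if mentions_molecule(d, molecule_names)]
```

In practice the gate might be a metadata filter or a semantic search rather than a literal keyword match, but the governance decision is the same: deliberately excluding data the use case doesn’t need.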

This is AI governance work, and it requires data analysis skills that go beyond traditional extract, transform, and load (ETL) processes — and into assessing different types of data within the AI context in which the data is being used.

‘Projectize’ AI data preparation tasks

It can be tempting for an AI project manager to “bury” data preparation work within other AI tasks so the data “grunt work” isn’t visible. Unfortunately, this leads to missed project deadlines and costs that exceed budgets.

It’s better to be upfront with management about the need for special AI data preparation, and to include data preparation tasks in the AI project plan. Most CEOs will understand — because no one wants a costly business misstep and a PR embarrassment because the AI went awry.