Should you build or buy generative AI?

Whether it’s text, images, video or, more likely, a combination of multiple models and services, taking advantage of generative AI is a ‘when, not if’ question for organizations.

Since the release of ChatGPT last November, interest in generative AI has skyrocketed. It’s already showing up in the top 20 shadow IT SaaS apps tracked by Productiv for business users and developers alike. But many organizations are limiting use of public tools while they set policies to source and use generative AI models. CIOs want to take advantage of the technology, but on their own terms and with their own data.

As so often happens with new technologies, the question is whether to build or buy. For generative AI, that’s complicated by the many options for refining and customizing the services you can buy, and the work required to make a bought or built system into a useful, reliable, and responsible part of your organization’s workflow. Organizations don’t want to fall behind the competition, but they also want to avoid embarrassments like going to court, only to discover the legal precedent cited was made up by a large language model (LLM) prone to generating a plausible rather than factual answer.

From taking to making

Rather than a rigid distinction between building and buying such complex technology, Eric Lamarre, the senior partner leading McKinsey Digital in North America, suggests thinking in terms of taking, shaping and—in a very few cases—making generative AI models.

“As a ‘taker,’ you consume generative AI through either an API, like ChatGPT, or through another application, like GitHub Copilot, for software acceleration when you do coding,” he says. Finished apps that include generative AI may not offer much competitive differentiation, and answers they produce won’t always be perfect. But you want to adopt them to avoid competitive disadvantage, especially as they often arrive as new features in applications that staff already know how to use. “Every company will be doing that,” he adds. “In the shaper model, you’re leveraging existing foundational models, off the shelf, but retraining them with your own data.”

That reduces the ‘hallucination’ problem and gets you more accurate and relevant results.

“Contact center applications are very specific to the kind of products that the company makes, the kind of services it offers, and the kind of problems that have been surfacing,” he says. A general LLM won’t be calibrated for that, but you can recalibrate it to your own data, a process known as fine-tuning. Fine-tuning applies to both hosted cloud LLMs and open source LLMs you run yourself, so this level of ‘shaping’ doesn’t commit you to one approach.

McKinsey tried to speed up writing evaluations by feeding transcripts of evaluation interviews to an LLM. But without fine-tuning or grounding it in the organization’s data, it was a complete failure, according to Lamarre. “The LLM didn’t have any context about the different roles, what kind of work we do, or how we evaluate people,” he says.

Generative AI services like ChatGPT and GPT-4 that support plugins let you augment the LLM by connecting it to APIs that retrieve real-time information or business data from other systems, add other types of computation, or even take actions like opening a ticket or making a booking. That includes curated data, like a legal database, in the same way you might add a commercial weather prediction service to a more traditional machine learning (ML) model for generating routes or predicting shipping times rather than build your own weather model from scratch.
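
As a rough illustration of the plugin idea, the sketch below shows only the application side: the model (not shown) emits a structured request naming a tool, and the host code looks that tool up in an allow-list and runs it. The tool names `get_weather` and `open_ticket`, the JSON request format, and the canned return values are all hypothetical; a real plugin would call an actual weather or ticketing API and follow the vendor's own request schema.

```python
import json
from typing import Callable, Dict

# Hypothetical tools; real ones would call a weather service or ticketing API.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "placeholder forecast"}

def open_ticket(summary: str) -> dict:
    return {"ticket_id": "TICKET-1", "summary": summary}

# Allow-list of tools the model is permitted to invoke.
TOOLS: Dict[str, Callable[..., dict]] = {
    "get_weather": get_weather,
    "open_ticket": open_ticket,
}

def dispatch(model_request: str) -> dict:
    """Run the tool named in a JSON request such as
    {"tool": "open_ticket", "arguments": {"summary": "..."}} (assumed format)."""
    request = json.loads(model_request)
    name = request.get("tool")
    if name not in TOOLS:
        # Refuse anything outside the allow-list rather than guessing.
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**request.get("arguments", {}))

if __name__ == "__main__":
    print(dispatch('{"tool": "open_ticket", "arguments": {"summary": "VPN down"}}'))
```

Keeping the tools in an explicit allow-list means the model can only trigger actions the organization has deliberately exposed.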

Shaping will involve more than simply building an LLM into your own applications and processes, and organizations will need more sophisticated capabilities, Lamarre warned. “To get good output, you need to create a data environment that can be consumed by the model,” he says. “You need to have data engineering skills, and be able to recalibrate these models, so you probably need machine learning capabilities on your staff, and you need to be good at prompt engineering. So how do I coach my people to ask the right questions to get the best output?”

He cautioned CIOs against ‘shiny object syndrome’ with generative AI, especially if they haven’t already built up expertise in ML. “The reality that’s going to hit home in the next six to 12 months is generative AI is just as difficult as ‘traditional’ AI,” he says.

But with those skills, shaping generative AI systems created from existing models and services will deliver applications most likely to offer competitive differentiation. However, making will be even more challenging and, most likely, rare, Lamarre predicts.

Buy in or lose out

For smaller organizations, building their own generative AI seems daunting to even consider. That’s the view of Peter Kim, CIO of The Contingent, a non-profit supporting vulnerable children, families, and young professionals, even though 10 of its 60 staff work in technology and data research.

There’s a crisis in child welfare with support needs outpacing capacity, and he’s interested in how generative AI can help profile audiences, evaluate messaging around the continuum of opportunities for volunteering, match applicants with internships, and even reduce the time it takes to recruit new staff.

That will start with using the Copilot features Microsoft is introducing in many products, including in Cloud for Nonprofit. “It would seem almost foolish to pass this up, because it’s just going to become part of the norm,” he says. “If you’re not using it, you’re going to get left behind.”

But Kim also plans to customize some of the generative AI services available. He expects them to be particularly helpful for writing data queries and for coding the many connectors the non-profit has to build for the disparate, often antiquated, systems government and private agencies use. In addition, he hopes to understand nuances of geographical and demographic data, and to extract insights from historical data and compare them to live data to identify patterns and opportunities to move quickly.

Rather than devote resources to replicate generative AI capabilities already available, that time and effort will go to automating existing manual processes and exploring new possibilities. “We’re not imagining utilizing AI to do the same things just because that’s the way we’ve always done it,” he says. “With this new superpower, how should we develop or refine refactoring these business processes?”

Buying rather than building will make it easier to take advantage of new capabilities as they arrive, he suggests. “I think one of the success of organizations in being able to utilize the tools that are becoming more readily available will lie in the ability to adapt and review.”

In a larger organization, using commercially available LLMs that come with development tools and integrations will allow multiple departments to experiment with different approaches, discover where generative AI can be useful, and get experience with how to use it effectively. Even organizations with significant technology expertise like Airbnb and Deutsche Telekom are choosing to fine-tune LLMs like ChatGPT rather than build their own.

“You take the large language model, and then you can bring it within your four walls and build that domain piece you need for your particular company and industry,” National Grid group CIDO Adriana Karaboutis says. “You really have to take what’s already there. You’re going to be five years out here doing a moonshot while your competitors layer on top of everything that’s already available.”

Panasonic’s B2B Connect unit used the Azure OpenAI Service to build its ConnectAI assistant for internal use by its legal and accounting teams, as well as HR and IT, and the reasoning was similar, says Hiroki Mukaino, senior manager for IT & digital strategy. “We thought it would be technically difficult and costly for ordinary companies like us that haven’t made a huge investment in generative AI to build such services on our own,” he says.

Increasing employee productivity is a high priority and rather than spend time creating the LLM, Mukaino wanted to start building it into tools designed for their business workflow. “By using Azure OpenAI Service, we were able to create an AI assistant much faster than build an AI in-house, so we were able to spend our time on improving usability.”

He also views the ability to further shape the generative AI options with plugins as a good way to customize it to Panasonic’s needs, calling plugins important functions to compensate for the shortcomings of the current ChatGPT.

Grounding cloud LLMs in your own data by using vector embeddings is already in private preview in Azure Cognitive Search for the Azure OpenAI Service.

“While you can power your own copilot using any internal data, which immediately improves the accuracy and decreases the hallucination, when you add vector support, it’s more efficient retrieving accurate information quickly,” Microsoft AI platform corporate VP John Montgomery says. That creates a vector index for the data source—whether that’s documents in an on-premises file share or a SQL cloud database—and an API endpoint to consume in your application.
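
The vector index Montgomery describes can be sketched in miniature as follows. This is not the Azure Cognitive Search API: the `embed` function here is a toy stand-in that uses word counts, where a real system would call an embedding model, and `build_index`, `search`, and the sample documents are illustrative names and data. The point is only to show the shape of the technique: embed each document once, embed the query, and return the closest match by cosine similarity.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]

def embed(text: str) -> Vector:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return dict(Counter(text.lower().split()))

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(documents: List[str]) -> List[Tuple[str, Vector]]:
    """Embed every document once, up front."""
    return [(doc, embed(doc)) for doc in documents]

def search(index: List[Tuple[str, Vector]], query: str, top_k: int = 1) -> List[str]:
    """Return the top_k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Illustrative documents, not real policy text.
DOCS = [
    "The health plan covers two pairs of eyeglasses per year",
    "Expense reports are due by the fifth of each month",
    "Remote work requires manager approval",
]

if __name__ == "__main__":
    print(search(build_index(DOCS), "how many eyeglasses does the health plan cover"))
```

The retrieved passage would then be placed in the prompt so the model answers from it rather than from memory.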

Panasonic is using this with both structured and unstructured data to power the ConnectAI assistant. Similarly, professional services provider EY is chaining multiple data sources together to build chat agents, which Montgomery calls a constellation of models, some of which might be open source models. “Information about how many pairs of eyeglasses the company health plan covers would be in an unstructured document, and checking the pairs claimed for and how much money is left in that benefit would be a structured query,” he says.

Use and protect data

Companies taking the shaper approach, Lamarre says, want the data environment to be completely contained within their four walls, and the model to be brought to their data, not the reverse. While whatever you type into the consumer versions of generative AI tools is used to train the models that drive them (the usual trade-off for free services), Google, Microsoft and OpenAI all say commercial customer data isn’t used to train the models.

For example, you can run Azure OpenAI over your own data without fine-tuning, and even if you choose to fine-tune on your organization’s data, that customization, like the data, stays inside your Microsoft tenant and isn’t applied back to the core foundation model. “The data usage policy and content filtering capabilities were major factors in our decision to proceed,” Mukaino says.

Although the copyright and intellectual property aspects of generative AI remain largely untested by the courts, users of commercial models own the inputs and outputs of their models. Customers with particularly sensitive information, like government users, may even be able to turn off logging to avoid the slightest risk of data leakage through a log that captures something about a query.

Whether you buy or build the LLM, organizations will need to think more about document privacy, authorization and governance, as well as data protection. Legal and compliance teams already need to be involved in uses of ML, but generative AI is pushing the legal and compliance areas of a company even further, says Lamarre.

Unlike supervised learning on batches of data, an LLM will be used daily on new documents and data, so you need to be sure data is available only to users who are supposed to have access. If different regulations and compliance models apply to different areas of your business, you won’t want users in those areas to get the same results.
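
One way to picture that requirement is to filter documents by the user's permissions before anything reaches the model. The sketch below is a minimal illustration under assumed names: the `Document` class, the `allowed_groups` field, and the group labels are hypothetical, and a production system would take permissions from its identity provider rather than hard-coded sets.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set

@dataclass(frozen=True)
class Document:
    title: str
    text: str
    allowed_groups: FrozenSet[str]  # hypothetical access-control field

def visible_documents(docs: List[Document], user_groups: Set[str]) -> List[Document]:
    """Keep only documents that share at least one group with the user."""
    return [doc for doc in docs if doc.allowed_groups & user_groups]

def build_context(docs: List[Document], user_groups: Set[str]) -> str:
    """Concatenate only the text this user is allowed to see."""
    return "\n".join(doc.text for doc in visible_documents(docs, user_groups))

if __name__ == "__main__":
    library = [
        Document("UK leave policy", "UK policy text", frozenset({"uk-staff"})),
        Document("US leave policy", "US policy text", frozenset({"us-staff"})),
    ]
    print(build_context(library, {"uk-staff"}))
```

Filtering before retrieval, rather than asking the model to withhold information, keeps restricted text out of the prompt entirely.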

Source and verify

McKinsey added internal data to a generative AI tool that Lamarre describes as ‘a copilot for consultants,’ which can be calibrated to use public or McKinsey data. It produced good answers, but the company was still concerned they might be fabricated. “We can’t be in the business of being wrong,” he says. To avoid that, the tool cites the internal reference an answer is based on, and the consultant using it is responsible for checking its accuracy.

But employees already have that responsibility when doing research online, Karaboutis points out. “You need intellectual curiosity and a healthy level of skepticism as these language models continue to learn and build up,” she says. As a learning exercise for the senior leadership group, her team created a deepfake video of her with a generated voice reading AI-generated text.

Apparently credible internal data can be wrong or just out of date, too, she cautioned. “How often do you have policy documents that haven’t been removed from the intranet or the version control isn’t there, and then an LLM finds them and starts saying ‘our maternity policy is this in the UK, and it’s this in the US.’ We need to look at the attribution but also make sure we clean up our data,” she says.

Responsibly adopting generative AI mirrors lessons learned with low code, like knowing what data and applications are connecting into these services: it’s about enhancing workflow, accelerating things people already do, and unlocking new capabilities, with the scale of automation, but still having human experts in the loop.

Shapers can differentiate

“We believe generative AI is beneficial because it has a much wider range of use and flexibility in response than conventional tools and service, so it’s more about how you utilize the tool to create competitive advantage rather than just the fact of using it,” Mukaino says.

Reinventing customer support, retail, manufacturing, logistics, or industry-specific workloads like wealth management with generative AI will take a lot of work, as will setting usage policies and monitoring the impact of the technology on workflows and outcomes. Budgeting for those resources and timescales is essential, too. It comes down to whether you can build and rebuild faster than competitors that are buying in models and tools that let them create applications straight away, and let more people in their organization experiment with what generative AI can do.

General LLMs from OpenAI, and the more specialized LLMs built on top of their work like GitHub Copilot, improve as large numbers of people use them: code generated by GitHub Copilot has become significantly more accurate since it was introduced last year. You could spend half a million dollars and get a model that only matches the previous generation of commercial models, and while benchmarking isn’t always a reliable guide, commercial models continue to show better results on benchmarks than open source models.

Be prepared to revisit decisions about building or buying as the technology evolves, Lamarre warns. “The question comes down to, ‘How much can I competitively differentiate if I build versus if I buy,’ and I think that boundary is going to change over time,” he says.

If you’ve invested a lot of time and resources in building your own generative models, it’s important to benchmark not just how they contribute to your organization but how they compare to the commercially available models your competition could adopt today, paying 10 to 15 cents for around a page of generated text, not what they had access to when you started your project.
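
To put the article's two figures side by side, the back-of-the-envelope calculation below asks how many pages of commercially generated text the half-million-dollar build budget mentioned earlier would buy at 10 to 15 cents a page. It is purely illustrative: it ignores staffing, hosting, and integration costs on both sides, and the function name is made up for this sketch. Amounts are kept in whole cents to avoid floating-point rounding.

```python
BUILD_COST_CENTS = 50_000_000      # "half a million dollars", expressed in cents
PRICE_PER_PAGE_CENTS = (10, 15)    # "10 to 15 cents for around a page"

def break_even_pages(build_cost_cents: int, price_per_page_cents: int) -> int:
    """Whole pages of generated text the build budget would buy at this price."""
    return build_cost_cents // price_per_page_cents

if __name__ == "__main__":
    for price in PRICE_PER_PAGE_CENTS:
        pages = break_even_pages(BUILD_COST_CENTS, price)
        print(f"At {price} cents per page: {pages:,} pages")
```

On these assumptions the budget corresponds to between roughly 3.3 million and 5 million generated pages.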

Major investments

“The build conversation is going to be reserved for people who probably already have a lot of expertise in building and designing large language models,” Montgomery says, noting that Meta builds its LLMs on Azure, while Anthropic, Cohere, and Midjourney use Google Cloud infrastructure to train their LLMs.

Some organizations do have the resources and competencies for this, and those that need a more specialized LLM for a domain may make the significant investments required to exceed the already reasonable performance of general models like GPT-4.

Training your own version of an open source LLM will need extremely large data sets: while you can acquire these from somewhere like Hugging Face, you’re still relying on someone else having curated them. Plus you’ll still need data pipelines to clean, deduplicate, preprocess, and tokenize the data, significant infrastructure for training, supervised fine-tuning, evaluation, and deployment, and the deep expertise to make the right choices at every step.
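
The clean, deduplicate, and tokenize steps can be sketched at toy scale as below. This is an assumption-laden miniature, not a production pipeline: the tag-stripping regex, exact-match deduplication by hash, and the regex word splitter are simple stand-ins, where real training pipelines typically use near-duplicate detection and a trained subword tokenizer. All function names are illustrative.

```python
import hashlib
import re
from typing import Iterable, List

def clean(text: str) -> str:
    """Strip markup-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(texts: Iterable[str]) -> List[str]:
    """Drop exact duplicates (ignoring case), keeping the first occurrence."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def tokenize(text: str) -> List[str]:
    """Placeholder tokenizer: lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def prepare(raw_texts: Iterable[str]) -> List[List[str]]:
    """Clean, drop empties, deduplicate, then tokenize."""
    cleaned = [clean(text) for text in raw_texts]
    non_empty = [text for text in cleaned if text]
    return [tokenize(text) for text in deduplicate(non_empty)]

if __name__ == "__main__":
    print(prepare(["<p>Hello,  world!</p>", "hello, world!", "   "]))
```

Even at this scale the order matters: cleaning before deduplication lets the two differently formatted copies of the same sentence collapse into one.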

There are multiple collections with hundreds of pre-trained LLMs and other foundation models you can start with. Some are general, others more targeted. Generative AI startup Docugami, for instance, began training its own LLM five years ago, specifically to generate the XML semantic model for business documents, marking up elements like tables, lists and paragraphs rather than the phrases and sentences most LLMs work with. Based on that experience, Docugami CEO Jean Paoli suggests that specialized LLMs are going to outperform bigger or more expensive LLMs created for another purpose.

“In the last two months, people have started to understand that LLMs, open source or not, could have different characteristics, that you can even have smaller ones that work better for specific scenarios,” he says. But he adds most organizations won’t create their own LLM and maybe not even their own version of an LLM.

Only a few companies will own large language models calibrated on the scale of the knowledge and purpose of the internet, adds Lamarre. “I think the ones that you calibrate within your four walls will be much smaller in size,” he says.

If they do decide to go down that route, CIOs will need to think about what kind of LLM best suits their scenarios, and with so many to choose from, a tool like Aviary can help. Consider the provenance of the model and the data it was trained on. These are similar to the questions organizations have learned to ask about open source projects and components, Montgomery points out. “All the learnings that came from the open source revolution are happening in AI, and they’re happening much quicker.”

IDC’s AI Infrastructure View benchmark shows that getting the AI stack right is one of the most important decisions organizations should take, with inadequate systems the most common reason AI projects fail. It took more than 4,000 NVIDIA A100 GPUs to train Microsoft and NVIDIA’s Megatron-Turing NLG 530B model. While there are tools to make training more efficient, they still require significant expertise, and the costs of even fine-tuning are high enough that you need strong AI engineering skills to keep costs down.

Docugami’s Paoli expects most organizations will buy a generative AI model rather than build, whether that means adopting an open source model or paying for a commercial service. “The building is going to be more about putting together things that already exist.” That includes using these emerging stacks to significantly simplify assembling a solution from a mix of open source and commercial options.

So whether you buy or build the underlying AI, the tools adopted or created with generative AI should be treated as products, with all the usual user training and acceptance testing to make sure they can be used effectively. And be realistic about what they can deliver, Paoli warns.

“CIOs need to understand they’re not going to buy one LLM that’s going to change everything or do a digital transformation for them,” he says.
