Maria Korolov
Contributing writer

10 things to watch out for with open source gen AI

Feature
May 15, 202412 mins
Artificial IntelligenceCIOData Management

Open source generative AI models can be downloaded for free, used at scale without racking up API call costs, and run securely behind corporate firewalls. But don’t let your guard down. Risks still exist and some aren’t only magnified, but new ones specific to gen AI are emerging.

team of successful business people having a meeting in executive sunlit office
Credit: ESB Basic / Shutterstock

It seems anyone can make an AI model these days. Even if you don’t have the training data or programming chops, you can take your favorite open source model, tweak it, and release it under a new name.

According to Stanford’s AI Index Report, released in April, 149 foundation models were released in 2023, two-thirds of them open source. And there are an insane number of variants. Hugging Face currently tracks more than 80,000 LLMs for text generation alone and fortunately has a leaderboard that lets you quickly sort the models by how they score on various benchmarks. And these models, though they lag behind the big commercial ones, are improving quickly.

Leaderboards are a good place to start when looking at open source gen AI, says David Guarrera, generative AI lead at EY Americas, and Hugging Face in particular has done a good job benchmarking, he says.

“But don’t underestimate the value of getting in there and playing with these models,” he says. “Because they’re open source, it’s easy to do that and swap them out.” And the performance gap between open source models and their closed, commercial alternatives is narrowing, he adds.

“Open source is great,” adds Val Marchevsky, head of engineering at Uber Freight. “I find open source extremely valuable.” Not only are they catching up to proprietary models in performance, but some offer levels of transparency that closed source can’t match, he says. “Some open source models allow you to see what’s used for inference and what’s not,” he adds. “Auditability is important for preventing hallucinations.”

Plus, of course, there’s the price advantage. “If you have a data center that happens to have capacity, why pay someone else?” he says.

Companies are already very familiar with using open source code. According to Synopsys’ open source security and risk analysis released in February, 96% of all commercial code bases contained open source components.

As a result of all this experience, companies should know what to do to make sure they’re using properly-licensed code, how to check for vulnerabilities, and how to keep everything patched and up-to-date. Some of those rules and best practices, though, have particular nuances that companies might overlook. Here are the top ones.

1. Weird new license terms

The landscape of different open source license types is complicated enough. Is a project safe for commercial use, or only for non-commercial implementations? Can it be modified and distributed? Can it be safely incorporated into a proprietary code base? Now, with gen AI, there are a few new wrinkles. First, there are new license types that are only open source under a very loose definition of the term.

Take the Llama license, for example. The Llama family of models are some of the best open source LLMs out there, but Meta officially describes it as a “bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse.”

Enterprises are allowed to use the models commercially, and for developers to create and distribute additional work on top of the base Llama models, but they’re not allowed to use Llama outputs to improve other LLMs unless they are themselves Llama derivatives. And if enterprises — or their affiliates — have more than 700 monthly users, they have to request a license that Meta may or may not grant. If they use Llama 3, they have to include “Built with Llama 3” in a prominent location.

Similarly, Apple just released OpenELM under the “Apple Sample Code License,” which is also invented for the occasion and covers only copyright permissions while excluding patent rights.

Neither Apple nor Meta use commonly accepted open source licenses, but the code is, in fact, open. Apple actually released not just the code, but also the model weights, the training data set, training logs, and pre-training configurations. Which brings us to the other aspect of open source licensing. Traditional open source software is just that — code. The fact it’s open source means you can see what it does and if there are potential problems or vulnerabilities in it.

Gen AI, however, isn’t just code. It’s also the training data, model weights, and fine tuning. All of those things are critical to understand how a model works and identify potential biases. A model trained on, say, an archive of flat earth conspiracy theories will be bad at answering science questions, or a model fine-tuned by North Korean hackers might be bad at correctly identifying malware. So do open source LLMs release all that information? It depends on the model, or even on the specific release of the model since there are no standards.

“Sometimes they make the code available, but if you don’t have the fine tuning, you could spend a lot of money getting to comparable performance,” says Anand Rao, professor of AI at Carnegie Mellon University and former global AI lead at PwC.

2. Skills shortages

Open source is often a do-it-yourself endeavor. Companies can download the code, but then they need in-house expertise or hired consultants to make everything work. This is a big problem in the gen AI space. Nobody has years of experience because the technology is so new. If a company is just starting out with gen AI, or if it wants to move quickly, it’s safer to start with a proprietary platform, says Rao.

“It takes expertise to download the open source version,” he says. But once a company’s done its proof of concept, deploys the model into production, and the bills start piling up, then it might be time to look at open source alternatives, he adds.

The lack of industry expertise also creates another problem for the open source gen AI space. One of the key advantages of open source is many people look at the code and can spot programming errors, security vulnerabilities, and other weaknesses. But this “thousand eyes” approach to open source security only works if there are, in fact, a thousand eyes capable of understanding what they’re seeing.

3. Jailbreaking

LLMs are notoriously susceptible to jailbreaking, where a user gives it a clever prompt that tricks it into violating its guidelines and, say, generating malware. With commercial projects, there are highly motivated vendors standing behind them who can identify these loopholes and close them as they pop up. In addition, vendors have access to the prompts that users send to the public versions of the models, so they can monitor for signs of suspicious activity.

Malicious actors are less likely to purchase enterprise versions of the products that run in private environments, where the prompts aren’t shared back to the vendor to improve the model. With an open source project, there might not be anyone on the team whose job it is to look for signs of jailbreaking. And bad actors can download these models for free and run them in their own environments in order to test potential hacks. The bad guys also get a head start on their jailbreaking since they can see the system prompt the model uses and any other guardrails that the model developers may have built.

“It’s not just trial and error,” says Rao. Attackers can analyze training data, for example, to figure out ways to get a model to misidentify images, or go off the rails when it comes across a prompt that looks innocuous.

If an AI model adds a watermark to its output, a malicious actor might analyze the code to reverse-engineer the process in order to take the watermark out. Attackers could also analyze the model or other supporting code and tools to find areas of vulnerability.

“You can overwhelm the infrastructure with requests so the model won’t,” says Elena Sügis, senior data scientist and capability lead at Nortal, a global digital transformation consultancy. “When the model is part of a larger system, and its output is used by another part of the system, if we can attack how the model produces the output, it’ll disrupt the whole system, which could be risky for the enterprise.”

4. Training data risks

Artists, writers, and other copyright holders are suing the big AI companies right and left. But what if they believe their IP rights are being infringed on by an open source model, and the only deep pockets around are those of the enterprises that have incorporated that model into their products or services? Could enterprise users get sued?

“It’s a potential issue and nobody really knows how some of the pending litigation is going to play out,” says EY’s Guarrera. We might be headed toward a world where there’ll have to be some compensation for the data sets, he says. “The large tech players are better positioned to have the money to spend on that and weather the storm that may come around copyright.”

The big commercial vendors don’t just have money to spend on buying training data and fighting lawsuits, they also have money to spend on curated data sets, says Sügis. Free, public data sets contain more than just copyrighted content used without permission. They’re also full of inaccurate and biased information, malware, and other materials that can degrade the quality of output.

“Many model developers are talking about using curated data,” she says. “And this is more expensive than if you throw the whole internet at it to train it.”

5. New areas of exposure

Since a gen AI project is more than just the code, there are more areas of potential exposure. An LLM can be attacked by bad actors on several fronts. They could infiltrate the development team on a poorly-governed project and add malicious code to the software itself. But they could also poison the training data, fine tuning, or the weights, says Sügis.

“Hackers might retrain the model with malicious code examples, so it invades user infrastructure,” she says. “Or they can train it with fake news and misinformation.”

Another attack vector is the model’s system prompt.

“This is usually hidden from the user,” she adds. “The system prompt might have guardrails or safety rules that would allow the model to recognize unwanted or unethical behavior.”

Proprietary models don’t reveal their system prompts, she says, and having access to it could allow hackers to figure out how to attack the model.

6. Missing guardrails

Some open source groups might have a philosophical objection to even having guardrails at all on their models, or they may believe that a model will perform better without any restrictions. And some are created specifically to be used for malicious purposes. Enterprises looking for an LLM to try out might not necessarily know which category their models fall into. There’s currently no independent body evaluating the safety of open source gen AI models, says Nortal’s Sügis. Europe’s AI Act will require some of this documentation, but most of its provisions won’t go into effect until 2026, she says.

“I would try to get as much documentation as possible, and test and evaluate the model and implement some guardrails inside the company,” she says.

7. Lack of standards

User-driven open source projects are often standards-based since enterprise users prefer to have them and interoperability. In fact, according to a survey of nearly 500 technology professionals released last year by the Linux Foundation, 71% prefer open standards, compared to 10% who prefer closed. Companies producing proprietary software, on the other hand, might prefer to have their customers trapped inside their ecosystems. But if you’re expecting open source gen AI to all be standards-based, you’d be wrong.

In fact, when most people talk about AI standards, they’re talking about things like ethics, privacy, and explainability. And there’s work happening in this area, such as the ISO/IEC 42001 standard for AI management systems, released in December last year. And, on April 29, NIST released a draft plan for AI standards which covers a lot of ground, starting with creating a common language for talking about AI. It also focuses largely on risk and governance issues. But there’s not much yet when it comes to technical standards.

“It’s an incredibly nascent space,” says Taylor Dolezal, CIO and head of ecosystems at the Cloud Native Computing Foundation. “I’m seeing some good conversations around data classifications, about having a standard format for training data, for APIs, for the prompts.” But, so far, it’s just conversations.

There’s already a common data standard for vector databases, he says, but no standard query language. And what about standards for autonomous agents?

“That I haven’t seen, but I’d love to see that,” he says. “Figuring out ways to not only have agents go about their specific tasks, but also how to tie that together.”

The most common tool for creating agents, LangChain, is more of a framework than a standard, he says. And the user companies, the ones who create demand for standards, aren’t ready yet, he says. “Most end users don’t know what they want until they start playing around with it.”

Instead, he says, people are more likely to look at the APIs and interfaces of major vendors, such as OpenAI, as nascent de-facto standards. “That’s what I’m seeing folks do,” he says.

8. Lack of transparency

You might think open source models are, by definition, more transparent. But that might not always be the case. Large commercial projects may have more resources to spend on creating documentation, says Eric Sydell, CEO at analytical engine and scoreboard platform Vero AI, which recently released a report scoring major gen AI models based on areas such as visibility, integrity, legislative preparedness, and transparency. Google’s Gemini and OpenAI’s GPT-4 ranked the highest.

“Just because they’re open source doesn’t necessarily mean they provide the same information about the background of the model and how it was developed,” says Sydell. “The bigger commercial models have done a better job around that at this point.”

Take bias, for example.

“We found that the top two closed models in our ranking had quite a bit of documentation and invested time exploring the issue,” he says.

9. Lineage issues

It’s common for open source projects to be forked, but when this happens with gen AI, you get risks you don’t get with traditional software. Say, for example, a foundation model uses a problematic training data set and from it, someone creates a new model, so it’ll inherit these problems, says Tyler Warden, SVP of product at Sonatype, a cybersecurity vendor.

“There’s a lot of black box aspects to it with the weights and turning,” he says.

In fact, those problems may go several levels back and won’t be visible in the code of the final model. When a company downloads a model for its own use, the model gets even further removed from the original sources. The original base model might have fixed the issues, but, depending on the amount of transparency and communication up and down the chain, the developers working on the last model might not even be aware of the fixes.

10. The new shadow IT

Companies that use open source components as part of their software development process have processes in place to vet libraries and ensure components are up to date. They make sure projects are well supported, security issues are dealt with, and the software has appropriate license terms.

With gen AI, however, the people supposed to do the vetting might not know what to look for. On top of that, gen AI projects sometimes fall outside the standard software development processes. They might come out of data science teams, or skunkworks. Developers might download the models to play with and end up getting more widely used. Or business users themselves might follow online tutorials and set up their own gen AI, bypassing IT altogether.

The latest evolution of gen AI, autonomous agents, have the potential to put enormous power in the hands of these systems, raising the risk potential of this type of shadow IT to new heights.

“If you’re going to experiment with it, create a container to do it in a way that’s safe for your organization,” says Kelley Misata, senior director of open source at Corelight. This should fall under the responsibility of a company’s risk management team, she says, and the person who makes sure that developers, and the business as a whole, understands there’s a process is the CIO.

“They’re the ones best positioned to set the culture,” she says. “Let’s tap into the innovation and all the greatness that open source offers, but go into it with eyes open.”

The best of both worlds?

Some companies are looking for the low cost, transparency, privacy, and control of open source, but would like to have a vendor around to provide governance, long term sustainability, and support. In the traditional open source world, there are many vendors who do that, like Red Hat, MariaDB, Docker, Automattic, and others.

“They provide a level of safety and security for large enterprises,” says Priya Iragavarapu, VP of data science and analytics at AArete. “It’s almost a way to mitigate risk.”

There aren’t too many of these vendors in the gen AI space, but things are starting to change, she says.