The emergence of generative AI prompted several prominent companies to restrict its use because of the mishandling of sensitive internal data. According to CNN, some companies imposed internal bans on generative AI tools while they sought to better understand the technology, and many have also blocked the internal use of ChatGPT.

Companies still often accept the risk of using internal data when exploring large language models (LLMs) because this contextual data is what enables LLMs to change from general-purpose to domain-specific knowledge. In the generative AI or traditional AI development cycle, data ingestion serves as the entry point. Here, raw data that is tailored to a company’s requirements can be gathered, preprocessed, masked and transformed into a format suitable for LLMs or other models. Currently, no standardized process exists for overcoming data ingestion’s challenges, but the model’s accuracy depends on it.
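
Masking is one of the preprocessing steps mentioned above. A minimal sketch of what it might look like, using regular expressions to redact email addresses and simple phone-number patterns before records reach a model (the patterns and placeholder tokens here are illustrative, not production-grade PII detection):

```python
import re

# Illustrative patterns only -- real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace email addresses and simple phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

record = "Contact Jane at jane.doe@example.com or 555-123-4567."
print(mask_pii(record))
# -> Contact Jane at [EMAIL] or [PHONE].
```

In practice this step usually runs inside the ingestion pipeline itself, so sensitive values never persist in the training corpus.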

4 risks of poorly ingested data

  1. Misinformation generation: When an LLM is trained on contaminated data (data that contains errors or inaccuracies), it can generate incorrect answers, leading to flawed decision-making and potential cascading issues. 
  2. Increased variance: Variance measures how consistent a model’s answers are. Insufficient data can lead to answers that vary over time or to misleading outliers, particularly in smaller data sets. High variance may indicate that a model fits its training data but is inadequate for real-world industry use cases.
  3. Limited data scope and non-representative answers: When data sources are restrictive, homogeneous or contain mistaken duplicates, statistical errors like sampling bias can skew all results. This may cause the model to exclude entire areas, departments, demographics, industries or sources from the conversation.
  4. Challenges in rectifying biased data: If the data is biased from the beginning, “the only way to retroactively remove a portion of that data is by retraining the algorithm from scratch.” It is difficult for LLMs to unlearn answers derived from unrepresentative or contaminated data once that data has been vectorized. These models tend to reinforce their understanding based on previously assimilated answers.
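
A toy illustration of risk 3 (the figures are assumed, not from the article): a few accidentally duplicated records are enough to shift a data set’s summary statistics, and that skew then propagates into whatever a model learns from the data.

```python
from statistics import mean

# Hypothetical ticket-resolution times (hours) from one department.
clean = [2.0, 3.0, 4.0, 5.0, 6.0]

# The same feed with one record accidentally ingested three extra times.
contaminated = clean + [6.0, 6.0, 6.0]

print(mean(clean))         # 4.0
print(mean(contaminated))  # 4.75 -- the duplicates drag the average upward
```

The same mechanism operates at corpus scale: duplicated documents overweight one source or department, which is exactly the sampling bias described above.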

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. Laying the groundwork of training data in an AI model is comparable to piloting an airplane: if the takeoff angle is a single degree off, you might land on an entirely different continent than expected.

The entire generative AI pipeline hinges on the data pipelines that empower it, making it imperative to take the correct precautions.

4 key components to ensure reliable data ingestion

  1. Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata. This may also entail working with new data through methods like web scraping or uploading. Data governance is an ongoing process in the data lifecycle to help ensure compliance with laws and company best practices.
  2. Data integration: These tools enable companies to combine disparate data sources into one secure location. A popular method is extract, load, transform (ELT). In an ELT system, data sets are extracted from siloed sources, loaded into a target data pool and then transformed in place. ELT tools such as IBM® DataStage® facilitate fast and secure transformations through parallel processing engines. In 2023, the average enterprise received hundreds of disparate data streams, making efficient and accurate data transformations crucial for traditional and new AI model development.
  3. Data cleaning and preprocessing: This includes formatting data to meet specific LLM training requirements, orchestration tools or data types. Text data can be chunked or tokenized while imaging data can be stored as embeddings. Comprehensive transformations can be carried out using data integration tools. Also, there may be a need to directly manipulate raw data by deleting duplicates or changing data types.
  4. Data storage: After data is cleaned and processed, the challenge of data storage arises. Most data is hosted either on cloud or on premises, requiring companies to make decisions about where to store their data. It’s important to exercise caution when using external LLMs to handle sensitive information such as personal data, internal documents or customer data. However, LLMs play a critical role in fine-tuning or in implementing a retrieval-augmented generation (RAG)-based approach. To mitigate risks, it’s important to run as many data integration processes as possible on internal servers. One potential solution is to use remote runtime options such as IBM DataStage as a Service Anywhere.
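
A minimal sketch of the cleaning and preprocessing step described above, assuming plain-text input: exact duplicates are dropped, then each document is split into overlapping character chunks of the kind typically passed to an embedding model. The chunk and overlap sizes are illustrative; real pipelines usually chunk by tokens rather than characters.

```python
def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates while preserving first-seen order."""
    seen: set[str] = set()
    out = []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks; overlap must be < size."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

docs = ["alpha report", "beta report", "alpha report"]
unique = dedupe(docs)
print(len(unique))  # 2
print(chunk("abcdefghij", size=4, overlap=2))
# -> ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Overlapping chunks preserve context that would otherwise be severed at chunk boundaries, at the cost of some storage redundancy.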

Start your data ingestion with IBM

IBM DataStage streamlines data integration by combining various tools, allowing you to effortlessly pull, organize, transform and store the data needed to train AI models in a hybrid cloud environment. Data practitioners of all skill levels can engage with the tool by using no-code GUIs or access APIs with guided custom code.

The new DataStage as a Service Anywhere remote runtime option provides flexibility to run your data transformations. It empowers you to use the parallel engine from anywhere, giving you unprecedented control over its location. DataStage as a Service Anywhere manifests as a lightweight container, allowing you to run all data transformation capabilities in any environment. This allows you to avoid many of the pitfalls of poor data ingestion as you run data integration, cleaning and preprocessing within your virtual private cloud. With DataStage, you maintain complete control over security, data quality and efficacy, addressing all your data needs for generative AI initiatives.

While there are virtually no limits to what can be achieved with generative AI, there are limits on the data a model uses, and that data can make all the difference.

Book a meeting to learn more

Try DataStage with the data integration trial
