Multimodal Search Image Application with Titan Embedding

Shikha.sen • 14 May, 2024 • 11 min read

Introduction

In today’s world, where data comes in various forms, including text, images, and multimedia, there is a growing need for applications to understand and process this diverse information. One such application is a multimodal image search app, which allows users to search for images using natural language queries. In this blog post, we’ll explore how to build a multimodal image search app using Titan Embeddings from Amazon, FAISS (Facebook AI Similarity Search), and LangChain, an open-source library for building applications with large language models (LLMs).

Building such an app requires combining several cutting-edge technologies, including multimodal embeddings, vector databases, and natural language processing (NLP) tools. Following the steps outlined in this post, you’ll learn how to preprocess images, generate multimodal embeddings, index the embeddings using FAISS, and create a simple application that can take in natural language queries, search the indexed embeddings, and return the most relevant images.

Prerequisites:

  • AWS Account: You’ll need an AWS account with access to Amazon Bedrock and the “amazon.titan-embed-image-v1” model, which generates image (and multimodal) embeddings. A quick sanity check appears after this list.
  • Boto3 Library: The code uses the Boto3 library to interact with AWS services. Install it using pip install boto3.
  • IAM Permissions: Your AWS account needs appropriate IAM permissions to access Bedrock and invoke the specified model.
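
Before going further, a minimal sketch like the following (assuming your credentials are already configured via aws configure or environment variables, and that us-east-1 is your Bedrock region) can confirm that the Titan multimodal embedding model is visible to your account:

import boto3

# Assumes AWS credentials are configured; the region is an assumption
bedrock = boto3.client("bedrock", region_name="us-east-1")

# List the foundation models available to this account and check for Titan
models = bedrock.list_foundation_models()["modelSummaries"]
print(any(m["modelId"] == "amazon.titan-embed-image-v1" for m in models))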

Basic Terminologies

Let us start off by understanding some basic terminologies.

AWS Bedrock

Amazon Bedrock is a fully managed service that provides a wide range of features you need to create generative AI applications with security, privacy, and responsible AI. It provides a single API for selecting high-performing foundation models (FMs) from top AI vendors like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon.

With Amazon Bedrock, you can quickly test and assess the best FMs for your use case and privately customize them with your data using RAG and fine-tuning. You can also build agents that work with your enterprise systems and data sources to complete tasks. Because Amazon Bedrock is serverless, you don’t need to manage any infrastructure, and you can safely integrate generative AI capabilities into your applications using the AWS services you are already familiar with.


Amazon Titan Embeddings

Amazon Titan Embeddings converts natural language text, including individual words, sentences, and even lengthy documents, into numerical representations (embeddings) that can power use cases like personalization, search, and clustering based on semantic similarity. Optimized for text retrieval to support Retrieval Augmented Generation (RAG) use cases, Titan Embeddings lets you combine your proprietary data with other FMs: it first converts your text into vectors, which you can then use to retrieve the most relevant passages from a vector database.

Titan Embeddings supports more than 25 languages, including English, Chinese, and Spanish. Because you can input up to 8,192 tokens, it works with single words, sentences, or full documents, depending on your use case. The model produces output vectors with 1,536 dimensions and is optimized for low latency and cost-effectiveness. Because it’s available through Amazon Bedrock’s serverless experience, you can call Titan Embeddings with a single API without managing any infrastructure.
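
As a brief illustration (model ID and region are assumptions to adjust for your account), invoking the text embedding model through Bedrock looks like this:

import boto3
import json

# Assumes Bedrock access in this region; adjust as needed
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    body=json.dumps({"inputText": "What is a vector database?"}),
    modelId="amazon.titan-embed-text-v1",
    accept="application/json",
    contentType="application/json",
)

embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # 1536 dimensions for this model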

Amazon Titan Embeddings is available in all AWS regions where Amazon Bedrock is available, including US East (N. Virginia) and US West (Oregon) AWS Regions.


Vector Databases

Vector databases are specialized databases designed to store and retrieve high-dimensional data efficiently. This data is often represented as vectors, which are numerical arrays that capture the essential features or characteristics of the data point.

  • Traditional databases store data in tables with rows and columns. Vector databases, by contrast, focus on storing and searching numerical vector representations of data.
  • They achieve this by converting data (text, images, etc.) into numerical vectors using techniques like embedding models, then indexing those vectors for fast similarity lookup.

Vector databases are powerful tools for applications that demand efficient retrieval based on similarity. Their ability to handle high-dimensional data and find semantic connections makes them valuable assets in various fields where similar data points hold significant value.
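
To make the idea concrete, here is a toy sketch (pure NumPy, with made-up two-dimensional vectors) of what a similarity search does: compare a query vector against stored vectors and return the closest match.

import numpy as np

# A tiny "database" of stored vectors and a query vector (toy values)
vectors = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])
query = np.array([0.8, 0.2])

# Cosine similarity between the query and every stored vector
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
print(scores.argmax())  # index of the most similar stored vector -> 0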


FAISS Database

FAISS (Facebook AI Similarity Search) is a free and open-source library that Meta (formerly Facebook) developed for efficient similarity search in high-dimensional vector spaces. It’s particularly well-suited for large datasets containing millions or even billions of vectors.

What Does It Do?

  • FAISS focuses on finding the nearest neighbors (most similar vectors) to a given query vector in a large dataset. This is crucial in applications that compare high-dimensional data points; a minimal example follows this list.
  • It achieves this by employing indexing techniques that organize the data efficiently for faster retrieval, including:
      • Hierarchical structures
      • Product quantization
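
Below is a minimal FAISS sketch, with random data standing in for real embeddings: it builds an exact (flat) L2 index over 1,000 vectors of 128 dimensions and retrieves the five nearest neighbors of a query.

import numpy as np
import faiss

d = 128  # vector dimensionality
xb = np.random.random((1000, d)).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(d)  # exact L2 search, no quantization
index.add(xb)

query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])  # indices of the 5 most similar stored vectors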

boto3

  • boto3 is the official Python library developed by Amazon Web Services (AWS) to interact with its extensive range of cloud services.
  • It provides a user-friendly and object-oriented interface, making it easier for developers to manage and utilize AWS resources programmatically in their Python applications, as the brief example below shows.
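
For instance, a short sketch (assuming configured credentials) that lists the S3 buckets in an account:

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])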

Step-by-Step Implementation of Multimodal Search Image Application with Titan Embedding

Step 1: Libraries Installation

!pip install \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57" \
    "langchain==0.1.16" \
    "langchain-openai==0.1.3" \
    "langchain-community==0.0.33" \
    "langchain-aws==0.1.0" \
    "faiss-cpu"

  1. boto3>=1.28.57: This is the AWS SDK for Python, the official library Amazon Web Services (AWS) provides for interacting with its vast cloud services ecosystem.
  2. awscli>=1.29.57: This is the AWS Command-Line Interface (CLI) for Python. It provides a command-line tool for interacting with AWS services directly from your terminal.
  3. botocore>=1.31.57: This is a lower-level library that underpins both boto3 and awscli. It provides the core functionality for requesting AWS services and handling responses.
  4. langchain==0.1.16: This library offers tools for building applications with large language models (LLMs), such as prompt templates, chains, and integrations with vector stores.
  5. langchain-openai==0.1.3: This extension for langchain integrates with OpenAI’s APIs, allowing you to interact with OpenAI’s LLMs like GPT-3.
  6. langchain-community==0.0.33: This extension for langchain provides community-developed tools and functionalities related to LLMs.
  7. langchain-aws==0.1.0: This extension provides AWS-specific integrations for langchain, such as Bedrock models and embeddings. As an early 0.1.0 release, its documentation and feature set were still limited.
  8. faiss-cpu: This library implements the FAISS (Facebook AI Similarity Search) library for CPU-based processing. FAISS is a powerful tool for performing efficient similarity searches in high-dimensional data.

Step 2: Importing Necessary Libraries

Now let’s import the required libraries.

import os
import boto3
import json
import base64
from langchain_community.vectorstores import FAISS
from io import BytesIO
from PIL import Image

Step 3: Generating Embeddings for Images

The first step is determining whether we are processing text or an image. The get_multimodal_vector function handles either input: it sends the text or base64-encoded image to the Amazon Titan model through Amazon Bedrock’s InvokeModel API and returns a joint embedding vector.

# This function is named get_multimodal_vector and it takes two optional arguments
def get_multimodal_vector(input_image_base64=None, input_text=None):

  # Creates a Boto3 session object, likely to interact with AWS services
  session = boto3.Session()

  # Creates a Bedrock client object to interact with the Bedrock service
  bedrock = session.client(service_name='bedrock-runtime')

  # Creates an empty dictionary to hold the request data
  request_body = {}

  # If input_text is provided, add it to the request body with the key "inputText"
  if input_text:
    request_body["inputText"] = input_text

  # If input_image_base64 is provided, add it to the request body with the key "inputImage"
  if input_image_base64:
    request_body["inputImage"] = input_image_base64

  # Converts the request body dictionary into a JSON string
  body = json.dumps(request_body)

  # Invokes the model on the Bedrock service with the prepared JSON request
  response = bedrock.invoke_model(
    body=body,
    modelId="amazon.titan-embed-image-v1",
    accept="application/json",
    contentType="application/json"
  )

  # Decodes the JSON response body from Bedrock
  response_body = json.loads(response.get('body').read())

  # Extracts the "embedding" value from the response, likely the multimodal vector
  embedding = response_body.get("embedding")

  # Returns the extracted embedding vector
  return embedding

This function serves as a bridge between your Python application and the Bedrock service. It allows you to send image or text data and retrieve a multimodal vector. This potentially enables applications like image/text search, recommendation systems, or tasks requiring capturing the essence of different data types in a unified format.
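
A quick usage sketch (it requires Bedrock access; the example text is made up, and the printed length assumes the model’s default output dimension):

vector = get_multimodal_vector(input_text="a photo of a golden retriever")
print(len(vector))  # 1024 by default for amazon.titan-embed-image-v1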

Step 4: Get Vector From File

The get_vector_from_file function takes an image file path, encodes the image to base64, generates an embedding vector using Titan Multimodal Embeddings, and returns the vector, allowing images to be represented as vectors.

# This function takes a file path as input and returns a vector representation of the content
def get_vector_from_file(file_path):

  # Opens the file in binary reading mode ("rb")
  with open(file_path, "rb") as image_file:
    # Reads the entire file content as bytes
    file_content = image_file.read()

    # Encodes the binary file content into base64 string format
    input_image_base64 = base64.b64encode(file_content).decode('utf8')

  # Calls the get_multimodal_vector function to generate a vector from the base64 encoded image
  vector = get_multimodal_vector(input_image_base64=input_image_base64)

  # Returns the generated vector
  return vector

This function acts as a wrapper for get_multimodal_vector. It takes a file path, reads the file content, converts it to a format suitable for get_multimodal_vector (base64 encoded string), and ultimately returns the generated vector representation.
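
Usage is a one-liner (the path here is hypothetical):

vector = get_vector_from_file("./animals/dog/dog_01.jpg")  # hypothetical path
print(len(vector))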

Helper Function 

The following helper walks a directory (and one level of subdirectories) and collects a vector for each JPG image it finds.

def get_image_vectors_from_directory(path_name):
  """
  This function extracts image paths and their corresponding vectors from a directory and its subdirectories.

  Args:
      path_name (str): The path to the directory containing images.

  Returns:
      list: A list of tuples where each tuple contains the image path and its vector representation.
  """

  items = []  # List to store tuples of (image_path, vector)

  # Get a list of filenames in the given directory
  sub_1 = os.listdir(path_name)

  # Loop through each filename in the directory
  for n in sub_1:
    # Check if the filename ends with '.jpg' (assuming JPG images)
    if n.endswith('.jpg'):
      # Construct the full path for the image file
      file_path = os.path.join(path_name, n)

      # Call the check_size_image function to potentially resize the image
      check_size_image(file_path)

      # Get the vector representation of the image using get_vector_from_file
      vector = get_vector_from_file(file_path)

      # Append a tuple containing the image path and vector to the items list
      items.append((file_path, vector))
    # If the entry is a subdirectory, check for JPGs inside it
    # (an isdir check avoids crashing on non-JPG files at the top level)
    elif os.path.isdir(os.path.join(path_name, n)):
      sub_2_path = os.path.join(path_name, n)  # Subdirectory path
      for n_2 in os.listdir(sub_2_path):
        if n_2.endswith('.jpg'):
          # Construct the full path for the image file within the subdirectory
          file_path = os.path.join(sub_2_path, n_2)

          # Call the check_size_image function to potentially resize the image
          check_size_image(file_path)

          # Get the vector representation of the image using get_vector_from_file
          vector = get_vector_from_file(file_path)

          # Append a tuple containing the image path and vector to the items list
          items.append((file_path, vector))
        else:
          # Print a message if a file is not a JPG within the subdirectory
          print(f"Not a JPG file: {n_2}")

  # Return the list of tuples containing image paths and their corresponding vectors
  return items

This function takes a directory path (path_name) as input and aims to create a list of tuples. Each tuple contains the path to an image file (expected to be a JPG) and its corresponding vector representation.
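
As a side note, an equivalent and more compact sketch using os.walk handles arbitrarily nested directories (the version above handles one level of nesting):

def get_image_vectors_from_directory_walk(path_name):
  """Alternative sketch using os.walk to traverse nested directories."""
  items = []
  for root, _dirs, files in os.walk(path_name):
    for name in files:
      if name.lower().endswith(".jpg"):
        file_path = os.path.join(root, name)
        check_size_image(file_path)  # resize if needed
        items.append((file_path, get_vector_from_file(file_path)))
      else:
        print(f"Not a JPG file: {name}")
  return items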

Check Image Size

def check_size_image(file_path):
  """
  This function checks if an image exceeds a predefined maximum size and resizes it if necessary.

  Args:
      file_path (str): The path to the image file.

  Returns:
      None
  """

  # Maximum allowed image size (replace with your desired limit)
  max_size = 2048

  # Open the image using Pillow library (assuming it's already imported)
  try:
      image = Image.open(file_path)
  except FileNotFoundError:
      print(f"Error: File not found - {file_path}")
      return

  # Get the image width and height in pixels
  width, height = image.size

  # Check if either width or height exceeds the maximum size
  if width > max_size or height > max_size:
    print(f"Image '{file_path}' exceeds maximum size: width: {width}, height: {height} px")

    # Calculate the difference between current size and maximum size for both dimensions
    dif_width = width - max_size
    dif_height = height - max_size

    # Determine which dimension needs the most significant resize based on difference
    if dif_width > dif_height:
      # Calculate the scaling factor based on the width exceeding the limit most
      scale_factor = 1 - (dif_width / width)
    else:
      # Calculate the scaling factor based on the height exceeding the limit most
      scale_factor = 1 - (dif_height / height)

    # Calculate new width and height based on the scaling factor
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)

    print(f"Resized image dimensions: width: {new_width}, height: {new_height} px")

    # Resize the image using the calculated dimensions
    new_image = image.resize((new_width, new_height))

    # Save the resized image over the original file (be cautious about this)
    new_image.save(file_path)

  # No resizing needed, so we don't modify the image file
  return

This function checks if an image exceeds a predefined maximum size and resizes it if necessary.
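
A simpler alternative sketch relies on Pillow’s built-in Image.thumbnail, which shrinks an image in place while preserving its aspect ratio and never exceeding the given bounds:

def check_size_image_thumbnail(file_path, max_size=2048):
  """Alternative sketch: shrink an oversized image with Pillow's thumbnail."""
  image = Image.open(file_path)
  if image.width > max_size or image.height > max_size:
    image.thumbnail((max_size, max_size))  # resizes in place, keeps aspect ratio
    image.save(file_path)  # overwrites the original file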

Step 5: Create an In-Memory Vector Store

def create_vector_db(path_name):
  """
  This function creates a vector database from image files in a directory.

  Args:
      path_name (str): The path to the directory containing images.

  Returns:
      FAISS index object: The created vector database using FAISS.
  """

  # Get a list of (image_path, vector) tuples from the directory
  image_vectors = get_image_vectors_from_directory(path_name)

  # Pair each precomputed image vector with an empty placeholder string,
  # since FAISS.from_embeddings expects (text, vector) tuples
  text_embeddings = [("", item[1]) for item in image_vectors]
  # Store each image's path as metadata so results can be traced back to files
  metadatas = [{"image_path": item[0]} for item in image_vectors]

  # Build a FAISS index directly from the precomputed vectors; no embedding
  # function is needed here because queries will also supply raw vectors
  db = FAISS.from_embeddings(
      text_embeddings=text_embeddings,
      embedding=None,
      metadatas=metadatas
  )

  # Print information about the created database
  print(f"Vector Database: {db.index.ntotal} docs")

  # Return the created FAISS index object (database)
  return db

# Unzip the archive named "animals.zip" (assuming it's in the current directory)
!unzip animals.zip

# Path to the extracted animal images (replace with your actual path if needed)
path_name = "./animals"

# Create a vector database from the extracted animal images
db = create_vector_db(path_name)

Step 6: Save to Local Vector Database

The next step is to save the vector database to a local file.

# Define the filename for the vector database
db_file = "animals.vdb"

# Save the created vector database (FAISS index object) to a local file
db.save_local(db_file)

# Print a confirmation message indicating the filename where the database is saved
print(f"Vector database was saved in {db_file}")

Step 7: Query by Text

# Define the query text to search for
query = "dog"

# Get a multimodal vector representation of the query text using get_multimodal_vector
search_vector = get_multimodal_vector(input_text=query)

# Perform a similarity search in the vector database using the query vector
results = db.similarity_search_by_vector(embedding=search_vector)

# Iterate over the returned search results
for res in results:

  # Extract the image path from the result metadata
  image_path = res.metadata['image_path']

  # Open the image file in binary reading mode
  with open(image_path, "rb") as f:
    # Read the image content as bytes
    image_data = f.read()

    # Create a BytesIO object to hold the image data in memory
    img = BytesIO(image_data)

    # Open the image from the BytesIO object using Pillow library
    image = Image.open(img)

    # Display the retrieved image using Pillow's show method
    image.show()
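
By default, LangChain’s similarity_search_by_vector returns the top four matches; passing k adjusts how many results come back:

# Retrieve only the three closest images (k defaults to 4)
results = db.similarity_search_by_vector(embedding=search_vector, k=3)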

Output

[Image: the retrieved images most similar to the query “dog”]

Conclusion

This article taught us how to build a multimodal smart image search tool using Titan Embeddings, FAISS, and LangChain. This tool lets users find images using everyday language, making image searches easier and more intuitive. We covered everything step by step, from preparing images to creating search functions. Developers can use AWS Bedrock, Boto3, and free software to make strong, scalable tools that handle different kinds of data. Now, developers can create smart search tools, combining data types to improve search results and user experiences.

Key Takeaways

  1. Multimodal Data Processing: Integrating image and text processing technologies enables powerful multimodal applications that can understand and process diverse data types.
  2. Efficient Vector Search: FAISS provides efficient similarity search in high-dimensional vector spaces, making it well suited to large-scale image retrieval tasks.
  3. Cloud-based AI Services: Cloud-based AI services like AWS Bedrock simplify the development and deployment of AI-powered applications, letting developers focus on building innovative solutions.
  4. Open-source Libraries: Open-source libraries like LangChain give developers access to advanced language model functionalities that integrate seamlessly into their applications.
  5. Scalability and Flexibility: The architecture presented in this guide is scalable and flexible, suiting use cases from small-scale prototypes to large-scale production systems.

Frequently Asked Questions

Q1. Can I use this approach for other types of multimodal data, such as audio and text?

A. While this article focuses on images and text, similar approaches can be adapted for other types of multimodal data, such as audio and text. The key is to leverage appropriate models and techniques for each data modality and ensure compatibility with the chosen vector database and search algorithms.

Q2. How can I fine-tune the performance of the image search system?

A. Performance tuning can involve various strategies, including optimizing model parameters, fine-tuning embeddings, adjusting search algorithms and parameters, and optimizing infrastructure resources. Experimentation and iterative refinement are key to achieving optimal performance.

Q3. Are there any privacy or security considerations when using cloud-based AI services like AWS Bedrock?

A. When using cloud-based AI services, it’s essential to consider privacy and security implications, especially when dealing with sensitive data. Ensure compliance with relevant regulations, implement appropriate access controls and encryption mechanisms, and regularly audit and monitor the system for security vulnerabilities.

Q4. Can I deploy this image search application in a production environment?

A. Yes, the architecture presented in this article is suitable for deployment in production environments. However, before production deployment, ensure proper scalability, reliability, performance testing, and compliance with relevant operational best practices and security standards.

Q5. Are there alternative cloud platforms and services that offer similar capabilities to AWS Bedrock?

A. Yes, several alternative cloud platforms and services offer similar capabilities for AI model hosting, such as Google Cloud AI Platform, Microsoft Azure Machine Learning, and IBM Watson. Evaluate each platform’s features, pricing, and ecosystem support to determine the best fit for your requirements.
