How to Run Llama 3 Locally?

Sunil Kumar Dash 23 Apr, 2024 • 5 min read

Introduction

Discover the latest milestone in AI language models with Meta’s Llama 3 family. From advancements like an expanded vocabulary to practical implementations using open-source tools, this article dives into the technical details and benchmarks of Llama 3. Learn how to deploy and run these models locally on consumer hardware.

Learning Objectives

  • Understand the key advancements and benchmarks of the Llama 3 family of models, including their performance compared to previous iterations and other models in the field.
  • Learn how to deploy and run Llama 3 models locally using open-source tools like HuggingFace Transformers and Ollama, enabling hands-on experience with large language models.
  • Explore the technical enhancements in Llama 3, such as the increased vocabulary size and implementation of Grouped Query Attention, and understand their implications for text generation tasks.
  • Gain insights into the potential applications and future developments of Llama 3 models, including their open-source nature, multi-modal capabilities, and ongoing advancements in fine-tuning and performance.

This article was published as a part of the Data Science Blogathon.

Introduction to Llama 3

Introducing the Llama 3 family: a new era in language models. With pre-trained base and instruction-tuned chat models available in 8B and 70B sizes, it brings significant advancements. These include an expanded vocabulary, now roughly 128k tokens, which improves token-encoding efficiency and enables better multilingual text generation. Additionally, all Llama 3 models use Grouped Query Attention (GQA), which makes inference more efficient and helps the models produce coherent, extended responses.
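
As a quick sanity check, you can inspect the new tokenizer yourself. The sketch below uses the freely downloadable unsloth/llama-3-8b-Instruct-bnb-4bit checkpoint (the same one used later in this article) and should report a vocabulary of roughly 128k entries, up from about 32k in Llama 2.

from transformers import AutoTokenizer

# Any Llama 3 checkpoint ships the same tokenizer; the unsloth mirror is used
# here because it is not gated on the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3-8b-Instruct-bnb-4bit")

print(len(tokenizer))  # ~128k tokens, versus ~32k for Llama 2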

Furthermore, Meta’s rigorous training regimen, using over 15 trillion tokens of pre-training data, signifies a commitment to pushing the boundaries of natural language processing. With multi-modal models and even larger 400B+ models on the horizon, the Llama 3 series heralds a new era of AI language modeling, poised to power applications across industries.


Performance Highlights

  • Llama 3 models excel in various tasks like creative writing, coding, and brainstorming, setting new performance benchmarks.
  • The 8B Llama 3 model outperforms previous models by significant margins, nearing the performance of the Llama 2 70B model.
  • Notably, the Llama 3 70B model surpasses closed models like Gemini Pro 1.5 and Claude 3 Sonnet on several benchmarks.
  • Open-source nature allows for easy access, fine-tuning, and commercial use, with models offering liberal licensing.

Running Llama 3 Locally

With these performance numbers, Llama 3 is a strong candidate for running locally. Thanks to advances in model quantization, we can run LLMs on consumer hardware. How you run these models depends on your hardware: with enough GPU memory (~48 GB) you can comfortably run the 8B model at full precision or a 4-bit quantized 70B model, though output may be on the slower side. You may also use cloud instances for inference. Here, we will use the free-tier Colab with a 16 GB T4 GPU to run a quantized 8B model. The 4-bit quantized model requires roughly 5.7 GB of GPU memory, which fits comfortably on a T4.
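
The ~5.7 GB figure is easy to sanity-check with back-of-the-envelope arithmetic: 8 billion parameters at 4 bits each is about 4 GB of weights, with the remainder going to quantization constants, the KV cache, and activations. The estimate below is only an illustration, not an exact accounting.

# Rough VRAM estimate for a 4-bit quantized 8B model (illustrative only)
params = 8e9                 # parameter count
weight_bytes = params * 0.5  # 4 bits = 0.5 bytes per parameter -> ~4 GB
overhead = 1.5e9             # quant constants, KV cache, activations (rough guess)
print(f"{(weight_bytes + overhead) / 1e9:.1f} GB")  # ~5.5 GB, close to the observed ~5.7 GB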

There are different open-source tools for running these models locally; here we will look at two of them: HuggingFace Transformers and Ollama.

Using HuggingFace

HuggingFace has already rolled out support for Llama 3 models. We can easily pull them from the HuggingFace Hub with the Transformers library. You can load either the full-precision models or 4-bit quantized ones. Here is an example of running the quantized model on the Colab free tier.

Step 1: Install Libraries

Install the accelerate and bitsandbytes libraries and upgrade the transformers library.

!pip install -U "transformers==4.40.0"
!pip install accelerate bitsandbytes
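
Before loading the model, it is worth confirming that the Colab runtime actually has a GPU attached (Runtime > Change runtime type > T4 GPU). A quick check:

import torch

# Should print True and "Tesla T4" on the Colab free tier; if it prints False,
# switch the runtime to a GPU instance before continuing.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")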

Step 2: Load the Model

Now we will download the model and set up a text-generation pipeline.

import transformers
import torch

# 4-bit quantized Llama 3 8B Instruct checkpoint (fits in ~5.7 GB of VRAM)
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)
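
Passing the quantization settings as a plain dictionary works, but you can also make them explicit with a BitsAndBytesConfig object. A minimal sketch, assuming the same checkpoint and the imports above:

from transformers import BitsAndBytesConfig

# Explicit 4-bit quantization settings, equivalent in spirit to the dict above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "quantization_config": bnb_config,
        "low_cpu_mem_usage": True,
    },
)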

Step 3: Send Queries

Now we can send queries to the model for inference.

messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": """Generate an approximately fifteen-word sentence 
                                   that describes all this data:
                                   Midsummer House eatType restaurant; 
                                   Midsummer House food Chinese; 
                                   Midsummer House priceRange moderate; 
                                   Midsummer House customer rating 3 out of 5; 
                                   Midsummer House near All Bar One"""},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

print(outputs[0]["generated_text"][len(prompt):])

Output of the query: “Here is a 15-word sentence that summarizes the data:

Midsummer House is a moderate-priced Chinese eatery with a 3-star rating near All Bar One.”
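
A note on what is happening above: apply_chat_template wraps the messages in Llama 3’s special-token chat format, and <|eot_id|> is the token the instruct model emits at the end of each turn, which is why it is included in the terminators list alongside the regular end-of-sequence token. The rendered prompt looks roughly like this (shown for illustration, with the user message shortened):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>

Generate an approximately fifteen-word sentence that describes all this data: ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>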

Step 4: Install Gradio and Run the Code

You can wrap this pipeline in a Gradio app to get an interactive chat interface. Install Gradio (pip install gradio) and run the code below.

import gradio as gr

# Chat history in the format expected by the Llama 3 chat template
messages = []

def add_text(history, text):
    global messages
    history = history + [[text, ""]]
    messages = messages + [{"role": "user", "content": text}]
    return history, ""  # clear the textbox after submitting

def generate(history):
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    messages = messages + [{"role": "assistant", "content": response_msg}]

    # Stream the response into the chat window character by character
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )

    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot,
    )

demo.queue()
demo.launch(debug=True)

Here is a demo of the Gradio app and Llama 3 in action.

Using Ollama

Ollama is another open-source tool for running LLMs locally. To use it, first download and install the Ollama application for your operating system.
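
On macOS and Windows, Ollama ships as a regular installer from the Ollama website; on Linux, the official one-line install script is the usual route:

curl -fsSL https://ollama.com/install.sh | sh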

Step 1: Start the Local Server

Once installed, use one of the following commands to start a local session with the model you want (the weights are downloaded automatically on the first run).

ollama run llama3:instruct      # 8B instruct model

ollama run llama3:70b-instruct  # 70B instruct model

ollama run llama3               # 8B pre-trained model

ollama run llama3:70b           # 70B pre-trained model
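
If you only want to fetch the weights without opening an interactive session, ollama pull downloads them, and ollama list shows what is already on disk:

ollama pull llama3:instruct  # download without starting a chat session
ollama list                  # list locally available models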

Step 2: Query Through the API

With a model running, you can query it over Ollama’s local REST API, which listens on port 11434 by default.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Step 3: JSON Response

You will receive a JSON response.

{
  "model": "llama3",
  "created_at": "2024-04-19T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
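
Because this is a plain HTTP API, you can call it from any language. Here is a minimal Python sketch against the same endpoint, assuming the requests library is installed and the Ollama server is running locally:

import requests

# Non-streaming generation request to the local Ollama server (port 11434 by default)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,
    },
)
print(resp.json()["response"])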

Conclusion

We have explored not just the advances in language modeling that Llama 3 brings, but also practical strategies for deploying it. Running Llama 3 locally is now possible thanks to tools like HuggingFace Transformers and Ollama, which opens up a wide range of applications across industries. Looking ahead, Llama 3’s open-source nature encourages innovation and accessibility, opening the door to a time when advanced language models are within reach of developers everywhere.

Key Takeaways

  • Meta has unveiled the Llama 3 family of four models: 8B and 70B sizes, each available as a pre-trained base model and an instruction-tuned model.
  • The models have performed exceedingly well across multiple benchmarks in their respective weight categories.
  • Llama 3 uses a different tokenizer than Llama 2 with a larger vocabulary, and all the models are now equipped with Grouped Query Attention (GQA) for more efficient text generation.
  • While the models are large, quantization makes it possible to run them on consumer hardware using open-source tools like Ollama and HuggingFace Transformers.

Frequently Asked Questions

Q1. What is Llama 3?

A. Llama 3 is a family of large language models from Meta AI. It comes in two sizes, 8B and 70B, each with a pre-trained base model and an instruction-tuned model for chat applications.

Q2. Is Llama 3 open-source?

A. Yes, it is open-source. The model can be deployed commercially and further fine-tuned on custom datasets. 

Q3. Is Llama 3 multi-modal?

A. The first batch of these models is not multi-modal but Meta has confirmed the future release of multi-modal models.

Q4. Is Llama 3 better than ChatGPT?

A. The Llama 3 70B model outperforms GPT-3.5 on many benchmarks, but it still falls short of GPT-4.

Q5. What has changed in Llama 3 over Llama 2?

A. The new Llama 3 models use a different tokenizer with a larger vocabulary, which encodes text more efficiently and improves multilingual generation. All the models now use Grouped Query Attention for more efficient inference. They have also been trained on far more data, making them substantially stronger than Llama 2.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
