
How to Run LLM Locally

Posted on Sat 02 December 2023 in AI


LLM Chatbot

What is a Large Language Model (LLM)?

Large Language Models are a new class of models trained on large amounts of text data. They can generate text that is often hard to distinguish from human-written text. LLMs are typically pretrained on a large corpus of text, such as Wikipedia, and then fine-tuned for a specific task, such as summarization or translation. There are plenty of LLMs available, but some of the most popular and free to use are the Code Llama models from Meta, so we will use them in this article.

Introducing Code Llama

Code Llama is a state-of-the-art LLM, capable of generating code and natural language about code, from both code and natural language prompts. Code Llama is built on top of Llama 2 and is available in three variants:

  • Code Llama - The foundational code model
  • Code Llama - Python - Specialized for Python code
  • Code Llama Instruct - Fine-tuned for understanding natural language instructions

For each of these variants, Meta provides a small (7B), a medium (13B), and a large (34B) version.
You can find more information about Code Llama here.

In this article, our end goal is to generate code from a natural language instruction, so we will use the Code Llama Instruct model. I have a MacBook Pro with an M1 chip and 16GB of RAM, so I will use the small version of the model, and the instructions below are written for macOS on Apple Silicon (M1). The steps are similar for other models and other operating systems; you just need to choose the right version of the model and install the right dependencies.

GGUF Format

GGUF is a new format introduced by the llama.cpp team in August 2023 as a replacement for GGML. It offers numerous advantages over GGML, such as better tokenization and support for special tokens. Don't worry about the details; let's just say that we will use models in the GGUF format.
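If you are curious what a GGUF file actually looks like on disk, here is a minimal sketch that peeks at its header, assuming the layout from the llama.cpp GGUF specification (magic bytes, format version, tensor count, metadata count). It expects the model file we download later in this article:

import struct

# Peek at the GGUF header of a downloaded model file (optional, just for curiosity).
# Header layout per the llama.cpp GGUF spec: 4 magic bytes, then a little-endian
# uint32 version, a uint64 tensor count, and a uint64 metadata key/value count.
with open("models/codellama-7b-instruct.Q6_K.gguf", "rb") as f:
    magic = f.read(4)                                # should be b"GGUF"
    version, = struct.unpack("<I", f.read(4))        # format version
    tensor_count, = struct.unpack("<Q", f.read(8))   # number of tensors
    kv_count, = struct.unpack("<Q", f.read(8))       # number of metadata entries
    print(magic, version, tensor_count, kv_count)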

How to Run (Code Llama Instruct) Model Locally

As always, everything starts with the installation of the dependencies, so let's create a new virtual environment and install them.

python -m venv venv  # I'm using Python 3.10.10
source venv/bin/activate

There are several clients and libraries that support the GGUF format, and there will probably be even more in the future. In this article, we will use the ctransformers library, a Python library with GPU acceleration and LangChain support that is very easy to use.

For a MacBook with an M1 chip, install ctransformers with Metal (GPU) support using the following command:

CT_METAL=1 pip install ctransformers --no-binary ctransformers

For other operating systems, I'm guessing that you can just run:

pip install ctransformers

Note that I haven't tested this on other operating systems, so if you have any problems, please let me know in the comments.
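Either way, a quick way to sanity-check the installation is to try importing the library:

python -c "import ctransformers; print('ctransformers imported successfully')"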

We also need to download the model from Hugging Face. The huggingface-cli tool comes with the huggingface_hub package (and hf_transfer speeds up the download), so install them with pip install huggingface_hub hf_transfer, then run:

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download TheBloke/CodeLlama-7B-Instruct-GGUF codellama-7b-instruct.Q6_K.gguf --local-dir models/ --local-dir-use-symlinks False

This will download the model to the models directory.
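If you prefer to stay in Python, the same download can be done with the hf_hub_download function from the huggingface_hub library. A minimal sketch, mirroring the CLI flags above:

from huggingface_hub import hf_hub_download

# Download the quantized model file into the local models/ directory
hf_hub_download(
    repo_id="TheBloke/CodeLlama-7B-Instruct-GGUF",
    filename="codellama-7b-instruct.Q6_K.gguf",
    local_dir="models/",
)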

That's it, we are ready to generate some code from natural language instructions!
To do that, we will use the following Python code:

from ctransformers import AutoModelForCausalLM

# Load the quantized Code Llama model from the local models/ directory.
# model_type="llama" tells ctransformers which model architecture to use.
llm = AutoModelForCausalLM.from_pretrained(
    model_path_or_repo_id="models/",
    model_file="codellama-7b-instruct.Q6_K.gguf",
    model_type="llama",
)

if __name__ == "__main__":
    print(llm(prompt="Write a python function that takes a string as input and returns "
                     "the number of words in the string."))

My output looks like this, but of course, it will be different every time you run it:

# For example, if the input string is "hello world", then your function should return 2 because there are two words in the string.
def count_words(input_string):
    return len(input_string.split())

Voilà, we have generated a Python function that takes a string as input and returns the number of words in the string. Pretty cool, right?
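By default, the call blocks until the entire completion is generated. ctransformers can also stream the output token by token with the stream=True argument, which feels much more responsive for longer generations:

# Print tokens as they are generated instead of waiting for the full output
for text in llm("Write a python function that reverses a string.", stream=True):
    print(text, end="", flush=True)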

Keep in mind that this is a very simple example; the Code Llama Instruct model can do much more than this, especially if you use the medium or large version of the model.
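One tip that can noticeably improve results: the Instruct variants were fine-tuned on prompts wrapped in [INST] ... [/INST] tags (you can find the exact prompt template on the model's Hugging Face page), so it is worth wrapping your instruction in that format:

# Wrap the instruction in the template the Instruct model was fine-tuned on
prompt = ("[INST] Write a python function that takes a string as input "
          "and returns the number of words in the string. [/INST]")
print(llm(prompt))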

I hope you find this article useful.
If you would like to get more content like this directly to your inbox, consider subscribing to my newsletter.

Happy coding, and thanks for reading! 🚀🐍