How to Run Your Own Local LLM (Updated for 2024)

Written by thomascherickal | Published 2024/03/21
Tech Story Tags: llm | local-llm | run-your-own-local-llm | local-large-language-model | open-source-llm | run-a-large-language-model | huggingface-transformers | how-to-run-your-own-local-llm

TL;DR: The article provides detailed guides on using generative AI models like Hugging Face Transformers, gpt4all, Ollama, and localllm locally. Learn how to harness the power of AI for creative applications and innovative solutions.

This is the breakout year for Generative AI.

Well, to say the very least, this year I’ve been spoilt for choice as to how to run an LLM locally.

Let’s start:

1) Hugging Face Transformers

All Images Created by Bing Image Creator

To run Hugging Face Transformers offline without internet access, follow these steps. First, install Transformers with the necessary dependencies, pinning a specific version so you don't rely on automatic updates later. You can install via pip or conda as described in the Hugging Face documentation:

pip install transformers==4.x.y

Load pretrained models from your local machine after downloading them from the Hugging Face Hub while you still have internet access. Save both the model and the tokenizer using save_pretrained() (the tokenizer files are needed for the offline load below), and then load them later in offline mode.

from transformers import AutoModelForSequenceClassification, BertTokenizerFast

First time, download and save the model and tokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model.save_pretrained("/my/local/directory/bert-base-uncased")
tokenizer.save_pretrained("/my/local/directory/bert-base-uncased")

Later, load the saved model and tokenizer offline

model = AutoModelForSequenceClassification.from_pretrained("/my/local/directory/bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("/my/local/directory/bert-base-uncased")

Set the environment variables TRANSFORMERS_OFFLINE and HF_DATASETS_OFFLINE to enable offline usage.

export TRANSFORMERS_OFFLINE=1 
export HF_DATASETS_OFFLINE=1
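If you prefer to configure this from Python rather than the shell, a minimal sketch (set the variables before importing the libraries so the setting is picked up):

import os

# Equivalent to the export commands above; run this before importing transformers/datasets.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

import transformers  # now resolves models from local files and cache only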

Clone the model repository directly if you prefer not to download individual files. Hugging Face model repositories store their large weight files with Git LFS, so make sure git-lfs is installed before cloning.

git clone https://huggingface.co/<username>/<repository> /my/local/directory

Ensure that all necessary files (weights, tokenizer files, and configuration) are present in the directory where you plan to execute your scripts. Remember that setting TRANSFORMERS_OFFLINE to 1 alone won't work if the model isn't already available locally; you must either download the model with internet access and save it locally, or clone the model repository.
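Once the files are on disk and the offline variables are set, inference runs without any network access. A minimal sketch, assuming the bert-base-uncased files were saved to the directory used above (note that bert-base-uncased ships with an untrained classification head, so the predicted label is only meaningful after fine-tuning):

import torch
from transformers import AutoModelForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("/my/local/directory/bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("/my/local/directory/bert-base-uncased")
model.eval()

inputs = tokenizer("Running models locally keeps your data on your machine.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.argmax(dim=-1).item())  # index of the highest-scoring class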


2) gpt4all

gpt4all is an open-source project from Nomic AI that lets anyone download and run open LLMs (LLaMA- and Mistral-family models, among others) entirely on their own machine, on CPU, with no API key and no internet connection once a model has been downloaded. Here are step-by-step instructions for installing and using gpt4all:

  1. Installation:
  • The gpt4all Python bindings require a recent Python 3 and a few package dependencies. The easiest way to install them is via pip:
pip install gpt4all
  2. Download a model:
  • The first time you construct a model object, gpt4all downloads the model file (typically a few GB) into its local cache; after that, everything runs offline. Model names change between releases, so check the GPT4All model list for current options.
  3. Use gpt4all:
  • Here's an example prompting a locally downloaded model to summarize a passage of text:
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # example model name; pick any from the GPT4All catalog

with model.chat_session():
    summary = model.generate("Summarize this text: [insert long text here]", max_tokens=200)
print(summary)
  • Refer to the gpt4all documentation for more examples of how to use different models.

3) Ollama

Ollama is an open-source tool that makes it easy to download and run open LLMs such as Llama 2, Mistral, and Gemma locally. It bundles model weights, configuration, and a runtime behind a simple CLI and a local REST API; no API key is required because everything runs on your own machine. Here are the details on its system requirements, installation, and usage:

System Requirements:

  • macOS, Linux, or Windows (the Windows build shipped as a preview in early 2024)
  • Enough RAM and disk space for the model you choose (roughly 8 GB of RAM for 7B models, 16 GB for 13B models)

Installation:

curl -fsSL https://ollama.com/install.sh | sh

(on Linux; macOS and Windows users can download the installer from https://ollama.com)

Usage:

  1. Pull a model
ollama pull llama2
  2. Chat with it interactively from the terminal
ollama run llama2
  3. Or call the local REST API, which Ollama serves on port 11434 by default
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Hello world", "stream": false}'

The generate endpoint lets you specify the model, the prompt, and options such as temperature and context length to configure your request.

You can also create customized models from a Modelfile, and community client libraries exist for Python and JavaScript. Refer to the Ollama documentation for additional details on all available commands and endpoints.

So in summary - Ollama makes it really easy to run Llama 2, Mistral, and other open models locally with just a couple of commands once installed!
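From Python, the simplest way to talk to a running Ollama instance is its REST API. A minimal sketch, assuming the Ollama server is running locally and the llama2 model has already been pulled (swap in whichever model you downloaded):

import requests

# Ollama listens on port 11434 by default; stream=False returns one JSON object
# instead of a stream of partial tokens.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Explain in one sentence why you might run an LLM locally.",
        "stream": False,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["response"])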


4) LM Studio

LM Studio is a free desktop application (closed source, but free for personal use) that streamlines the process of discovering, downloading, and running state-of-the-art open language models locally. Here are the steps to get LM Studio running locally:

  1. Installation:
    Download the installer for Windows, macOS, or Linux from https://lmstudio.ai and run it. There is nothing else to set up: no Docker, no Python environment.

  2. Download a model:
    Use the built-in search to browse GGUF-format models hosted on Hugging Face (Llama 2, Mistral, Phi-2, and many others) and download a quantization that fits your hardware; the app indicates which files are likely to fit in your available RAM or VRAM.

  3. Running models:
    Some common ways to use it:

a. Chat with a downloaded model in the built-in chat UI
b. Adjust inference settings such as context length, temperature, and GPU offload from the sidebar
c. Start the local inference server, which exposes an OpenAI-compatible API on http://localhost:1234/v1 so existing OpenAI client code can be pointed at your local model

So in summary, LM Studio streamlines local language model experimentation: install the app, download a GGUF model, and start chatting with it or serving it over a local OpenAI-compatible API.
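Because the local server mimics the OpenAI API, you can point the official openai Python client at it. A minimal sketch, assuming the server has been started from LM Studio's Local Server tab with a model loaded (the model string below is a placeholder; LM Studio routes requests to whichever model is currently loaded):

from openai import OpenAI

# LM Studio does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # placeholder name; the loaded model handles the request
    messages=[{"role": "user", "content": "Say hello from a local LLM."}],
    max_tokens=50,
)
print(completion.choices[0].message.content)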

5) localllm

I find that this is the most convenient and simple way of all. The full explanation is given at the link below:

https://cloud.google.com/blog/products/application-development/new-localllm-lets-you-develop-gen-ai-apps-locally-without-gpus

Summarized:

localllm combined with Cloud Workstations revolutionizes AI-driven application development by letting you use LLMs locally on CPU and memory within the Google Cloud environment. By eliminating the need for GPUs, you can overcome the challenges posed by GPU scarcity and unlock the full potential of LLMs. With enhanced productivity, cost efficiency, and improved data security, localllm lets you build innovative applications with ease.

6) llama.cpp

To install and use llama.cpp (via the llama-cpp-python bindings) for local inference, follow these steps:

Install dependencies:

  • Python 3
  • CMake and a C/C++ compiler
  • Optional (for GPU acceleration): NVIDIA drivers, CUDA, and cuDNN
  • On Windows, use Visual Studio Community with the Desktop C++ workload and Python 3

Clone the repository:

git clone --recursive https://github.com/abetlen/llama-cpp-python.git

If you want to use GPU acceleration, set the environment variable as described in the repo (for example, on Linux):

export CMAKE_ARGS="-DLLAMA_CUBLAS=ON"

Install llama-cpp-python:

For a local build compiled from the cloned source:

cd llama-cpp-python
pip install -e .

Alternatively, install a release from PyPI; the [server] extra pulls in the bundled OpenAI-compatible server:

pip install llama-cpp-python[server]

Optionally, launch that high-level server interface with:

python -m llama_cpp.server --model models/7B/llama-model.gguf

To force a rebuild with GPU support on Windows:

set FORCE_CMAKE=1
set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir

and then start the server the same way, pointing --model at your model file:

python -m llama_cpp.server --model "path/to/your/model"

Download and place Llama-family models in the models/ subdirectory if they are not already present. GGUF is the current file format; quantized GGUF files for many models are available on Hugging Face, and older GGML files must be converted to GGUF before llama.cpp can load them.

Note that llama.cpp itself is primarily an inference engine. If you want your own custom model, train or fine-tune it with a separate framework according to its official documentation, then convert the resulting weights to GGUF for use here.

Test the installation:

The llama_cpp package provides a simple high-level interface in Python. Replace llama-model.gguf with the path to your downloaded model to test inference.

7) Oobabooga

Oobabooga's text-generation-webui is an open-source Gradio web UI for generating text with local LLMs. It supports several backends, including Hugging Face Transformers and llama.cpp, and has a large ecosystem of extensions.
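Getting it running is largely a matter of cloning the repository and using its one-click start script, which sets up its own Python environment on first run. A rough sketch for Linux (Windows and macOS have equivalent start_windows.bat and start_macos.sh scripts); the web UI is then served on http://localhost:7860 by default:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh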

Some excerpts from the README.md file:

text-generation-webui-extensions

This is a directory of extensions for https://github.com/oobabooga/text-generation-webui

If you create your own extension, you are welcome to submit it to this list in a PR.

long_term_memory

A sophisticated extension that creates a long term memory for bots in chat mode.

https://github.com/wawawario2/long_term_memory

AllTalk TTS

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for text-generation-webui, but it supports a variety of advanced features.

  • Custom Start-up Settings: Adjust your default start-up settings.
  • Narrator: Use different voices for the main character and narration.
  • Low VRAM mode: Great for people with small GPU memory, or if your VRAM is filled by your LLM.
  • DeepSpeed: A 3-4x performance boost when generating TTS.
  • Local/Custom models: Use any of the XTTSv2 models (API Local and XTTSv2 Local).
  • Optional wav file maintenance: Configurable deletion of old output wav files.
  • Finetuning: Train the model specifically on a voice of your choosing for better reproduction.
  • Documentation: Fully documented with a built-in webpage.
  • Console output: Clear command-line output for any warnings or issues.
  • API Suite and 3rd-party support: Can be used with third-party applications via JSON calls.
  • Can be run as a standalone app, not just inside of text-generation-webui.

https://github.com/erew123/alltalk_tts

EdgeGPT

Extension for Text Generation Webui based on EdgeGPT by acheong08, for a quick Internet access for your bot.

https://github.com/GiusTex/EdgeGPT

XTTSv2

A variant of the coqui_tts extension in the main repository. Both use the XTTSv2 model, but this one has a "narrator" feature for text written *between asterisks*.

https://github.com/kanttouchthis/text_generation_webui_xtts

Playground for Writers

This extension provides an independent advanced notebook that will be always present from the top tab. It has many features not found in the notebook:

  • Two independent Notebooks A and B that are always present, regardless of the mode
  • Inline instruct (ability to ask a question or give a task from within the text itself)
  • Select and Insert - generate text in the middle of your text
  • Perma Memory, Summarization, Paraphrasing
  • LoRA-Rama - shows LoRA checkpoints and ability to switch between them
  • LoRA scaling (experimental) - adjust LoRA impact using a slider

https://github.com/FartyPants/Playground

And there is so much more to explore. Check out:

https://github.com/oobabooga/text-generation-webui-extensions

Conclusion?

And there’s more! So much more! LangChain, llm, Ollama, the list just keeps getting bigger and bigger!

Here’s to a glorious year of beautiful creativity. Cheers!


Written by thomascherickal | Multi-domain specialist and independent research scientist: https://thomascherickal.com & https://thomascherickal.net
Published by HackerNoon on 2024/03/21