GPT-Like LLM With No GPU? Yes! Legislation Analysis With LLMWare

Written by shanglun | Published 2023/11/10

TL;DR: Miniaturization is changing the LLM space by overcoming barriers to adoption. Today, we test LLMWare's miniaturized models to build a legal text analyzer that runs on a laptop without any specialty hardware.

Introduction

The Race to Commercialize LLMs

Nearly one year after ChatGPT took the world by storm, LLMs have entered a new phase of development. Where the months immediately following the release were characterized by a rush to build and release foundational technologies, the focus of development is now on commercialization.

Organizations are no longer rushing to incorporate AI into their processes, but rather taking a step back and asking themselves: what can AI do for us that actually improves on what we have now?

Barriers to Commercialization

Often, that question is a surprisingly difficult one to answer. This is because, when evaluated from a commercial standpoint, LLMs still have many obstacles to general adoption:

  1. GPU Availability: Whether the cost is billed directly, as with AWS SageMaker, or packaged into a per-token price, as with OpenAI, GPUs are ultimately a scarce commodity, and compute-hungry LLMs can be quite expensive to run. Costs from running exceptionally large models like GPT-4, for example, can add up quickly, especially when you’re embedding large contexts.

  2. Scalability: Related to the GPU cost is the lack of scalability. Most enterprise use cases require repeating a process thousands, or even millions of times. It is very difficult for organizations to deploy a new process that can cost millions of dollars, especially if there is already an adequate existing solution in place.

  3. Data Security: Since most organizations lack the MLOps knowledge or the resources to run their own GPU servers, most LLM queries are sent off-site for processing. This creates an unacceptable security concern for many firms. Services like OpenAI Enterprise alleviate some of these concerns, but it is only a matter of time before a breach of some sort shatters trust in such services.

AI Miniaturization

With this backdrop, a new wave of AI pioneers has begun exploring the idea of miniaturizing LLMs. Where GPT-4 boosted performance by increasing the parameter count roughly 10-fold over GPT-3.5, miniaturization pioneers try to keep performance constant while reducing the parameter count 50- to 100-fold.

If successful, a miniaturized LLM has the potential to overcome every adoption obstacle at once. By being small, the LLM can be run without a GPU and consequently becomes much more scalable. A sufficiently small model can also be run on internal infrastructure, allowing firms better control over their data.
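
To put rough numbers on that claim, a model’s memory footprint is approximately its parameter count times the bytes per parameter. A quick back-of-the-envelope calculation (my own arithmetic, not a vendor figure) shows why a 1-billion-parameter model fits comfortably in laptop RAM while a GPT-3-class model does not:

def approx_model_memory_gb(n_params, bytes_per_param=4):
    # Weights only, ignoring activations and framework overhead
    return n_params * bytes_per_param / 1024**3

print(f"{approx_model_memory_gb(1e9):.1f} GB")    # ~3.7 GB in fp32: fine for a laptop
print(f"{approx_model_memory_gb(175e9):.1f} GB")  # ~651.9 GB: GPU-cluster territory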

In today’s article, we will explore several newly released technologies from LLMWare, a company that is revolutionizing the AI space by developing open-source infrastructure for LLM applications. Specifically, we will solve a real-world business problem using LLMWare’s open-source software and miniature models to demonstrate the framework’s capabilities.

What’s more, all computation in today’s article is done on an M1 Max MacBook Pro with 32 GB of RAM, and at no point will data leave the laptop for processing. This setup is fairly powerful for a laptop, but it is relatively easy to replicate with cloud infrastructure, so it should serve as a good proof of concept for the promise of AI miniaturization.

The use case we will explore today will be based on public data, but as is the case with all of my articles, the techniques can be cross-applied to a host of domains with proprietary data, including science, medicine, and finance.

Without further ado, let’s dive into today’s use case.

Business Problem

In today’s example, we will build an LLM-powered legislation analyzer. Often, in the course of financial or legal research, one has to search through large volumes of legal and legislative text looking for specific ideas.

While many analysts currently do this with pure text search (also known as Ctrl-F in sophisticated circles), this can be time-consuming, since one has to work through many related terms and read quite a lot of text around each hit to determine whether the idea is relevant.

Some legislative documents are also not text-searchable at all (scanned PDFs, for example), introducing another complication.

We will solve this problem with an AI-powered legal analyzer. This analyzer will allow us to quickly find the relevant bits of text using a technique called semantic search, which can find not only the exact phrase, but also sections that are semantically similar to the search term.

The result of the search is then fed to an LLM for further summarization and analysis.

What’s more, the application will be able to do this in under a minute, representing a dramatic improvement over manual text search.

To test our application, we will search for and summarize the definition of “qualified opportunity zone partnership interest”, which is a term defined in the Tax Cuts and Jobs Act of 2017.

The application will have access to the full text of the Tax Cuts and Jobs Act, but to make the search portion of this exercise more challenging, we will increase the search space by providing it with the text of several other lengthy pieces of legislation.

The application will not know that the definition is in the Tax Cuts and Jobs Act and, therefore, will have to search through the entire corpus to complete the query. For today’s example, we have downloaded PDF versions of the Dodd-Frank Act, the Inflation Reduction Act, and the National Environmental Policy Act in addition to the Tax Cuts and Jobs Act.

Now that we understand the problem we’re trying to solve, let’s take a look at the technologies we will use to build this application.

Technologies Used

LLMWare

LLMWare is an open-source framework that makes it incredibly easy to build powerful, extensible RAG applications. There are several frameworks on the market, such as LangChain and LlamaIndex, that aim to solve similar problems, but several factors set LLMWare apart:

  1. First-class support for local LLM applications: Most LLM frameworks are built around high-powered models like OpenAI’s GPT models, and therefore, end up being specialized frameworks for orchestrating API calls. LLMWare, on the other hand, makes it easy to run your entire LLM infrastructure locally, which makes these types of experiments much easier.

  2. Exceptional extensibility: LLMWare’s abstractions are composable and make use of Python’s built-in data structures. This makes it really easy to pick and choose components you want to use and add additional processing without having to delve deep into the framework code.

  3. A large number of useful utilities with intuitive abstractions: Beyond the orchestration and prompt engineering utilities, LLMWare comes with its own PDF parser, OCR engine, legal metadata processing engine, and data storage utilities. These utilities are also managed in an intuitive way, allowing a user to leverage them with very few lines of code.

  4. Integrations with LLMWare’s embedding and language models: LLMWare makes it really easy to download and run their embedding and language models, which are optimized for legal text corpora.

Retrieval-Augmented Generation

Retrieval-Augmented Generation is the centerpiece of this application, since the pattern allows us to implement several key design features that help a small language model perform better.

Whereas many vanilla RAG implementations simply cut the source data into large chunks and use a general-purpose embedding model like OpenAI’s ada, we apply the following optimizations:

  1. Specialized semantic search: Instead of using a general-purpose embedding model, we use a specialized one, in this case, the “industry-bert-contracts” model. This model is trained specifically on legal text, which increases the accuracy of the search.

  2. Prompt compression: We cut the source data into smaller chunks so that we have finer control over the length of the prompt. One of the most glaring limitations of small language models is their inability to comprehend long prompts, so it is important that we embed only the most relevant portions of the source material in the prompt as context (see the sketch after this list).

  3. Context maximization: LLMWare’s parsing libraries are optimized for legal content and, as such, can intelligently chunk the source data. This allows us to maximize the amount of relevant content in the prompt, greatly boosting the quality of the output.
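
To make the prompt compression idea concrete, here is a minimal sketch of the kind of preprocessing involved. This is illustrative logic of my own, not an LLMWare API; the token budget and the whitespace tokenizer are placeholder assumptions.

def build_context(ranked_chunks, max_tokens=400):
    # Hypothetical sketch: pack the highest-ranked chunks into the context
    # until a token budget is exhausted (chunks arrive sorted by relevance)
    def count_tokens(text):
        return len(text.split())  # crude whitespace proxy for a real tokenizer

    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return "\n".join(selected)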

BLING

BLING is a collection of LLMWare’s open-source, miniaturized, CPU-powered language models. These are based on open-source LLMs ranging from 1 billion to 3 billion parameters. Despite their exceptionally small size, they are trained specifically for legal-text summarization and extraction and can perform satisfactorily when prompts are constructed to fit this use case.

As great as these models are, they are not magical, and the performance will degrade if we push the models too hard. Therefore, care needs to be taken in the preceding steps in order to maintain the quality of the final output.

Now that we understand the technology behind our application, let’s get coding!

Tutorial

Infrastructure Setup

The LLMWare framework is very easy to set up. Let’s first install the Python libraries. To run LLMWare, you only need to install two packages, and the dependencies will be installed automatically. Simply run:

pip install llmware transformers

Please note that LLMWare, as of the time of this writing, does not support Python 3.12, so if you run into installation issues, please create a virtual environment based on Python 3.10.
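
For example, with a Python 3.10 interpreter available, you can create and activate a clean environment before installing (standard venv commands, nothing LLMWare-specific):

python3.10 -m venv llmware-env
source llmware-env/bin/activate
pip install llmware transformers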

Now that the Python libraries are installed, let’s install the other infrastructure pieces, like the Milvus vector database and the MongoDB document database. LLMWare has made this extremely easy, so the only piece of software you need to install is Docker.

On macOS, you will want to download and install Docker Desktop; on other operating systems, follow the prevailing best practices. After installation, you should be able to use docker from your command line.

Once you confirm that Docker is installed, you can run the docker-compose file from LLMWare, and everything will be set up for you. Simply run the following:

curl -o docker-compose.yaml https://raw.githubusercontent.com/llmware-ai/llmware/main/docker-compose.yaml
docker compose up -d

If Docker was installed properly, you should see the containers spinning up. Now, everything should be set up. To test your setup, you can run a truncated version of the “quickstart” script from LLMWare’s GitHub page.

import os
from llmware.library import Library
from llmware.retrieval import Query
from llmware.prompts import Prompt
from llmware.setup import Setup

library = Library().create_new_library("Agreements")
sample_files_path = Setup().load_sample_files()
library.add_files(os.path.join(sample_files_path,"Agreements"))

library.install_new_embedding(embedding_model_name="industry-bert-contracts", vector_db="milvus")

os.environ["TOKENIZERS_PARALLELISM"] = "false" # Avoid a HuggingFace tokenizer warning
query_results = Query(library).semantic_query("Termination", result_count=20)

print(query_results)

If you see a bunch of query results coming back from LLMWare’s semantic query, your machine is fully set up for building the application!

Indexing and Creating a Vector Database of Legislation

The first thing we want to do is to create a vector database over the text of the legislation we’re analyzing today. This will help us quickly perform semantic queries down the line. If we were doing this without LLMWare, we would have to perform OCR on the PDF documents, extract the metadata, chunk the corpus, and run an embedding model on the resulting chunks.

Thankfully, with LLMWare, this is all very simple. Simply dump the PDFs of the legislation into a folder and run:

import os
import time
from llmware.library import Library
from llmware.retrieval import Query
from llmware.prompts import Prompt
from llmware.setup import Setup


library = Library().create_new_library("Tax_Cuts_Jobs_Act")
library.add_files("/path/to/your/legislation/folder")  # replace with the folder containing your PDFs
library.install_new_embedding(embedding_model_name="industry-bert-contracts", vector_db="milvus")

If this completes without issue, then you have created an embedding vector database! If you ran this in a Jupyter Notebook, then you may also see an output indicating that thousands of embeddings were created.

Now, we can perform a semantic query on the vector database. For the purposes of this application, we will query using the question “What is a qualified opportunity zone partnership interest?” To do this, run:

query_results = Query(library).semantic_query("What is a qualified opportunity zone partnership interest?", result_count=2)

The query should take a couple of seconds at most. Print the results, and you should see the two most relevant sections retrieved by the semantic search. Inspect the metadata, and you should see that the search found the relevant sections within the Tax Cuts and Jobs Act.
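
Each query result is a plain Python dictionary. The exact metadata keys can vary by LLMWare version, so treat the field names below (other than text, which we rely on later) as assumptions to verify against your own output:

for qr in query_results:
    print(sorted(qr.keys()))                          # see which metadata fields your version returns
    print(qr.get("file_source"), qr.get("page_num"))  # assumed keys for source file and page number
    print(qr["text"][:300], "...")                    # first 300 characters of the retrieved passage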

Now that we have retrieved the source data, let’s feed it to LLMWare’s BLING models and see how they perform!

Downloading and Running the LLMWare BLING models

Because the LLMWare BLING models are hosted on Hugging Face, it’s actually very easy to download and run them. However, since BLING models require a specific structure for their prompts, it is easiest to manage downloading and running them through LLMWare’s Prompt object.

To run the models, we theoretically only need to load and run the model like so:

prompter = Prompt().load_model(model_name, from_hf=True, api_key="")
output = prompter.prompt_main(query, context=embedded_text,
                              prompt_name="default_with_context", temperature=0.0)

However, to get the best results, we will need to do some pre-processing on the semantic query results. Also, since we want to evaluate all of LLMWare’s BLING models, we will enumerate the models and print the results.

To do this, we will add some wrappers to the prompt calls like so:

# Concatenate the retrieved passages into a single context block, replacing
# doubled single-quote artifacts from parsing with newlines
embedded_text = ''
for qr in query_results:
    embedded_text += '\n'.join(qr['text'].split("\'\'"))


# check all of the models for performance
model_list = ["llmware/bling-1b-0.1",
              "llmware/bling-1.4b-0.1",
              "llmware/bling-falcon-1b-0.1",
              "llmware/bling-cerebras-1.3b-0.1",
              "llmware/bling-sheared-llama-1.3b-0.1",
              "llmware/bling-sheared-llama-2.7b-0.1",
              "llmware/bling-red-pajamas-3b-0.1",
              "llmware/bling-stable-lm-3b-4e1t-v0",
              ]


# adapted from the BLING demo
query = "What is the definition of qualified opportunity zone partnership interest?"
for model_name in model_list:
    t0 = time.time()
    print(f"\n > Loading Model: {model_name}...")
    prompter = Prompt().load_model(model_name, from_hf=True, api_key="")

    t1 = time.time()
    print(f"\n > Model {model_name} load time: {t1-t0} seconds")

    print(f"Query: {query}")
    output = prompter.prompt_main(query, context=embedded_text,
                                  prompt_name="default_with_context", temperature=0.0)

    llm_response = output["llm_response"].strip("\n")
    print(f"LLM Response: {llm_response}")
    print(f"LLM Usage: {output['usage']}")

    t2 = time.time()
    print(f"\nTotal processing time: {t2-t1} seconds")

Run this, and watch the results print to the console. When I ran it, I got the following results:

> Loading Model: llmware/bling-1b-0.1...

 > Model llmware/bling-1b-0.1 load time: 10.456622123718262 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response:     (i) In general, qualified opportunity zone partnership interest means any capital or profits interest in a domestic partnership if-  owned by the qualified opportunity  fund solely in exchange for cash, and is acquired by the qualified opportunity  fund after December 31, 2017.
LLM Usage: {'input': 611, 'output': 54, 'total': 665, 'metric': 'tokens', 'processing_time': 9.11120319366455}

Total processing time: 9.115610837936401 seconds

 > Loading Model: llmware/bling-1.4b-0.1...

 > Model llmware/bling-1.4b-0.1 load time: 6.793190956115723 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response:   Qualified opportunity zone partnership interest is any capital or profits interest in a domestic partnership if the partnership is acquired by the qualified opportunity fund after December 31, 2017, from the partner-ship solely in exchange for cash.
LLM Usage: {'input': 611, 'output': 48, 'total': 659, 'metric': 'tokens', 'processing_time': 12.356989860534668}

Total processing time: 12.36338210105896 seconds

 > Loading Model: llmware/bling-falcon-1b-0.1...

 > Model llmware/bling-falcon-1b-0.1 load time: 6.4988627433776855 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: •Section 1202(c)(3) definition.
LLM Usage: {'input': 645, 'output': 13, 'total': 658, 'metric': 'tokens', 'processing_time': 5.218444108963013}

Total processing time: 5.2250282764434814 seconds

 > Loading Model: llmware/bling-cerebras-1.3b-0.1...

 > Model llmware/bling-cerebras-1.3b-0.1 load time: 15.7601318359375 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: (B) QUALIFIED OPPORTUNITY ZONE STOCK
LLM Usage: {'input': 645, 'output': 17, 'total': 662, 'metric': 'tokens', 'processing_time': 5.469969987869263}

Total processing time: 5.475223064422607 seconds

 > Loading Model: llmware/bling-sheared-llama-1.3b-0.1...

 > Model llmware/bling-sheared-llama-1.3b-0.1 load time: 5.345098972320557 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: (I) such partnership was a qualified opportunity zone busi-  ness (or, in the case of a new partnership, such partner-  ship was being organized for purposes of being a qualified opportunity zone busi-  ness) and
during substantially all of the qualified opportunity fund's holding period for such interest, such partnership qualified as a qualified opportunity zone busi-  ness.
LLM Usage: {'input': 681, 'output': 95, 'total': 776, 'metric': 'tokens', 'processing_time': 23.642255067825317}

Total processing time: 23.648069858551025 seconds

 > Loading Model: llmware/bling-sheared-llama-2.7b-0.1...

 > Model llmware/bling-sheared-llama-2.7b-0.1 load time: 15.780867099761963 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: • Any capital or profits interest in a domestic partnership if:
•Such interest is acquired by the qualified opportunity fund after December 31, 2017, from the partnership solely in exchange for cash
•As of the time such interest was acquired, such partnership was a qualified opportunity zone business (or, in the case of a new partnership, such partnership was being organized for purposes of being a qualified opportunity zone business), and
•During substantially all of the qualified opportunity fund's holding period for such interest, such partnership qualified as a qualified opportunity zone business.
LLM Usage: {'input': 681, 'output': 136, 'total': 817, 'metric': 'tokens', 'processing_time': 63.21877098083496}

Total processing time: 63.23019576072693 seconds

 > Loading Model: llmware/bling-red-pajamas-3b-0.1...

 > Model llmware/bling-red-pajamas-3b-0.1 load time: 17.25023102760315 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: 1.   Qualified opportunity zone business property is tangible property used in a trade or business of the qualified opportunity fund if:
1.1.  The original use of the property started with the qualified opportunity fund;
1.2.  The property was substantially improved by the qualified opportunity fund; and
1.3.  The property is qualified opportunity zone stock.
LLM Usage: {'input': 611, 'output': 77, 'total': 688, 'metric': 'tokens', 'processing_time': 38.12269997596741}

Total processing time: 38.1325581073761 seconds

 > Loading Model: llmware/bling-stable-lm-3b-4e1t-v0...

 > Model llmware/bling-stable-lm-3b-4e1t-v0 load time: 23.456523895263672 seconds
Query: What is the definition of qualified opportunity zone partnership interest?
LLM Response: "Qualified opportunity zone partnership interest" means any capital or profits interest in a domestic partnership if-
LLM Usage: {'input': 611, 'output': 21, 'total': 632, 'metric': 'tokens', 'processing_time': 14.21246600151062}

Total processing time: 14.222395181655884 seconds

The process should only take a few minutes. Now that we have some outputs from the model, let’s evaluate them!

Evaluating the Output vs. GPT-3.5

If you read the output closely, you will see that not every model produced good results, and this is to be expected. Some of the models, especially the smallest ones, struggle with the length of the input and produce inaccurate or nonsensical outputs.

Other models were confused by the formatting of the source text and produced relatively sensible, but strangely formatted outputs.

However, a couple of models, especially the Sheared LLaMA models, performed quite well at summarizing the context and answering the question. Compare their output to the answer from GPT-3.5, and you will see that it holds up quite well. For me, the GPT-3.5 response (through the API, not the chat interface) was:

A qualified opportunity zone partnership interest refers to an investment in a partnership that operates within a designated opportunity zone. An opportunity zone is a specific geographic area in the United States that has been identified as economically distressed and in need of investment. The partnership interest must meet certain criteria and be held for a specific period of time in order to qualify for certain tax benefits, such as deferral or reduction of capital gains taxes.

That is a much less informative answer than what we got from Sheared LLaMA, but Sheared LLaMA had the benefit of an embedded context. If we embed the retrieved text into the GPT-3.5 query, we get something like the following:

The definition of qualified opportunity zone partnership interest is any capital or profits interest in a domestic partnership that meets the following criteria:
(i) The interest is acquired by the qualified opportunity fund after December 31, 2017, from the partnership solely in exchange for cash.
(ii) At the time of acquisition, the partnership was a qualified opportunity zone business or was being organized for the purpose of being a qualified opportunity zone business.
(iii) During substantially all of the qualified opportunity fund's holding period for the interest, the partnership qualified as a qualified opportunity zone business.

Which, as you can see, is quite comparable to the answer given by the 2.7-billion-parameter Sheared LLaMA. Of course, this is a great win for Sheared LLaMA, since it is roughly 1/50th the size of GPT-3.5 and can be run on a laptop!
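
If you would like to reproduce the GPT-3.5 comparison yourself, here is a minimal sketch. It assumes the pre-1.0 openai package that was current at the time of writing, and the prompt wording is my own; embedded_text is the same retrieved context we fed to the BLING models:

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: use your own key

# Mirror the local setup by embedding the semantically retrieved text as context
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{
        "role": "user",
        "content": f"Using only the context below, answer the question.\n\n"
                   f"Context:\n{embedded_text}\n\n"
                   f"Question: What is the definition of qualified "
                   f"opportunity zone partnership interest?",
    }],
)
print(response["choices"][0]["message"]["content"])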

With that, we have built a legislation analyzer using LLMWare and BLING!

Conclusion

In today’s example, we built a specialized implementation of the RAG architecture using LLMWare’s open-source framework and open-source models. By intelligently parsing the data and applying some preprocessing, we were able to achieve very competitive performance using only a laptop.

Miniaturization represents a sea change in the race to commercialize LLM technology. With LLMWare’s local-first approach, firms can build massively scalable, specialized LLM applications that solve real-world problems without having to contend with the constraints of GPU availability and compute cost.

Legislation analysis is just one of the many business problems that can be solved with LLMs; there are strong analogies between legislation search and scientific bibliography search, financial analysis, and medical records search.

If you would like to explore these or any other use cases, please do not hesitate to get in touch on LinkedIn or GitHub. You can find LLMWare’s RAG framework on GitHub and the BLING models, embedding models, and testing datasets on Hugging Face.

By the way, if you are interested in using this particular LLM technology but don’t want to build your own application, we are building a GUI wrapper. If you would like to be notified about this product, please get in touch as well.

