PrivateGPT: ChatGPT but Private and Compliant

Privacy is a top concern when discussing ChatGPT-like tools with professionals.

Data Privacy:
- Querying and transmitting data to external servers may compromise data privacy.
- Sensitive information transmitted over the internet can be exposed to unauthorized access.
Data Security:
- Sending data to external servers increases the risk of unauthorized access.
- Storing sensitive information outside the user's controlled environment can be problematic.
Regulatory Compliance:
- Compliance requirements for data handling and storage vary by industry and region.
- Using ChatGPT may raise concerns about compliance with these regulations.
Control over Data:
- Users have limited control over how their data is processed and stored with ChatGPT.
- Ensuring compliance with data handling policies or organizational requirements can take time and effort.
Legal and Ethical Considerations:
- Jurisdiction-specific legal and ethical considerations may apply to using language models.
- Specific AI models may require user consent or compliance with regulations for interactions or decision-making.

On May 2nd, 2023, Iván Martínez showed how open-source models and tools like LangChain enable fully local execution, ensuring your data never leaves your environment.

privateGPT enables local execution of language models, keeping data within the user's environment.
This enhances data privacy, security, and control, and reduces the compliance risks associated with traditional ChatGPT-like tools.

privateGPT allows you to interact with large language models (LLMs) without requiring an internet connection.

With privateGPT, you can work with your documents by asking questions and receiving answers using the capabilities of these language models.

Privacy is the feature the project emphasizes most.

privateGPT ensures that none of your data leaves the environment in which it is executed.

This means that all interactions, including document ingestion and question-answering, occur within your execution environment without transmitting data over the Internet.

This focus on privacy aims to protect your sensitive information and maintain confidentiality.

To summarize, privateGPT enables offline usage of language models, allowing you to interact with your documents by asking questions and obtaining responses. It emphasizes data privacy by ensuring that your data remains within your execution environment and is not transmitted over the Internet.

System Requirements:

Python Version
- To use this software, you must have Python 3.10 or later installed. The project will not build with earlier versions of Python.
C++ Compiler
- If you encounter an error while building a wheel during the pip install process, you may need to install a C++ compiler on your computer.
- For Windows 10/11, follow these steps to install a C++ compiler:
  - Install Visual Studio 2022.
  - Make sure the following components are selected:
    - Universal Windows Platform development
    - C++ CMake tools for Windows
  - Download the MinGW installer from the MinGW website.
  - Run the installer and select the "gcc" component.
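
Before installing anything, it can help to confirm that the interpreter meets the stated minimum. The snippet below is a minimal sketch of such a check; the function name `python_is_supported` is made up for this illustration and is not part of privateGPT.

```python
import sys

# Minimum version stated in the requirements above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = sys.version.split()[0]
    if python_is_supported():
        print("Python version OK:", found)
    else:
        print("Python 3.10 or later is required; found", found)
```

Running it with the same interpreter you plan to use for `pip install` avoids surprises when several Python versions are installed side by side.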

Environment Setup

To set up your environment and run the code provided, follow the steps below:

Step 1: Install Requirements

Make sure you have Python and pip installed on your system. Then, open a terminal or command prompt and navigate to the project directory. Run the following command to install the required dependencies:

pip install -r requirements.txt

This command will automatically install all the necessary packages and libraries for running the code.

Step 2: Download the Models

Next, you need to download the two models required for the code. These models are the LLM (GPT4All-J compatible model) and the Embedding model. Follow the instructions below to download and place them in a directory of your choice:

LLM Model: Download the LLM model compatible with GPT4All-J. The default model is named "ggml-gpt4all-j-v1.3-groovy.bin". If you prefer a different GPT4All-J compatible model, you can download it from a reliable source. Once downloaded, place the model file in a directory of your choice.

Embedding Model: Download the Embedding model compatible with the code. The default model is named "ggml-model-q4_0.bin". If you prefer a different compatible Embeddings model, download it and save it in the same directory as the LLM model.

Step 3: Rename and Edit the Configuration File

In the project directory, locate the file named "example.env". Rename this file to ".env" (with a dot at the beginning). This file contains the configuration variables that need to be set appropriately.

Open the ".env" file in a text editor and modify the following variables according to your setup:

- MODEL_TYPE: Set this variable to either "LlamaCpp" or "GPT4All", depending on the type of model you want to use.
- PERSIST_DIRECTORY: Specify the folder where you want your vectorstore to be saved. Provide the absolute path to the desired directory.
- LLAMA_EMBEDDINGS_MODEL: Set this variable to the absolute path of your LlamaCpp-supported embeddings model binary. Make sure to specify the full path, including the filename and extension.
- MODEL_PATH: Set this variable to the path of your GPT4All or LlamaCpp-supported LLM model. Again, provide the full path including the filename and extension.
- MODEL_N_CTX: Specify the maximum token limit for both the embeddings and LLM models. This determines the length of the input text that the models can handle. Set an appropriate value based on your requirements.

Note: When specifying the LLAMA embeddings model path in the LLAMA_EMBEDDINGS_MODEL variable, make sure to provide the absolute path. Home directory shortcuts like "~/path/to/model" or "$HOME/path/to/model" will not work.

Save the ".env" file after making the necessary changes.
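
Mistakes in these variables tend to surface only when a model fails to load, so a quick sanity check can save time. The sketch below takes the settings as a plain dictionary and reports anything that contradicts the rules described above. The function `find_config_problems` and the specific checks are assumptions made for this illustration; privateGPT itself does not ship such a validator.

```python
import os

# Variable names come from the list above; the checks are illustrative only.
MODEL_TYPES = ("LlamaCpp", "GPT4All")
ABSOLUTE_PATH_VARIABLES = ("PERSIST_DIRECTORY", "LLAMA_EMBEDDINGS_MODEL")


def find_config_problems(config):
    """Return a list of human-readable problems found in a settings mapping."""
    problems = []
    if config.get("MODEL_TYPE") not in MODEL_TYPES:
        problems.append("MODEL_TYPE must be 'LlamaCpp' or 'GPT4All'")
    for name in ABSOLUTE_PATH_VARIABLES:
        value = config.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif value.startswith(("~", "$")):
            # Home directory shortcuts are not expanded, as the note above says.
            problems.append(f"{name} uses a home directory shortcut: {value}")
        elif not os.path.isabs(value):
            problems.append(f"{name} is not an absolute path: {value}")
    if not config.get("MODEL_PATH"):
        problems.append("MODEL_PATH is not set")
    try:
        n_ctx_ok = int(config.get("MODEL_N_CTX", "")) > 0
    except (TypeError, ValueError):
        n_ctx_ok = False
    if not n_ctx_ok:
        problems.append("MODEL_N_CTX must be a positive integer")
    return problems
```

An empty list means the settings pass these basic checks. It does not confirm that the model files exist or are compatible.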

Step 4: Run the Code

Once you have completed the steps above, you are ready to run the code with your customized environment settings. Run the scripts described in the following sections (ingest.py, then privateGPT.py) from a terminal or from an integrated development environment (IDE).

Ensure that your code reads the environment variables from the ".env" file at the beginning of the execution. This will ensure that the correct model paths and settings are used.
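
To make concrete what "reading the environment variables from the .env file" involves, here is a minimal standard-library sketch. In practice a library such as python-dotenv (its `load_dotenv` function) is a common choice for this step. The two helper names below are invented for this illustration, and the parser handles only simple KEY=VALUE lines.

```python
import os


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path=".env", environ=os.environ):
    """Read a .env file and copy its values into environ without overwriting."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        environ.setdefault(key, value)
    return values
```

Calling `load_env_file()` at the top of a script makes the settings available through `os.environ`. Variables that are already set in the shell take precedence because `setdefault` never overwrites an existing key.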

Ingesting Your Own Dataset

To ingest your own dataset into the local vectorstore using the provided code, follow the instructions below:

Step 1: Prepare Your Dataset

Gather all the documents you want to ingest into the vectorstore. The supported file formats are .txt, .pdf, and .csv. Ensure that your documents are properly formatted and contain the relevant text data.

Step 2: Organize Your Files

Create a directory named "source_documents" in the same directory as the code files. This directory will hold all your dataset files. Place all the .txt, .pdf, or .csv files into the "source_documents" directory.
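
Before starting a long ingestion run, it is worth checking that the folder contains only files in the supported formats. The sketch below lists the top-level files of "source_documents" and separates them by extension. The helper `split_by_support` is hypothetical and only mirrors the format list from Step 1; it is not how ingest.py itself discovers files.

```python
from pathlib import Path

# Extensions listed as supported in Step 1.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".csv"}


def split_by_support(folder="source_documents"):
    """Return (supported, unsupported) lists of files directly inside folder."""
    supported, unsupported = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

Printing the second list before ingestion shows which files would need converting to one of the supported formats first.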

Step 3: Run the Ingestion Command

Open a terminal or command prompt and navigate to the directory where the code files are located. Use the following command to run the ingestion process:

python ingest.py

This command starts the ingestion process and begins processing your dataset files. Depending on the size of your documents, it may take some time to complete, so allow the command to finish.

Step 4: Wait for Completion

The ingestion process will create a folder named "db" in the same directory as the code files. This folder will contain the local vectorstore, where your ingested documents' embeddings are stored. The time taken for the ingestion process depends on the size of your documents and the processing power of your system.

During the ingestion process, no data leaves your local environment. The ingestion is performed entirely on your machine, and you can even perform it without an internet connection.

Step 5: Use the Ingested Data

Once the ingestion process is complete, you can start using the ingested data for various tasks such as similarity search or text analysis. The local vectorstore contains the accumulated embeddings of all the ingested documents, allowing you to perform operations on your dataset efficiently.

Note: If you want to start with an empty database and remove all previously ingested data, simply delete the "db" folder before running the ingestion command again.
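
Deleting the folder by hand works fine. If you reset the database often, the same step can be scripted. This is a small sketch that assumes the vectorstore lives in a folder named "db" as described above; the function name `reset_vectorstore` is made up for the example.

```python
import shutil
from pathlib import Path


def reset_vectorstore(db_dir="db"):
    """Delete the vectorstore folder if present; return True if it was removed."""
    path = Path(db_dir)
    if not path.is_dir():
        return False  # nothing to remove
    shutil.rmtree(path)
    return True
```

Note that `shutil.rmtree` removes the folder and everything in it without confirmation, so double-check the path if you point `db_dir` somewhere else.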

Locally Querying Your Documents

To ask questions of your documents locally, follow these steps:

Run the command:

python privateGPT.py

Enter your query when prompted and press Enter. Wait for the script to process the query and generate an answer (approximately 20-30 seconds).
The script will display the generated answer and the four context sources used from your documents. You can ask more questions without re-running the script.
To exit the script, type "exit" when prompted for a query.
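
The interaction described above is a simple prompt loop. The sketch below shows the general shape of such a loop. It is not the code of privateGPT.py: `run_query_loop` and the `answer_fn` callback, which stands in for the real model call and returns an answer plus its sources, are assumptions made for this illustration.

```python
def run_query_loop(answer_fn, read=input, write=print):
    """Prompt for queries until the user types 'exit'; return how many were answered."""
    answered = 0
    while True:
        query = read("\nEnter a query: ").strip()
        if query == "exit":
            break
        if not query:
            continue  # ignore empty input and prompt again
        answer, sources = answer_fn(query)
        write(answer)
        for source in sources:
            write(f"  source: {source}")
        answered += 1
    return answered
```

Passing `read` and `write` as parameters keeps the loop easy to try without a terminal, for example by feeding it a prepared list of questions.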

Note: The script works offline, and no data leaves your local environment.

How does it work?

The process involves selecting appropriate local models and utilizing the power of LangChain. By doing so, you can perform the entire pipeline within your own environment, without any data leaving it, while maintaining reasonable performance.

Firstly, the "ingest.py" script utilizes LangChain tools to analyze the document and generate embeddings (representations) of the text.

This is done locally using LlamaCppEmbeddings. The resulting embeddings are then stored in a local vector database using the Chroma vector store.

Next, the "privateGPT.py" script uses a local language model (LLM) based on either GPT4All-J or LlamaCpp. It uses this model to comprehend questions and generate answers.

To provide context for the answers, the script extracts relevant information from the local vector database. It achieves this by performing a similarity search, which helps locate the appropriate piece of context from the stored documents.
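
To give a feel for what a similarity search does, here is a toy version that ranks stored text chunks by cosine similarity to a query vector. It is a conceptual sketch only: the real lookup is handled by the vector database, the in-memory list used as a store is an assumption of the example, and the default of four results simply mirrors the four context sources mentioned earlier.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, store, k=4):
    """Return the texts of the k chunks most similar to the query.

    store is a list of (chunk_text, embedding_vector) pairs.
    """
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

The chunks returned by such a search are what the language model receives as context alongside the question.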

The GPT4All-J wrapper was introduced in LangChain version 0.0.162, which makes it possible to use GPT4All-J within the LangChain framework.

Overall, this approach allows you to run the entire process locally, without sharing your data, and achieve satisfactory performance by leveraging local models and the capabilities of LangChain.

Here is the link to the repo: https://github.com/imartinez/privateGPT
