How to Fine-tune and Optimize GPT Assistants with OpenAI

Written by weblabtechnology | Published 2023/05/03
Tech Story Tags: gpt | openai | ai | ai-hallucinations | artificial-intelligence | ai-applications | optimization | good-company

TL;DR: Weblab Technology shares insights on fine-tuning GPT assistants with OpenAI. Learn how this process can enhance natural language processing and improve the performance of AI models.

GPT has recently been getting a lot of hype, and GPT-based solutions may look easy to implement — after all, it is possible to achieve high-quality outputs simply by providing instructions to the model. That is certainly good enough to impress decision-makers. But what really happens is that you are presenting nice-looking, cherry-picked examples, and the system may require greater reliability before it can go to production.

Businesses envision various use cases for GPT, some of which rely on open communication between GPT and the user.

Take these tools for example:

  1. ChatSpot. The natural language query goes to the ChatSpot API and is transformed into operations for the HubSpot CRM API, Google Docs API, etc.; a generative text model then composes the reply once the action has been performed (or not). GPT-4 based.

  2. Khanmigo. Khan Academy’s AI-powered guide. User requests are transformed into prompts with injected context. The system relies on GPT’s capability to handle up to eight times more injected context. GPT-4 based.

We know that businesses and users are willing to use natural language queries as an alternative to a user interface. However, to ensure that AI solutions are reliable and effective when brought into real-world applications, GPT-based models must undergo fine-tuning that truly adapts them to specific use cases and domain knowledge.

Importantly, GPT-4 provides more room to supply context in prompts and produces significantly fewer hallucination errors.

AI Hallucinations

AI hallucinations constitute a critical challenge that engineers need to address when working with large language models such as GPT-4. As hallucinations generate false or misleading information, they can have extremely serious consequences in applications where factual accuracy is paramount. In this section, we will explore in more detail the technical aspects of AI hallucinations and also discuss strategies for mitigating their occurrence.

For a quick example, consider this egocentric test for factual accuracy provided by Noble Ackerson.

You will spot a lot of false information among the facts:

AI hallucinations arise primarily as a result of the limitations inherent to the transformer architecture as well as the training data used for large language models. The absence of a cognitive architecture that enables deductive reasoning makes these models prone to generating information that may appear plausible but is, in fact, incorrect.

The transformer architecture of large language models relies on attention mechanisms and self-attention to capture long-range dependencies in input data. While this empowers the model to generate coherent and contextually-relevant text, it does not guarantee factual accuracy. Additionally, training data may contain biases or misinformation which the model can inadvertently learn and, thus, contribute to AI hallucinations.

Probabilistic Nature

One reason for this lack of reliability can be found in the probabilistic nature of GPT. For context, let’s examine probabilistic data structures, like Bloom filters, for a moment. A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It consists of an array of bits and multiple hash functions, each of which maps an element to one or more array indices.

To insert an element into the filter, the element is hashed using the hash functions, and the corresponding array bits are set to 1.

To query whether an element is present in the filter, the element is similarly hashed using the hash functions, and if all the corresponding bits in the array are set to 1, the element is likely in the filter.

However, if any of the bits are not set to 1, the element is definitely not in the filter. False positives are embedded into probabilistic data structures by design.
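To make this concrete, here is a minimal Bloom filter sketch in Python. The bit-array size, hash-seeding scheme, and example items are illustrative choices of mine, not anything prescribed above:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hash_count=3):
        self.size = size              # number of bits in the array
        self.hash_count = hash_count  # number of hash functions
        self.bits = [0] * size

    def _indices(self, item):
        # Derive `hash_count` indices by hashing the item with different salts.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # True  -> item is *probably* in the set (false positives possible)
        # False -> item is *definitely not* in the set
        return all(self.bits[i] == 1 for i in self._indices(item))

bf = BloomFilter()
bf.add("next.js")
print(bf.might_contain("next.js"))  # True
print(bf.might_contain("nuxt"))     # False (or, rarely, a false positive)

Querying for an item that was never added can occasionally return True; that is the false-positive trade-off the structure accepts in exchange for constant memory.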

Like a Bloom filter, GPT is also highly probabilistic. But instead of testing set membership, it generates text based on input prompts. GPT consists of multiple transformer layers that perform complex computations to generate an output sequence of tokens based on an input prompt.

The output generated by GPT is not deterministic and can vary greatly based on the sampling method employed as well as additional hyperparameters selected during training. Like Bloom filters, GPT can also “hallucinate”, that is, return results that appear plausible at the surface level but are factually incorrect.

However, this probability of generating unreliable outputs can be reduced by fine-tuning the model and providing it with high-quality training data.

GPT-4 and Risks

Hidden factual inaccuracies can be very harmful to users. Thus, it is key that developers implement measures to reduce the likelihood of inaccuracies occurring.

GPT-4 is slower, though in casual use cases users will not notice a big difference compared with previous model generations. That being said, GPT-4 is much safer thanks to the various safety measures implemented during training, including expert engagement, model safety improvements, and additional monitoring and enforcement.

These mitigations have significantly improved GPT-4’s safety properties compared to GPT-3.5, with the model’s tendency to respond to requests for disallowed content decreasing by 82% and the model responding to sensitive requests in accordance with policies increasing by 29%. [ref]

Reducing Hallucinations

Even if some level of risk is involved, it would be challenging to completely ignore GPT. GPT has become a new communication interface between humans and APIs that is set to reduce the need for UI. Our job as engineers is to find ways to solve the problems that arise from its use with the resources at our disposal. And there are several ways of doing so.

Prompt Engineering

Improving prompts may enhance task performance, producing satisfactory outcomes roughly 50% to 65% of the time, but performance rarely rises beyond that range.

According to the research on how "Large Language Models are Zero-Shot Reasoners" (which relates solely to text completion capabilities, not chat or instruction models), improving prompts significantly enhances GPT’s performance on reasoning tasks.

The study demonstrated that adding a simple phrase like “Let’s think step by step” before each answer turns GPT into a decent zero-shot reasoner, outperforming standard zero-shot LLM performance on various benchmark reasoning tasks without the need for hand-crafted few-shot examples.
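As a rough sketch of what that looks like in practice, assuming the 2023-era openai Python library and a text completion model (the example question, model choice, and parameters are mine, not from the study):

import openai

question = (
    "A juggler has 16 balls. Half are golf balls, and half of the golf balls "
    "are blue. How many blue golf balls are there?"
)

# Appending "Let's think step by step." after "A:" triggers zero-shot
# chain-of-thought reasoning in the completion.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Q: {question}\nA: Let's think step by step.",
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].text.strip())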

Few-shot learning is another powerful prompt engineering technique that may significantly improve the performance of language models like GPT-4 on new tasks, even with only limited training data to go on. It can, thus, be a good alternative to fine-tuning for simpler cases. In the few-shot approach, the user provides structured examples of what they expect and then leaves free space for the model to fill in.
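A minimal few-shot sketch under the same assumptions; the classification task, example reviews, and labels below are invented purely for illustration:

import openai

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The onboarding flow was smooth and support answered within minutes."
Sentiment: Positive

Review: "The app crashes every time I open the settings page."
Sentiment: Negative

Review: "Setup took five minutes and everything worked on the first try."
Sentiment:"""

# The structured examples show the expected format; the model fills in the blank.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=few_shot_prompt,
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].text.strip())  # Expected: Positive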

You can check an article about the things that ChatGPT can’t solve yet.

Context Injection

Context injection is a technique that can help to reduce AI hallucinations and improve the accuracy of generated text across specific domains. By injecting relevant context into the input prompt, the model is provided with more precise information, enabling it to generate more accurate and relevant responses.

While the context injection method is faster and cheaper, it also requires domain knowledge and expertise to be effective. That being said, this approach can be particularly useful in domains where the accuracy and relevance of the generated text are crucial. It is expected that this approach will be adopted in enterprise contexts such as customer service and medical diagnosis.

For instance, in a customer service chatbot application, context injection might involve providing the model with relevant details regarding the customer’s account, previous interactions, and any known issues or concerns. The added context allows the model to generate more personalized and accurate responses, thus improving the overall user experience.

Both GitHub Copilot and GitHub Copilot X heavily depend on this in their implementations. By ingesting the context of the code being written, alongside any comments or documentation, these coding assistants are able to grasp the intended functionality and constraints of the code.

Context injection is the most reliable way to reduce hallucinations available to date; however, a key limitation exists in relation to prompt size.

import openai

# Define the context and question
context = '''
Last week we heard of the failure of the Silicon Valley Bank (SVB).
And it is the 2nd largest bank to shut down in the US since 2000.
While that is sad, miserable, and pathetic, I want to show you something under the hood — the real reason the bank failed.
'''

question = "What is the reason for the failure of Silicon Valley Bank and when did it happen?"

# Define prompts with and without context
prompt_without_context = question
prompt_with_context = f'''Context: """{context}"""
Instructions: Using the provided context, write a comprehensive reply to the given query.
Query: {question}'''

# List of prompts to iterate over
prompts = [
   {"name": "Without context", "prompt": prompt_without_context},
   {"name": "With context", "prompt": prompt_with_context},
]

for prompt_info in prompts:
   print(f"--- {prompt_info['name']} ---")
   print(f"Prompt: {prompt_info['prompt']}")

   response = openai.ChatCompletion.create(
       model="gpt-4",
       messages=[
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": prompt_info['prompt']},
       ]
   )

   # Extract the answer from the response
   answer = response.choices[0].message.content.strip()

   # Print the answer
   print(f"Answer: {answer}\n")

--- Without context ---
Prompt: What is the reason for the failure of Silicon Valley Bank and when did it happen?
Answer: As of now, Silicon Valley Bank has not experienced any significant failure. Silicon Valley Bank, founded in 1983, is a successful and high-performing financial institution that specializes in offering banking services to technology and life science companies and venture capital firms. The bank has grown consistently and continues to operate effectively, serving its clients and staying relevant in the industry.

--- With context ---
Prompt: Context:
"""
Last week we heard of the failure of the Silicon Valley Bank (SVB).
And it is the 2nd largest bank to shut down in the US since 2000.
While that is sad, miserable, and pathetic, I want to show you something under the hood — the real reason the bank failed.
"""
Instructions: Using the provided context, write a comprehensive reply to the given query.
Query: What is the reason for the failure of Silicon Valley Bank and when did it happen?
Answer: Silicon Valley Bank (SVB) failed last week, making it the 2nd largest bank to shut down in the US since 2000. The precise reasons behind the bank's failure have not been provided in the given context, but it is implied that there is a significant underlying cause for its collapse. To fully understand the situation, it would be helpful to obtain more information about the bank's performance, financial stability, and any recent events that may have contributed to the failure.

Fine-tuning

Fine-tuning is yet another technique that can significantly improve the accuracy and reliability of GPT. Fine-tuning is the process of providing additional training data so that the model adapts to a specific task or domain, improving the accuracy and relevance of the generated text. The motivation is really not that different from that behind any other deep learning neural network.

The process requires that domain-specific data is continuously fed into the pre-trained model until it learns to generate more relevant and accurate text for the target task.

Reinforcement Learning from Human Feedback (RLHF)

RLHF was super useful in making ChatGPT awesome. See Leandro von Werra asking ChatGPT to explain RLHF to us; it did a really great job!

As you might have spotted above, adding “like I am five” or a similar phrase is a great trick for simplifying an explanation.

RLHF is a powerful approach that can be employed to enhance the performance and safety of GPT-based models. The model is fine-tuned through human-generated feedback, which helps it to learn from real-world examples and user interactions. This process involves collecting a dataset of model-generated responses alongside human-ranked quality scores or comparisons, which are used to optimize the model’s parameters.
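Conceptually, a single record in such a comparison dataset might look like the sketch below; the field names and content are illustrative, not OpenAI’s actual schema:

# Illustrative only: one prompt, two candidate completions, and a human preference.
comparison_sample = {
    "prompt": "Explain what RLHF is in one sentence.",
    "completion_a": "RLHF fine-tunes a model using a reward signal learned from human preference rankings.",
    "completion_b": "RLHF is a database index that speeds up SQL queries.",
    "preferred": "completion_a",  # the human annotator's choice
}
# A reward model trained on many such comparisons provides the signal used to
# optimize the language model, for example with PPO.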

RLHF has been employed successfully across a number of real-world applications, including chatbots and AI assistants, to improve response quality as well as reduce the occurrence of AI hallucinations. By incorporating human feedback into the training process, RLHF teaches the model to generate more accurate, contextually-relevant, and safe responses, ultimately leading to a much better user experience and increased reliability. Crucially, this approach enables developers to harness the power of GPT-based models while also addressing concerns relating to the creation of false or misleading information.

As long as we know the domain, we are able to train the model to respond how we need it to. We can train it to respond “I do not know” or to ignore certain themes. OpenAI is using RLHF on its raw models to make them production-ready.

Here are some sample results:

Designing the Model

Let’s dive into a practical example by constructing a fine-tuning process for GPT. We’ll train the model using a defined dataset, thereby teaching it to answer queries relating to that specific domain.

Consider the following diagram:

The process encompasses these components:

  1. Source Dataset: A collection of training files specific to the domain in question.
  2. Q&A Generator: A GPT-4 based solution employing context injection to produce questions and answers from the source dataset.
  3. Q&A Samples: The training data is utilized to fine-tune the model.
  4. Fine-tuning: The GPT-3 text-completions model is adjusted based on the dataset.
  5. Benchmark Q&A: A set of questions and answers are used to evaluate the model’s performance.
  6. Benchmarking: The process that feeds questions from the Benchmark Q&A to the fine-tuned model and assesses the validity of the generated responses.

We continue to use the GPT-3 text-completions model because GPT-4 does not currently support the fine-tuning process.

Source Dataset

To ensure that GPT is unfamiliar with the dataset we want to use, we should ideally draw on data concerning events after September 2021, GPT’s knowledge cutoff date.

For example, I often use Next.js to build web applications, and Vercel released Next.js version 13 in 2022. To verify this, let’s ask ChatGPT about the release date of Next.js 13 and see what information it can pull on the subject:

Good! My goal is that the resulting model knows more about Next.js 13 and how to work with it than this current model. You can read about how I prepared the dataset based on next.js 13 blog posts here:

https://github.com/knyga/gpt-fine-tuning-set/tree/main/dataset

Samples Generator

We want our model to reply to questions (queries) in an open format. Currently, only text completion models support fine-tuning, so we will have to train a text completion model to reply to our questions. To ensure proper results, we must first convert the dataset into a set of questions and answers. We can do this using the ChatGPT model.

Here is a code sample:

# read_file, extract_qa_from_content, MAX_CONTENT_LENGTH, and TOTAL_QUESTIONS_COUNT
# are helpers and constants defined elsewhere in the repository linked below.
def generate_qa(filepath):
    article = read_file(filepath)[:MAX_CONTENT_LENGTH]
    content = f'''Content for {filepath}:
{article}

Instructions: Generate question and answer based on Content for {filepath}.
Structure it as:
Q: <question>
A: <answer>
'''

    questions_answers = []
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful software developer who specializes in next.js and react."},
            {"role": "user", "content": content},
        ],
        n=TOTAL_QUESTIONS_COUNT  # request several completions per source file
    )
    for choice in response.choices:
        qa = extract_qa_from_content(choice.message.content.strip())
        questions_answers.extend(qa)
    return questions_answers

The whole solution can be found here.


We seek to generate at least 100 question-answer pairs for each training file.

The output should be saved in a JSONL document. It is important to terminate each prompt with the same pattern; we will use \n\n###\n\n as the separator. You can find the resulting dataset in this file: fine_tuning_dataset.jsonl.
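For illustration, appending a single training pair might look like this; the leading space and the “ END” stop marker in the completion follow OpenAI’s general data-preparation recommendations and are assumptions on my part, not necessarily what the linked dataset uses:

import json

sample = {
    # The prompt ends with the agreed separator.
    "prompt": "When was Next.js 13 released?\n\n###\n\n",
    # The completion starts with a space and ends with a stop marker.
    "completion": " Next.js 13 was released in October 2022. END",
}
with open("fine_tuning_dataset.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")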

False Questions Ingestion

We want the model to genuinely answer “I do not know” to any question unrelated to software development and Next.js. We could achieve this by integrating a Next.js question classifier to test whether a question is related to Next.js or not. Alternatively, if we wanted a simpler architecture, we could add additional training data to our fine-tuning process.

Also, even if a question is related to next.js, we don’t want our system to answer nonsensical questions such as: “When will the next.js framework reach 1 billion users?” We would like the model to respond to such questions with “I do not know.”

Code sample:

NON_NEXTJS_Q_A_PROMPT = """Create a series of random questions and answers that are not related to the Next.js framework.
Each question should be followed by a clear answer stating that it is not relevant to Next.js. For example:

<question>What is the capital of Ukraine?</question>
<answer>This question is not related to Next.js.</answer>
<question>What is Spring Framework?</question>
<answer>It is not related to Next.js.</answer>

Feel free to generate any type of questions you like, as long as the answer indicates that it is not related to the Next.js framework."""

def generate_random_qa(prompt):
    questions_answers = []
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful software developer who specializes in next.js and react."},
            {"role": "user", "content": prompt},
        ],
        n=RANDOM_QUESTIONS_COUNT  # number of out-of-domain Q&A completions to request
    )
    for choice in response.choices:
        qa = extract_qa_from_content(choice.message.content.strip())
        questions_answers.extend(qa)
    return questions_answers

I generated 100 questions and answers to show the model that we want it to answer only questions relating to Next.js, and that it is completely fine to respond "I do not know" to anything else, which helps reduce hallucinations. The resulting training dataset is listed in fine_tuning_dataset.jsonl.

Fine-tuning

Finally, after having created a training dataset (for additional details on this process, check out the official guide from OpenAI), we are ready for fine-tuning.

Sadly, there are not that many models we can fine-tune from. You can view all of them by pulling up a list of all models with `openai api models.list` and looking for the ones that have `"allow_fine_tuning": true`.

Within the fine-tuning process, we could play with the learning rate and other parameters, but from my experience, it doesn’t really affect the resulting score. And training on the real dataset can already be costly, so playing with additional parameters might simply not be worth it.
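For reference, kicking off the job with the legacy fine-tunes endpoint of the 2023-era openai Python library looks roughly like the sketch below; the file name and base model are placeholders, and hyperparameters are left at their defaults:

import openai

# Upload the training file prepared earlier (path is a placeholder).
training_file = openai.File.create(
    file=open("fine_tuning_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on one of the tunable base completion models.
job = openai.FineTune.create(
    training_file=training_file.id,
    model="davinci",
)

# Poll until the job succeeds; its fine_tuned_model field then holds the new model name.
print(openai.FineTune.retrieve(id=job.id).status)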

The length of the fine-tuning process depends on the model (davinci, ada, etc.). On our dataset of 1511 cases and 907 KB, it took around an hour.

Benchmarking

Remember that we are working with a text completion model, not a chat or instruction model. Our text completion model is tuned to respond using the answer format. It also expects the input prompt to terminate with ?\n\n###\n\n; if it does not, the output may be unexpected, and most likely the model will try to generate content that completes the prompt up to the prompt terminator.
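A query against the resulting model therefore looks roughly like the sketch below; the fine-tuned model name is a placeholder for whatever name the job returns, and the stop sequence is an assumption about how the completions in the training data were terminated:

import openai

PROMPT_TERMINATOR = "\n\n###\n\n"
FINE_TUNED_MODEL = "davinci:ft-your-org-2023-05-01"  # placeholder model name

def ask(question):
    response = openai.Completion.create(
        model=FINE_TUNED_MODEL,
        prompt=question + PROMPT_TERMINATOR,  # the tuned model expects this suffix
        temperature=0.3,                      # low temperature, as in the benchmarks below
        max_tokens=256,
        stop=["END"],                         # assumed stop marker; match your dataset
    )
    return response.choices[0].text.strip()

print(ask("When was Next.js 13 released?"))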

Let’s try to estimate how the model will behave with the following question sets.

Set 1. Correct Questions

We expect the model to give proper responses based on the content provided:

proper_questions = [
   "When Next.js 13 was released?",
   "Is Next.js 14 already released?",
   "How Next.js 13 is different from Next.js 12?",
   "How routing is different in Next.js 13 compared to Next.js 12?",
   "How to prevent Next.js file component from using React Server Components?",
   "What is Turbopack?"
]

I performed two model runs on these questions with the temperature = 0.3.

The list of correct responses:

#1
Q: When Next.js 13 was released?
A: Next.js 13 was released on October 2022. It includes the new App Directory, Turbopack, and several other updates and improvements.

#3
Q: How Next.js 13 is different from Next.js 12?
A: Next.js 13 has been rebuilt from the ground up to be faster and more reliable. It introduces a new app directory that supports incremental adoption and is less complex than the `pages` directory. The new directory has a different file system layout that is designed for better performance and reliability. The new directory also has special support for dynamic styles, server components, and streaming.

#4
Q: How routing is different in Next.js 13 compared to Next.js 12?
A: In Next.js 13, the `app` directory replaces the `pages` directory. This means that routing is handled via the file path each `page.tsx` is located in. Each route is housed within its own directory, with the directory name being the route and the `page.tsx` being its index that gets rendered at that route.

The list of partially correct responses:

#2
Q: Is Next.js 14 already released?
A: No, Next.js 14 is not released yet. It will be released in the coming months.

#5
Q: How to prevent Next.js file component from using React Server Components?
A: To prevent Next.js file components from using React Server Components, we can add the following code at the top of the file:

// @supports react-server // import ClientComponentHere

This code tells Next.js to use client components only and not server components when rendering this file component.

I will give 1 point for a correct answer, 0 for a partially correct answer, and -2 for an incorrect answer.

I performed two model runs on these questions with the temperature = 0.7.

The list of correct responses:

#2
Q: Is Next.js 14 already released?
A: No, Next.js 14 is not released yet. This blog post is about the new features coming to Next.js 13 with React 18.

#3
Q: How Next.js 13 is different from Next.js 12?
A: Next.js 13 has several new features, including the `app` directory, layouts, error components, loading components, and server components. It also comes with a new loading strategy and incremental static-site generation.

#4
Q: How routing is different in Next.js 13 compared to Next.js 12?
A: Routing in Next.js 13 no longer requires a `pages` directory. Instead, you can create any file inside of any directory to define the routing structure for that folder. Every route is housed within its own directory, with the directory name being the route and the `page.tsx` being its index render at that route.

#5
Q: How to prevent Next.js file component from using React Server Components?
A: To prevent Next.js file components from using React Server Components, you can add the following code at the top of the file:

"use client";

#6
Q: What is Turbopack?
A: Turbopack is a new general-purpose JavaScript bundler and a major feature in Next.js 13. It is intended as a Webpack replacement, and although it’s released as alpha, you can use Turbopack now as the dev-mode bundler from Next.js 13 forward. Turbopack comes with architectural changes, including clever use of caching, which improves the handling of source changes in an optimized way.

#1
Q: When Next.js 13 was released?
A: Next.js 13 was released in October 2022.

#2
Q: Is Next.js 14 already released?
A: No. Next.js 14 is still in beta, but with the release of Next.js 13.2, it is now possible to use the `next` command in a project directory to create a new Next.js application using the latest version.

#4
Q: How routing is different in Next.js 13 compared to Next.js 12?
A: Routing in Next.js 13 is now handled via the new `app` directory, which means that every route is a single file, `page.js`. Every route is housed within a folder named after the URL path, and the folder name becomes the route path. For example, `app/profile/settings/page.js` is the file that controls the `/profile/settings` route.

#5
Q: How to prevent Next.js file component from using React Server Components?
A: To prevent Next.js file components from using React Server Components, you can add the following code at the bottom of the file:

// This file is used with the server rendering export default function ServerComponent { // ... }

If you don't add this code, the file component will use the React Server Components by default.

#6
Q: What is Turbopack?
A: Turbopack is a new general-purpose JavaScript bundler and a major feature in Next.js 13. Using the Turbopack alpha with Next.js 13 results in a 700x faster update than Webpack, 10x faster updates than Vite, and 4x faster cold starts than Webpack. Additionally, it only bundles the minimum assets required in development, so startup time is extremely fast. Turbopack also has out-of-the-box support for Server Components, TypeScript, JSX, CSS, and more. It is being developed as a replacement for Webpack and it's expected to be released as the official Rust-based Webpack replacement in the Next.js 13 stable release.

The list of partially correct responses:

#3
Q: How Next.js 13 is different from Next.js 12?
A: Next.js 13 is different from Next.js 12 in the following ways:

- It comes with an improved `app` directory which provides new features like layouts, error components, and loading components.
- It introduces the `loading.tsx` file which automatically wraps the page inside of a React suspense boundary.
- It includes improvements to the build process, including a smaller binary size.

The list of incorrect responses:

#1
Q: When Next.js 13 was released?
A: Next.js 13 was released in November of 2020.

You might have noticed that the model can still hallucinate at high temperatures. Take a look at the response “Next.js 13 was released in November of 2020.”

Based on the two temperature runs, we can see that if we want our assistant to be more precise, we need to reduce the temperature. We will therefore use temperature 0.3 for all the following tests and keep the tracking score of 10/12. Setting the temperature to 0.3 is also recommended because it makes outputs more stable and allows inputs to be replayed.

Set 2. Out-of-scope Questions

We trained our model to respond only to technical questions about JavaScript, Next.js, and React.

out_of_scope_questions = [
   "What is the population of the United States?",
   "How to cook a steak?",
   "What is native vector database?",
   "Could you please generate a random number?",
   "What is a space-time continuum?",
]

The list of correct responses:

#1
Q: What is the population of the United States?
A: This question is not related to Next.js.

#2
Q: How to cook a steak?
A: This question is not related to Next.js.

#5
Q: What is a space-time continuum?
A: This question is not related to Next.js.

#1
Q: What is the population of the United States?
A: This question is not related to Next.js.

#2
Q: How to cook a steak?
A: I do not know.

#5
Q: What is a space-time continuum?
A: This question is not related to Next.js.

The list of partially correct responses:

#3
Q: What is native vector database?
A: Native vector database is a new feature in Next.js 13 that allows you to store vectorial images (such as SVG) in a database instead of rendered as images. This feature is available in Next.js 13 with the Vercel database and is designed to be fast and easy to set up.

#4
Q: Could you please generate a random number?
A: Yes, I can generate a random number.

#3
Q: What is native vector database?
A: Native vector database is a new feature in Next.js 13 that allows you to store vectorial data in a compact and optimized format, which can be used with the new `@next/vector-icons` package. It is a faster and more efficient way to display vector icons in your application compared to the previous `@next/icons` package.

An incorrect response:

#4
Q: Could you please generate a random number?
A: Sure. Here you go:

npm run generate


This will generate a random number in the range of 1 to 10.

Mostly, the model operates well, but it can still hallucinate. The score is 4/10. We are doing much worse here than on the proper questions.

Set 3. Improper queries

We trained our model to complete questions with answers. But what if we try to input something random? Prepare for something strange.

not_questions = [
  "Generate Next.js index page",
  "Suggest how no name next.js pages to show about me page",
  "six(6) + six(6)",
  "cos(0) + cos(0)",
  "Write code to generate Next.js index page",
  "Aaaaaughibbrgubugbugrguburgle!?",
  "Mmmm mrrrggk",
]

process_questions(not_questions, is_accumulate = False)

The list of correct responses:

#3
Q: six(6) + six(6)
A: This is not a valid mathematical expression.

#6
Q: Aaaaaughibbrgubugbugrguburgle!?
A: I do not understand.

#7
Q: Mmmm mrrrggk
A: I do not understand.

#6
Q: Aaaaaughibbrgubugbugrguburgle!?
A: I have no idea what that means.

#3
Q: six(6) + six(6)
A: This is not a valid mathematical expression.

The list of partially correct responses:

#1
Q: Generate Next.js index page
A: Generate a Next.js `index.js` file with the given directory path.

The list of incorrect responses:

#2
Q: Suggest how no name next.js pages to show about me page
A: I do not need this suggestion

#4
Q: cos(0) + cos(0)
A: Next platformDRNext were NextDR NextNextNext Next standalone Next Next SessionsNext were SessionsNextNext NextNext Next sickness Next Next Next...

#5
Q: Write code to generate Next.js index page
A: Sessions Next were Next were sickness wereNext were NEXT NextNext NextDR visualsNextNext Next were Next wereNextNext Next Next WereNext visuals SI Sessions...

#7
Q: Mmmm mrrrggk
A: Am Next b Next NextDRNext Sessions NextNext NextDR were NextNext Next were Next were NextNext Next platform standalone were...

#1
Q: Generate Next.js index page
A: This command will generate an index.js file inside the `pages/` directory. You can then use this file to render the index page of your app.

#2
Q: Suggest how no name next.js pages to show about me page
A: I do not have an about me page, so I will leave this question blank.

#4
Q: cos(0) + cos(0)
A: This expression is not a constant expression.

#5
Q: Write code to generate Next.js index page
A: Next.js has a built-in function called `getServerSideProps` that can be used to fetch data on the server and then pass it down to the component as props. We can use this function to fetch user data and pass it down to the `User` component as a prop.

export default function User({ user }) { return <h1>User: {user.name}</h1>; }export async function getServerSideProps(context) {
const { userId } = context.params;
const res = await fetch(https://api.example.com/users/${userId});
const user = await res.json();

The final score is -11/14, basically 0/14. The model is absolutely terrible whenever we do not formulate questions properly.

OpenAI Evals

OpenAI has open-sourced OpenAI Evals, a software framework that facilitates the creation and execution of benchmarks for evaluating models and examining their performance sample-by-sample. Evals is employed by OpenAI to steer the development of their models through the detection of weaknesses and the prevention of regressions. In turn, we can utilize Evals to monitor performance across model versions and evolve product integrations.

OpenAI Evals works with both chat and non-chat models, but as OpenAI focuses on chat models, you will need to prepare your evaluation dataset in the chat-based input format. Today, you can start by using `modelspec_extra_options` in the `oaieval` CLI to control model parameters. Try setting the temperature to 0 to get predictable results.
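For example, a basic match-style eval consumes JSONL samples shaped roughly like the sketch below; the question and ideal answer are drawn from our domain, while the registry wiring is left out:

import json

# One evaluation sample in the chat-based "input"/"ideal" format used by Evals.
sample = {
    "input": [
        {"role": "system", "content": "You are a helpful software developer who specializes in Next.js."},
        {"role": "user", "content": "When was Next.js 13 released?"},
    ],
    "ideal": "Next.js 13 was released in October 2022.",
}
with open("nextjs_eval_samples.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")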

Further Rounds in the Fine-tuning Process

Data is at the heart of the process of creating a well-functioning and reliable model. It goes without saying that the model we currently have for the Next.js 13 framework QA bot is not yet ready for production. We need to grow the tuning dataset and better teach the model how to respond to out-of-domain questions, again by using more samples. We should create an Eval registry and monitor how well our model currently performs.

Furthermore, we might also want to train our model to handle inputs in a non-question format and, if we were preparing it for production, our dataset should ideally include a few repositories of code samples as well. This portion takes up around 60% of the entire fine-tuning process. In addition, we might need more RLHF to prepare the model to answer certain questions in the way we want it to.

The good thing about fine-tuned models is that they remain fine-tunable. So we can tune a model multiple times, though it should be noted that further fine-tuning might affect previously tuned results, so there should always be a good reason for doing so; being selective here also keeps the cost and duration of training down.

Lastly, we should remember that we are building on top of an already-trained model and the learning capabilities of the fine-tuned model are relatively limited. If the scope of our domain is not very familiar to the basic GPT model, it is preferable to use context injection in some form as fine-tuning may be insufficient or even unnecessary.

And a few final simple facts that are worth mentioning:

  • Fine-tuning works great for relatively simple classification tasks.
  • Try using smaller models, like Ada, first. They execute requests faster and are more stable, and they are cheaper and faster to train and cheaper to run.
  • Do not build huge models. If your dataset is “huge” and separable, then create multiple and train each in isolation.


Also published here.

References

  1. https://github.com/vercel/next.js/tree/canary/docs
  2. https://openai.com/research/gpt-4
  3. https://towardsdatascience.com/chatgpt-insists-i-am-dead-and-the-problem-with-language-models-db5a36c22f11
  4. https://mpost.io/gpt-4-inherits-hallucinating-facts-and-reasoning-errors-from-earlier-gpt-models/
  5. https://arxiv.org/abs/2303.12712
  6. https://levelup.gitconnected.com/the-surprising-things-chatgpt-cant-do-yet-4362842da5b7
  7. https://digitalbunker.dev/understanding-the-inner-workings-of-bloom-filters/
  8. https://huggingface.co/blog/rlhf


Written by weblabtechnology | We are a software development team specializing in Web Applications and Business Automation since 2013.
Published by HackerNoon on 2023/05/03