Pydantic: What It Is and Why It's Useful

Pydantic is a popular open-source Python library for data validation and modeling. It offers tools to define the structure and rules of your data, ensuring its consistency and reliability. Pydantic is looking to have a lot of potential in AI, in regards to data preprocessing and cleaning.

Its ability to validate and serialize data makes it an ideal choice for handling the large and complex datasets often used in AI applications. Additionally, Pydantic’s support for type annotations and type checking can help catch errors early in the development process, making it easier to build and maintain reliable AI systems.

Not just that but Pydantic’s integration with popular AI libraries such as TensorFlow and PyTorch, allows for seamless data manipulation and model training.

Why Use Pydantic

Data Validation

Pydantic enforces data types and constraints you define, catching invalid entries before they cause issues. This is crucial in AI, where incorrect data can lead to biased or inaccurate models.

Data validation is a process that ensures the data entered into a system is correct and useful. It checks the accuracy and quality of data before it’s processed. Here are a few examples of data validation using the Pydantic library in Python:

Type Hints Validation: Pydantic uses Python-type hints to validate data. For instance, in the following code, the Fruit class has attributes name, color, weight, and bazam with specific type hints. Pydantic validates the data against these type hints.

If the data doesn’t match the type hints, a validation error is raised.

from typing import Annotated, Dict, List, Literal, Tuple
from pydantic import BaseModel

class Fruit(BaseModel):
    name: str
    color: Literal['red', 'green']
    weight: Annotated[float, Gt(0)]
    bazam: Dict[str, List[Tuple[int, bool, float]]]

print( 
    Fruit(
        name='Apple', 
        color='red', 
        weight=4.2, 
        bazam={'foobar': [(1, True, 0.1)]}
    )
)

Strict Mode Validation: Pydantic also has a strict mode where types are not coerced, and a validation error is raised unless the input data exactly matches the schema or type hint. Here’s an example:

from datetime import datetime
from pydantic import BaseModel, ValidationError

class Meeting(BaseModel):
    when: datetime
    where: bytes

try:
    m = Meeting.model_validate(
        {'when': '2020-01-01T12:00', 'where': 'home'}, 
        strict=True
    )
except ValidationError as e:
    print(e)

Custom Validators: Pydantic allows for customizing validation via functional validators. For instance, in the following code, a custom validator is used to check if the when attribute is ‘now’ and if so, it returns the current datetime.

from datetime import datetime, timezone
from pydantic import BaseModel, field_validator

class Meeting(BaseModel):
    when: datetime

    @field_validator('when', mode='wrap')
    def when_now(cls, input_value, handler):
        if input_value == 'now':
            return datetime.now()
        when = handler(input_value)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        return when

These examples demonstrate how Pydantic can be used for data validation in Python, ensuring that the data being processed matches the expected types and constraints

Data Modeling

Define the structure of your data, including nested objects and relationships. This makes it easier to work with complex data sets and helps keep your code organized.

Serialization/Deserialization

Convert data between different formats like JSON, Python dictionaries, and others. This allows seamless integration with external APIs and data sources.

How Is Pydantic Useful in AI?

One of the burgeoning challenges in the realm of artificial intelligence (AI), particularly when working with Large Language Models (LLMs), is structuring responses. These sophisticated models can generate vast quantities of unstructured data, which then necessitates meticulous organization.

This is where Pydantic, a data validation and settings management library in Python, steps in with an elegant solution.

It simplifies the formidable task by enabling developers to define a clear model for their data, ensuring that the responses from LLMs are well-structured and conform to expected formats.

Guaranteed structure output with Ollama and Pydantic. pic.twitter.com/YF8cFAsaap

— jason (@jxnlco) February 8, 2024

Leveraging Models to Structure Large Language Model Responses

When interfacing with LLMs, it’s crucial to not just receive data but to parse and utilize it effectively. Pydantic facilitates this by allowing the creation of models that serve as blueprints for the expected data.

This means that developers can predefine the structure, types, and requirements of the data they are to handle, making it easier to manage and ensuring that the information is in the correct form for further processing or analysis.

Pydantic 2.7: Optional Support for Incomplete JSON Parsing

The upcoming Pydantic version 2.7 introduces optional support for parsing and validating incomplete JSON, which is particularly beneficial for AI applications. This feature aligns perfectly with the needs of developers processing streamed responses from an LLM.

Instead of waiting for the entire payload, developers can start processing the data as it arrives, enabling real-time data handling and reducing latency in the AI system’s response.

Integration With DSPy and JSON Schemas

Furthermore, there is ongoing experimentation with combining DSPy, Pydantic types, and JSON Schemas to further enhance data validation and transformation capabilities. Such integrations broaden the potential applications of Pydantic in the AI space by leveraging the advantages of each tool, leading to more robust and versatile data-handling solutions.

OpenAI Function Calls and Query Plans

An often-underappreciated aspect of OpenAI’s capabilities is its function-calling feature that permits the generation of entire query plans. These plans can be represented by nested Pydantic objects, adding a structured and executable layer over retrieval and Reading Comprehension Answer Generator (RAG) pipelines.

By adopting this method, developers can obtain plan-and-execute capabilities that allow for handling intricate queries over assorted data sources. An example of this in practice is LlamaIndex, which capitalizes on such a layered approach to accessing and for generating structured data.

Also published here