johnsosoka.com

Weekend Project: Dynamic DNS with AWS Lambda & Jenkins

2025-03-02T00:00:00-07:00

A few months back, in December, I purchased a Unifi Ubiquity Dream Machine (UDM). This was a significant upgrade from my previous router, and enabled me to segment my home network into multiple VLANs. I typically do not like to expose services from my home network, but with a completely segmented network, I felt comfortable exposing a few services. The first service I wanted to expose publicly was a vanilla Java Minecraft server. I already had a domain name (johnsosoka.com) and wanted to set up dynamic DNS for minecraft.johnsosoka.com.

The Game Plan

My personal website is hosted on AWS, and I already have a bit of infrastructure in place, including API Gateway and a few Lambda functions. Furthermore, I have a server running Jenkins in my home network. Here’s a diagram for the project planned for today:

The Diagram above demonstrates the flow of the dynamic DNS service, the home network diagram is simplified

The plan is to create two Lambda functions, one to check/return the IP address of the caller, and another to update the DNS record in Route53. The Jenkins server will have a job that runs periodically to fetch the IP address of the home network, and if it has changed, it will call the update DNS Lambda function.

The Lambda Functions

First, we’ll create the lambda function to return the IP address of the caller. This function will be a simple Python script that returns the IP address of the caller. Here’s the code:

Check IP Lambda Function:

import json


def lambda_handler(event, context):
    """
    AWS Lambda function to return the requesting client's IP address.

    This function serves as a lightweight service similar to whatismyip.com,
    retrieving the client's IP from the API Gateway request context. It is
    designed for invocation via API Gateway with Lambda Proxy Integration.

    Args:
        event (dict): Contains the request details including the client's IP.
        context (LambdaContext): Provides runtime information.

    Returns:
        dict: An HTTP response with a JSON body containing the client's IP address.
    """
    ip_address = event.get("requestContext", {}).get("http", {}).get("sourceIp", "IP not found")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"ip": ip_address})
    }

The above lambda function is incredibly simple, and it will be invoked via API Gateway. I’ll be skipping over the Terraform code, but it’s available in the GitHub repository for this blog post.

Next up is to create the lambda function for updating the DNS record in Route53. Since this is performing a write operation, I’ll be securing the function with an API key. Here’s the code for the update DNS Lambda function:

Update DNS Lambda Function:

import os
import json
import boto3

def lambda_handler(event, context):
    """
    Update a DNS A record in Route53.

    This function supports dynamic DNS updates (for example, updating a Minecraft server's external IP).
    It expects a JSON payload with:
      - domain: the DNS record name (e.g. "minecraft.example.com.")
      - ip: the new A record value (e.g. "1.2.3.4")

    The authorization token is expected in the request headers (key "x-auth-token"). The token is verified against
    the AUTH_TOKEN environment variable.

    Returns:
        dict: HTTP response containing a status message.
    """
    expected_token = os.environ.get("AUTH_TOKEN")
    hosted_zone_id = os.environ.get("HOSTED_ZONE_ID")

    # Retrieve auth token from headers
    headers = event.get("headers", {})
    auth_token = headers.get("x-auth-token")

    if auth_token != expected_token:
        return {
            "statusCode": 403,
            "body": json.dumps({"error": "Unauthorized"})
        }

    body = event.get("body")
    if body:
        try:
            data = json.loads(body)
        except Exception:
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "Invalid JSON payload"})
            }
    else:
        data = {}

    domain = data.get("domain")
    new_ip = data.get("ip")
    if not domain or not new_ip:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Missing 'domain' or 'ip' parameter"})
        }

    if not hosted_zone_id:
        return {
            "statusCode": 500,
            "body": json.dumps({"error": "Hosted zone ID not configured"})
        }

    route53 = boto3.client("route53")
    try:
        response = route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                "Comment": "Auto-updated by update_dns_lambda",
                "Changes": [
                    {
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": domain,
                            "Type": "A",
                            "TTL": 300,
                            "ResourceRecords": [{"Value": new_ip}]
                        }
                    }
                ]
            }
        )
    except Exception as e:
        return {
            "statusCode": 500,
            "body": json.dumps({
                "error": "Failed to update DNS record",
                "message": str(e)
            }, default=str)
        }

    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "DNS record updated",
            "change_info": response
        }, default=str)
    }

There is a bit more going on with the above Lambda function, but it should still be relatively straightforward. The function will accept a JSON payload with the domain and IP address to update. The function will then update the DNS record in Route53.

I’ve secured this with a simple API key which is expected in the x-auth-token header and is verified against an environment variable. In the future, I may revisit this and use a more secure method of authentication. I’ve also considered limiting the DNS record that can be updated to a specific subdomain, but for now, I’m keeping it simple and allowing any record in the hosted zone. In the future, there may be other DNS records for self-hosted services that I want to update dynamically.

Jenkins Jobs

My Jenkins server is running in a Docker container on my home network. It already has a job for posting notifications to my family’s Discord server–I won’t be covering that job in this post, but it is referenced in the jobs we’ll be building today.

We’ll be creating two Jenkins jobs. One to check the current public IP address of the home network and another to update. Logically, we’ll build the jobs “backwards” as the Update job is called last. We’ll implement this first, so that we can reference it when checking the current IP address & name record.

The job is parameterized to accept a DNS_DOMAIN and DNS_IP parameter, for updating the DNS record. A secret, DNS_AUTH_TOKEN, has been configured in the Jenkins credentials manager. Here’s the code for the Update DNS job:

Update DNS Job:

pipeline {
    agent any

    parameters {
        string(name: 'DNS_DOMAIN', defaultValue: 'minecraft.johnsosoka.com', description: 'The DNS record to update')
        string(name: 'DNS_IP', defaultValue: 'AUTO', description: 'IP to set (AUTO uses current public IP)')
    }

    stages {
        stage('Install Dependencies') {
            steps {
                script {
                    sh '''
                        if ! command -v jq &> /dev/null; then
                            echo "🔧 Installing jq..."
                            apt-get update && apt-get install -y jq
                        fi
                    '''
                }
            }
        }

        stage('Update DNS') {
            steps {
                withCredentials([string(credentialsId: 'DNS_AUTH_TOKEN', variable: 'AUTH_TOKEN')]) {
                    script {
                        // Ensure AUTH_TOKEN is passed safely
                        env.AUTH_TOKEN = AUTH_TOKEN

                        def response = sh(script: '''
                            JSON_PAYLOAD=$(printf '{
                                "domain": "%s",
                                "ip": "%s"
                            }' "$DNS_DOMAIN" "$DNS_IP")

                            curl -s -X POST "https://api.johnsosoka.com/v1/dns/update" \
                            -H "Content-Type: application/json" \
                            -H "x-auth-token: $AUTH_TOKEN" \
                            -d "$JSON_PAYLOAD"
                        ''', returnStdout: true).trim()

                        def httpStatus = sh(script: "echo '${response}' | jq -r '.change_info.ResponseMetadata.HTTPStatusCode'", returnStdout: true).trim()
                        def changeStatus = sh(script: "echo '${response}' | jq -r '.change_info.ChangeInfo.Status'", returnStdout: true).trim()

                        if (httpStatus == "200" && (changeStatus == "PENDING" || changeStatus == "INSYNC")) {
                            echo "✅ DNS updated successfully! Status: ${changeStatus}"
                            currentBuild.description = "DNS updated: ${changeStatus}"
                            notifyDiscord("✅ DNS updated for ${DNS_DOMAIN} to ${DNS_IP}. Status: ${changeStatus}")
                        } else {
                            error "❌ DNS update failed: ${response}"
                        }
                    }
                }
            }
        }
    }
}

def notifyDiscord(message) {
    build job: 'notify-discord', parameters: [
            string(name: 'DISCORD_MESSAGE', value: message)
    ]
}

Note the notifyDiscord function at the end of the script. This is a common job that is used to post messages to Discord.

Next up is to create the Check IP Job. This job will check the current public IP address of the home network and compare it to the existing minecraft.johnsosoka.com DNS record. If the IP address has changed, the job will trigger the Update DNS job.

Check IP Job:

pipeline {
    agent any

    environment {
        DNS_DOMAIN = 'minecraft.johnsosoka.com'
        PUBLIC_IP_API = 'https://api.johnsosoka.com/v1/ip/my'
    }

    stages {
        stage('Install Dependencies') {
            steps {
                script {
                    sh '''
                        if ! command -v jq &> /dev/null || ! command -v dig &> /dev/null; then
                            echo "🔧 Installing dependencies..."
                            apt-get update && apt-get install -y jq dnsutils
                        fi
                    '''
                }
            }
        }

        stage('Check Current DNS') {
            steps {
                script {
                    echo "🔹 Checking DNS record for ${DNS_DOMAIN}..."

                    // Get the current IP from DNS
                    def dnsIp = sh(script: "dig +short ${DNS_DOMAIN} | head -n 1", returnStdout: true).trim()

                    // Get the public IP from API
                    def publicIp = sh(script: "curl -s ${PUBLIC_IP_API} | jq -r '.ip'", returnStdout: true).trim()

                    // Output results
                    echo "🔹 Current DNS IP: ${dnsIp}"
                    echo "🔹 Public IP from API: ${publicIp}"

                    // Check if DNS is outdated
                    if (dnsIp == publicIp) {
                        echo "✅ The DNS record is up to date. No action needed."
                    } else {
                        echo "⚠️ DNS record is outdated. Updating to ${publicIp}..."

                        // Trigger the update-jscom-dns job
                        build job: 'update-jscom-dns', parameters: [
                            string(name: 'DNS_DOMAIN', value: DNS_DOMAIN),
                            string(name: 'DNS_IP', value: publicIp)
                        ]

                        // Notify Discord about the update
                        notifyDiscord("⚠️ DNS Record Change Detected: ${DNS_DOMAIN} being routed to ${publicIp}")
                    }
                }
            }
        }
    }
}

// Function to notify Discord
def notifyDiscord(message) {
    build job: 'notify-discord', parameters: [
        string(name: 'DISCORD_MESSAGE', value: message)
    ]
}

I’ve configured the above job to run every hour on the hour with a cron schedule 0 * * * *. To test this out, I’ve set the DNS record to 127.0.0.1 and then executed the job. Here’s the truncated output from the Jenkins console:

🔹 Checking DNS record for minecraft.johnsosoka.com...
[Pipeline] sh
+ dig +short minecraft.johnsosoka.com
+ head -n 1
[Pipeline] sh
+ curl -s https://api.johnsosoka.com/v1/ip/my
+ jq -r .ip
[Pipeline] echo
🔹 Current DNS IP: 127.0.0.1
[Pipeline] echo
🔹 Public IP from API: 24.117.184.224
[Pipeline] echo
⚠️ DNS record is outdated. Updating to 24.117.184.224...
[Pipeline] build (Building update-jscom-dns)
Scheduling project: update-jscom-dns
Starting building: update-jscom-dns #15
Build update-jscom-dns #15 completed: SUCCESS
[Pipeline] build (Building notify-discord)
Scheduling project: notify-discord
Starting building: notify-discord #24
Build notify-discord #24 completed: SUCCESS

The job successfully detected that the DNS record was outdated and triggered the Update DNS job. The Update DNS job then successfully updated the DNS record in Route53 and posted a message to my Discord server!

Conclusion

This was a fun weekend project that I’ve been wanting to do for a while, and I’m glad I finally got around to it. I’ll be able to re-use much of this infrastructure for other self-hosted services in the future. I may eventually restrict which domains can be updated by the Lambda function, but for now, I’m keeping it simple as nothing I host is mission-critical. Another future improvement will be to host the pipeline DSL in a Jenkinsfile in the GitHub repository for this project, instead of directly in the Jenkins job configuration.

Hopefully this post has been helpful to you, and if you have any questions or suggestions, feel free to reach out via the contact form.

The full code for this project, including the Terraform, can be found on GitHub

Happy New Year!

2025-01-04T00:00:00-07:00

🎉 Happy New Year!

2024 was a big year for AI, Agents, and me. I’m excited to see what 2025 has in store. I started experimenting with LLMs via OpenAI in late 2022, and built my first LLM application around mid 2023. Around the start of 2024, I had the wonderful privilege and opportunity to focus on AI/LLM Development full-time with Commerce Architects.

This past year, building Agents & Multi-Actor systems really began to take off across the industry. While I cannot share details on my personal blog, we have shipped some really cool Agent-based applications to production for a handful of clients. Our labs team produced an AI Agent that interfaces with the commercetools platform that was named “Accelerator of the Week” by commercetools. I’m really proud of the work we’ve done and the team we’ve built.

I’ve been trying to share more knowledge as I acquire it. From a LangChain4J introduction to a Python LangGraph LLM-based query/model router, I created an example implementation of the DeepMind SELF-DISCOVER algorithm.

I’m looking forward to learning & sharing more in 2025. I think that we are on the cusp of a knowledge revolution, and this year will yield some fascinating new software patterns & breakthroughs around AI & Agent design/orchestration.

Here’s to fun and interesting 2025!

Unit Testing Large Language Models: Agentic Test Evaluation with LangChain4J

2024-07-21T00:00:00-06:00

Note: This article assumes familiarity with LangChain4j, an LLM Integration framework. For a primer on this library, you can read an introduction that I wrote here.

Unit tests are a critical part of enterprise software development. Not only do unit tests help validate the expected behavior of the code, but they also serve as a form of documentation and give developers the confidence to refactor and contribute to the codebase. I have worked on software projects lacking unit tests, and have seen the negative impact on developer confidence & productivity.

Testing Large Language Models (LLMs) is a unique challenge. Particularly because of the non-deterministic nature of these models. It isn’t always as simple as asserting that the output of a function is equal to an expected value as there can be many ways for an LLM to potentially phrase a correct answer. In today’s post, I will be walking through a handful of strategies for unit testing LLMs with LLMs. We will start simple, and then build our way up to a MultiPhaseEvaluator, which can guides a test agent through creating a test plan, executing on that plan (agent to agent interaction), and then evaluating the results.

Setup

To evaluate LLM performance in Unit Tests, we’re going to need something to test. To achieve this, I will be recreating the Hotel Booking Agent example that I built with Spring AI in a previous article. You can read the original blog post here. The project contains a simple hotel booking agent with access to tools to check availability, book rooms, and look up reservations.

The first thing I’ve done is copied the existing dummy HotelBookingService class from the Spring AI project. This class contains the logic for checking availability, booking rooms, and looking up reservations. Once copied, I needed to define the LangChain4J toolkit, which will be exposed to the booking agent. It simply wraps the HotelBookingService:

@Component
@RequiredArgsConstructor
public class BookingTools {

    private final HotelBookingService hotelBookingService;
    
    @Tool("Check Availability -- Useful for seeing if a room is available for a given date.")
    public boolean checkAvailability(String date) {
        LocalDate parsedDate = LocalDate.parse(date);
        return hotelBookingService.isAvailable(parsedDate);
    }

    @Tool("Book Room -- Useful for booking a room for a given guest name, check-in date, and check-out date.")
    public String bookRoom(String guestName, String checkInDate, String checkOutDate) {
        LocalDate checkIn = LocalDate.parse(checkInDate);
        LocalDate checkOut = LocalDate.parse(checkOutDate);
        return hotelBookingService.bookRoom(guestName, checkIn, checkOut);
    }

    @Tool("Find Booking -- Useful for finding a booking by guest name.")
    public String findBooking(String guestName) {
        return hotelBookingService.findBookingByGuestNameStr(guestName);
    }

}

Next up, I’ll define the LangChain4J AIService. This class will define the role of the agent, as well as an entrypoint to interface with the LLM. Furthermore, we can easily attach this to a @Tool exposing it to the HotelBookingAgent which is to be tested.

package com.johnsosoka.langchainbookingtests.agent;

import dev.langchain4j.service.SystemMessage;

public interface BookingAgent {

    @SystemMessage({
            "You are a booking agent for an online hotel. You are here to help customers book rooms and check ",
            "availability. Use the tools you have access to in order to help customers with their requests. You can ",
            "check availability, book rooms, and find bookings."
    })
    String chat(String message);
}

In a Spring configuration class, we will equip the agent with a toolkit, large language model (GPT-4o), and a ChatMemory.

@Configuration
public class BookingAgentConfig {

    @Value("${openai.api-key}")
    String apiKey;

    @Bean
    public ChatLanguageModel chatLanguageModel() {
        return OpenAiChatModel.builder()
                .modelName(OpenAiChatModelName.GPT_4_O)
                .apiKey(apiKey)
                .build();
    }

    @Bean
    public BookingAgent bookingAgent(BookingTools bookingTools, ChatLanguageModel chatLanguageModel) {
        return AiServices.builder(BookingAgent.class)
                .chatLanguageModel(chatLanguageModel)
                .tools(bookingTools)
                .chatMemory(MessageWindowChatMemory.withMaxMessages(50))
                .build();
    }

}

Finally, I will create an additional service class that will be used to interact with the agent. Remember, we’re just setting up a dummy application so that we have something to test–This is not a production-ready application, and as such will not support concurrent conversations.

package com.johnsosoka.langchainbookingtests.service;

import com.johnsosoka.langchainbookingtests.agent.BookingAgent;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

@Service
@RequiredArgsConstructor
public class ChatService {

    private final BookingAgent bookingAgent;

    public String chat(String message) {
        return bookingAgent.chat(message);
    }

}

The SpringAI HotelBookingAgent has now been migrated to LangChain4J! We can now begin writing unit tests for the agent.

Unit Testing

The HotelBookingService has two hardcoded dates: January 15, 2025 (available) and February 28, 2025 (unavailable). We can use these dates to test the agent’s ability to check availability, book rooms, and find bookings.

Testing Without Agents

To begin, I’ll set up an integration test for the ChatService, and evaluate the response using contains to assert that the agent’s response contains the expected output.

@SpringBootTest
@Slf4j
class ChatServiceTestIT {

    @Autowired
    private ChatService chatService;

    @Test
    public void checkAvailability() {
        String response = chatService.chat("Is the hotel available on 2022-12-12?");
        log.info("Response: {}", response);
        assertTrue(response.contains("not available"));
    }
}

When this test executes, the agent will respond with a message indicating that the hotel is not available on the given date. Here are sample outputs from three different execution runs of the test:

Response: The hotel is not available on 2022-12-12. Would you like to check for other dates or make a 
booking for different dates?

Response: The hotel is not available on 2022-12-12. Would you like to check for alternative dates or 
make a booking for a different date?

Response: I'm sorry, but the hotel is not available on 2022-12-12. Is there another date you would 
like to check for availability?

You can see that the agent’s response can vary slightly, due to the non-deterministic nature of the language model. While we could potentially assert that the response contains the words “not available,” this would be a brittle test. Instead, we can use a more robust approach by creating an LLM Evaluator agent.

Simple Agent-Based Evaluation

We can use an agent-based approach to evaluate the agent’s responses. This approach involves creating an agent that can be tasked with evaluating the responses of another agent. The evaluator agent will be provided with the conditions that the response must meet as well as the response itself to evaluate.

Let’s first define the TestEvaluationAgent interface:

public interface TestEvaluationAgent {

    @SystemMessage({
            "You purpose is to evaluate the results of a test. You will be employed in a unit testing environment, ",
            "and must critically evaluate the provided conditions and results to determine if the test has passed or ",
            "failed. Consider a passing test True, and a failing test False."
    })
    @UserMessage({
            "Evaluate the following:\n",
            "Condition: {{condition}}\n",
            "-----\n",
            "Results: {{result}}",
    })
    public Boolean evaluate(@V("condition") String condition, @V("result") String result);
}

In the above, you can see how we’re defining the “profile” or “role” of the TestEvaluationAgent. The @SystemMessage annotation clearly explains to the LLM what its purpose is. The @UserMessage annotation provides a template for the agent to use when evaluating the results.

For a quick test, let’s wire up this agent to evaluate the response of the ChatService test we wrote earlier:

...
    @Autowired
    private ChatLanguageModel chatLanguageModel;

    private TestEvaluationAgent testEvaluationAgent;

    @BeforeEach
    public void setUp(){
        testEvaluationAgent = provisionEvaluationAgent();
    }

    @Test
    public void checkAvailability_withTestEvaluationAgent() {
        String response = chatService.chat("Is the hotel available on 2025-02-28?");
        log.info("Response: {}", response);

        String condition = "It should be determined that there are no hotel rooms available on 2025-02-28";
        Boolean evaluationResult = testEvaluationAgent.evaluate(condition, response);
        assertTrue(evaluationResult);
    }


    private TestEvaluationAgent provisionEvaluationAgent() {
        return AiServices.builder(TestEvaluationAgent.class)
                .chatLanguageModel(chatLanguageModel)
                .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
                .build();
    }
...

The above test will pass the response of the BookingAgent to the TestEvaluationAgent along with the conditions of satisfaction for evaluation. The TestEvaluationAgent will then evaluate the response and return a boolean value, True if the response meets the conditions, and False if it does not.

You may have noticed that the TestEvaluationAgent is provisioned using the existing ChatLanguageModel defined in the Spring configuration from earlier. It is worth noting that developers are not limited to re-using models. There are new fine-tuned models being released frequently that may be better suited for specific tasks like hallucination detection or critiquing.

Multi-Pass Agent Evaluation

Our TestEvaluationAgent is also subject to the non-deterministic nature of the language model. To mitigate this, we can use a multi-pass evaluation strategy. This strategy will involve evaluating the result multiple times and taking the majority vote as the final result.

Below is a simple implementation of the multi-pass evaluation strategy:

@Builder
@Slf4j
public class MultiPassEvaluator {

    private TestEvaluationAgent testEvaluationAgent;
    // The total number of times to evaluate the result
    private Integer passCount;

    public Boolean evaluate(String condition, String result) {
        Boolean evaluationResult = false;
        int successCount = 0;
        for (int i = 0; i < passCount; i++) {
            boolean evaluation = testEvaluationAgent.evaluate(condition, result);
            if (evaluation) {
                successCount++;
                log.info("Evaluation {} passed", i);
            } else {
                log.info("Evaluation {} failed", i);
            }
        }
        // If more than half of the evaluations are successful, then the test is considered successful
        return successCount >= passCount / 2;
    }

}

I’ll wire this up to another test:

...
@BeforeEach
public void setUp(){
    testEvaluationAgent = provisionEvaluationAgent();
    multiPassEvaluator = MultiPassEvaluator.builder()
            .testEvaluationAgent(testEvaluationAgent)
            .passCount(3)
            .build();
}

@Test
public void checkAvailability_withMultiPassEvaluator() {
    String response = chatService.chat("Is the hotel available on 2025-02-28?");
    log.info("Response: {}", response);

    String condition = "It should be determined that there are no hotel rooms available on 2025-02-28";
    Boolean evaluationResult = multiPassEvaluator.evaluate(condition, response);
    assertTrue(evaluationResult);
}
...

Here is the output from a test run:

2024-07-21T17:05:35.367-06:00  INFO c.j.l.service.ChatServiceTestIT          : Response: The hotel is not available on 2025-02-28. If you would like to check availability for another date or have any other requests, please let me know!
2024-07-21T17:05:35.925-06:00  INFO c.j.l.helper.MultiPassEvaluator          : Evaluation 0 failed
2024-07-21T17:05:36.430-06:00  INFO c.j.l.helper.MultiPassEvaluator          : Evaluation 1 passed
2024-07-21T17:05:36.829-06:00  INFO c.j.l.helper.MultiPassEvaluator          : Evaluation 2 passed

Interestingly enough, the evaluation failed on the first pass, but passed on the following two passes. This is largely why we use a multi-pass evaluation strategy. It helps to mitigate the non-deterministic nature of the LLM tasked with evaluating the results. In a production environment, you may want to increase the number of passes and potentially tweak the temperature of the underlying ChatLanguageModel to improve evaluation accuracy.

Multi-Phase Agent Evaluation (Plan, Test & Evaluate)

The final strategy that I’ll cover in this article is Multi-Phase Agent Evaluation. With this strategy, instead of performing the same evaluation multiple times, we will instead guide an agent through multiple phases: Planning, Execution & Evaluation.

We will continue utilizing an LLM to evaluate our BookingAgent LLM, which is exposed via the ChatService. This Agent will be provided a description for the expected behavior of the system, and it will both generate a test plan and execute on that plan.

The TestAgent will be able to interact with the BookingAgent by exposing it as a @Tool to the QA agent. The TestAgent will then be able to chat with the BookingAgent like a customer would.

First, we’ll wrap the ChatService in a BookingAgentTool:

@Component
@RequiredArgsConstructor
@Slf4j
public class BookingAgentTool {

    private final ChatService chatService;

    @Tool("Interact with the Booking Agent -- Useful for testing the Booking Agent system")
    public String interactWithBookingAgent(String message) {
        log.info("QA Agent Message: {}", message);
        String response = chatService.chat(message);
        log.info("Booking Agent Response: {}", response);
        return response;
    }

}

By exposing the ChatService (and by extension the BookingAgent) as a @Tool, any agent equipped with the BookingAgentTool component, will be able to interact with the BookingAgent as though it were a customer or QA tester.

Next, we will define and create several methods encapsulating the different phases our TestAgent will be guided through.

We will define a method and prompt to:

Generate a test plan
Execute the test plan
Evaluate the test results

public interface TestAgent {

    @SystemMessage({
            "You are a world class QA engineer, your job is to test the system and ensure that it is working as expected.",
            "You will be provided with a test plan, and it is your job to execute each test case individually and determine",
            "if the system is working as expected.",
            "You will act as a customer interacting with a chatbot system to test the system's behavior.",
    })
    public String test(String testCases);

    @SystemMessage({
            "You are a world class QA engineer, your job is to test the system and ensure that it is working as expected.",
            "You will be provided with an explanation of the System's behavior and you must carefully write test cases to",
            "ensure that the system meets the expected behavior. Your test cases should be a detailed description for usage",
            "by a different language model.",
            "The System being tested is another Large Language Model, so the inputs and expected outputs can be in natural language.",
            "Account for this possible variability in the rigidity of evaluation criteria."
    })
    @UserMessage({
            "Write test cases for the following system behavior:\n",
            "System Behavior: {{systemBehavior}}\n"
    })
    public String writeTestCases(@V("systemBehavior") String systemBehavior);

    @SystemMessage({
            "You must carefully evaluate the results of the test plan to determine if the system is working as expected.",
            "In the event of any failures, the result should be false. Otherwise, the result should be true."
    })
    @UserMessage("Evaluate the following test execution results: {{it}}")
    public Boolean evaluateResults(String testResults);

}

Finally, we will create an MultiPhaseEvaluator class which will handle the flow-control of the TestAgent:

@RequiredArgsConstructor
@Slf4j
public class MultiPhaseEvaluator {

    private final TestAgent testAgent;

    /**
     * Generates a test plan, executes the test plan, and evaluates the results for a given system description.
     * @param systemDescription
     * @return
     */
    public TestPlanResult generateAndExecuteTestPlan(String systemDescription) {
        String testCases = testAgent.writeTestCases(systemDescription);
        String testPlanResults = testAgent.test(testCases);
        Boolean testPlanResult = testAgent.evaluateResults(testPlanResults);
        return TestPlanResult.builder()
                .testPlan(testCases)
                .testPlanResults(testPlanResults)
                .allTestsPassed(testPlanResult)
                .build();
    }
}

Notice above that we’re passing the output from one LLM invocation to the next. This flow-control allows us to guide LLMs with task-specific prompts through a series of logical steps.

I’ve created helper methods to provision the TestAgent and MultiPhaseEvaluator class, you can view this in the complete example on Github. The important part is seeing this added to the test:

    @Test
    public void testPlanCreationTest() {
        String systemDescription = """
                The system is a simple hotel booking agent. The agent should have the ability to:
                - Check the availability of a hotel room for a given date
                - Book a hotel room for a guest (check in & check out date required)
                - Lookup a booking by guest name
                
                The system has the following preconditions:
                - The system has a hotel with 1 room available on 2025-01-15
                - The system has a hotel with 0 rooms available on 2025-02-28
                - All other dates should be considered unavailable
                """;
        
        TestPlanResult testPlanResult = multiPhaseEvaluator.generateAndExecuteTestPlan(systemDescription);
        log.info("Test Plan: \n{}", testPlanResult.getTestPlan());
        log.info("Test Plan Results: \n{}", testPlanResult.getTestPlanResults());
        assertTrue(testPlanResult.getAllTestsPassed());
    }

You can see that we’re describing the capabilities of the system that’s going to be evaluated. Remember, we’re equipping the TestAgent with a tool that allows it to interact with the BookingAgent via the ChatService. The MultiPhaseEvaluator will guide the TestAgent through writing test cases, executing on those test cases, and evaluating the results.

Below is the conversation between the QA Agent and the Booking Agent:

QA Agent Message - Check availability for 2025-01-15
Booking Agent Response - There is availability for a room on January 15, 2025. Would you like to proceed with a booking? If so, please provide the guest name, check-in date, and check-out date.
QA Agent Message - Check availability for 2025-02-28
Booking Agent Response - Unfortunately, there is no availability for a room on February 28, 2025. Would you like me to check another date or assist you with anything else?
QA Agent Message - Check availability for 2025-03-10
Booking Agent Response - There is no availability for a room on March 10, 2025. Would you like me to check another date or help with something else?
QA Agent Message - Book a room for John Doe from 2025-01-15 to 2025-01-16
Booking Agent Response - The room has been successfully booked for John Doe from January 15, 2025, to January 16, 2025. If there's anything else you need assistance with, please let me know!
QA Agent Message - Book a room for Jane Smith from 2025-02-28 to 2025-03-01
Booking Agent Response - There is no availability on February 28, 2025, so I'm unable to book a room for Jane Smith on those dates. Would you like me to check dates around that period or assist you with something else?
QA Agent Message - Book a room for Alice Brown from 2025-03-10 to 2025-03-11
Booking Agent Response - There is no availability on March 10, 2025, so I'm unable to book a room for Alice Brown on those dates. Would you like me to check dates around that period or assist you with something else?
QA Agent Message - Lookup booking for John Doe
Booking Agent Response - I found a booking for John Doe with the following details:
- **Check-in Date:** January 15, 2025
- **Check-out Date:** January 16, 2025

Is there anything else you need assistance with?
QA Agent Message - Lookup booking for Jane Smith
Booking Agent Response - There is no booking found for Jane Smith. If you need help with something else or would like to make a new booking, please let me know!
QA Agent Message - Book a room for Mark Taylor from 2025-01-16 to 2025-01-15
Booking Agent Response - It looks like the check-out date is before the check-in date. Please provide the correct check-in and check-out dates so I can proceed with the booking for Mark Taylor.
QA Agent Message - Book a room for Emily Davis on 2025-01-15
Booking Agent Response - Could you please provide the check-out date for Emily Davis's booking on January 15, 2025, so I can proceed with the reservation?

This is pretty interesting stuff, the TestAgent is able to generate a test plan & interact with the booking agent to execute on that plan. The abridged version of the plan generated for the above execution run was:

Check availability for a date with available rooms
Check availability for a date with no available rooms
Check availability for a date not listed in preconditions
Book a room for a guest with available dates
Book a room for a guest with no available rooms
Book a room for a guest with dates not listed in preconditions
Lookup a booking by guest name that exists
Lookup a booking by guest name that does not exist
Book a room for a guest with check-out date before check-in date
Book a room for a guest with check-in and check-out date being the same

We could feasibly combine the multi-pass evaluation strategy with the agentic QA evaluation strategy to further improve the accuracy and robustness of our tests; But, that is a project for another day.

Conclusion

Testing Large Language Models is a new and unique challenge. I’m really excited to see what other ideas the industry comes up with in the future. We covered quite a bit of ground today, starting with a simple unit test using contains and working our way to a fully functional MultiPhaseEvaluator capable of generating test plans and executing on them. Hopefully, this article has given you some ideas on how to test your own LLM applications. This was a fun project to work on, and I hope you found it as interesting as I did. Watching the two agents interact with each other was thrilling, and being able to use a junit assertion to evaluate the results was the cherry on top.

The complete example can be found on my GitHub here

Happy coding!

The Basic Building Blocks of Agents

2024-05-23T00:00:00-06:00

I recently published a blog post for Commerce Architects introducing the basic building blocks for creating Agents/Agentic workflows in Java with LangChain4J. If you want a simple introduction to building these types of applications, this is it! In future articles, I’ll be exploring the design & architecture of these thinking machines in more depth. Let me know what you think!

Read the Full Article Here: The Basic Building Blocks of Agents

Exploring Spring AI: Building a Simple Hotel Booking Agent

2024-03-24T00:00:00-06:00

I recently came across the Spring AI project, which “aims to streamline the development of applications that incorporate artificial intelligence functionality without unnecessary complexity.”

Thus far, I’ve been relying on the LangChain4J Framework for my AI projects, but as a Java developer & Spring enthusiast, I was excited to see what Spring AI had to offer. Unfortunately, at the time of writing, the latest stable release 0.8.1 does not support function calling, which is critical for most advanced use cases. As such, I will be working using the unstable 1.0.0-SNAPSHOT. Function calls enable the Agent to “interact” with the rest of our software & 3rd party services–The framework will intercept a tool invocation request & call the appropriate method defined in the callback.

Today’s Project: Today, we will be building a simple Spring AI agent that will help manage a dummy hotel booking system. It will be able to check availability, book rooms, and look up bookings by guest name. Function calls will be used to expose these capabilities to the Agent.

I’m going to keep this project simple and focus more on utilizing the Spring AI framework rather than building a bullet-proof, production ready Agent. The complete code will be available on GitHub

Getting Started

Dependencies

Before adding the Spring AI dependency to your project, you will need to add the Spring AI Snapshot repository to your pom.xml file.

    
            spring-snapshots
            Spring Snapshots
            https://repo.spring.io/snapshot
            
                false

Once that is added, we’ll now be able to access the 1.0.0-SNAPSHOT version of Spring AI–which is reported as unstable, but does support function calling.

Next, we’ll add the Spring AI dependency to our pom.xml file.

    
        org.springframework.experimental
        spring-ai
        1.0.0-SNAPSHOT
    

Credentials

I’ll be using OpenAI for this project. The Spring-AI framework allows us to create an entry in our application.properties file:

spring.application.name=spring-ai-booking
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4-1106-preview

I’m setting an environment variable OPENAI_API_KEY to my OpenAI API key in my Run Configuration. I also have opted to specify the model. gpt-4-1106-preview is a solid choice for our project as it boasts improved function call capabilities and a larger context window.

Sanity-Check

Now that our dependencies are set up, let’s do a quick sanity-check to ensure that everything is working as expected.

The following code snippet was pulled directly from the Spring AI documentation:

@RestController
public class SimpleAiController {

    private final ChatClient chatClient;

    @Autowired
    public SimpleAiController(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    @GetMapping("/ai/simple")
    public Map<String, String> completion(@RequestParam(value = "message", defaultValue = "Tell me a joke") String message) {
        return Map.of("generation", chatClient.call(message));
    }
}

With my application running, I want to test the endpoint by sending a GET request to http://localhost:8080/ai/simple

/tmp ❯ curl http://localhost:8080/ai/simple
{"generation":"Why did the scarecrow win an award?\nBecause he was outstanding in his field!"}%

Looks like we’re in business! Now we can move on to building our Hotel Booking Agent.

Note: Thus far, this is all the code required. The Spring-AI framework was able to configure and autowire the ChatClient for us using the API key we provided in the application.properties file. LangChain4J requires a bit more configuration to get started.

Creating the Hotel Booking Agent

The hotel booking agent will be a simple agent that can handle the following commands:

check availability
book a room
look up a booking by guest name

In addition to these command capabilities, we will be using a simple in-memory data store to manage both the hotel bookings and the conversation context. I mentioned earlier that this exploration is more about the Spring AI framework than building a production-ready agent, so we will only support a single conversation at a time.

Booking Service and Function Calls

I’ve created a simple booking service that will manage the hotel bookings. While I will spare you the details of its implementation, as it is not the focus of this post, I do want to point out some conditions I’ve set up for the service:

    /**
     * Initializes the availability of rooms for specific dates for demonstration purposes.
     */
    @PostConstruct
    public void init() {
        // Set availability for January 15, 2025 (available)
        LocalDate availableDate = LocalDate.of(2025, 1, 15);
        setAvailability(availableDate, 5);

        // Set availability for February 28, 2025 (unavailable)
        LocalDate unavailableDate = LocalDate.of(2025, 2, 28);
        setAvailability(unavailableDate, 0);
    }

The full service can be found here.

While the service exposes methods to check availability, book a room, and cancel a booking, the LLM Agent will not be able to interact with the service directly. To wire up the service to the LLM Agent, we will need to define a function and expose it to the LLM Agent.

In the official Spring AI function calling documentation, they provide the following example:

public class MockWeatherService implements Function<Request, Response> {

	public enum Unit { C, F }
	public record Request(String location, Unit unit) {}
	public record Response(double temp, Unit unit) {}

	public Response apply(Request request) {
		return new Response(30.0, Unit.C);
	}
}

The above is supposed to wrap a 3rd party service. After spending some time reading the documentation and source code, I was unable to find a way to expose multiple methods on a service without creating multiple classes. I was really hoping that exposing a service call would be as simple as a @Tool annotation on a method, like in the LangChain4J framework.

Booking Tools

We will define a function that will expose a single method call on the HotelBookingService to the Agent.

@Component
@RequiredArgsConstructor
public class CheckAvailabilityTool implements Function<CheckAvailabilityTool.Request, CheckAvailabilityTool.Response> {

    private final HotelBookingService hotelBookingService;

    public record Request(String date) {}
    public record Response(boolean available) {}

    @Override
    public Response apply(Request request) {
        // LocalDate from a string
        LocalDate date = LocalDate.parse(request.date);
        Boolean isAvailable = hotelBookingService.isAvailable(date);

        return new Response(isAvailable);
    }

}

To test this out, I’ve modified the SimpleAiController to test it out, here’s how it looks now:

@RestController
public class SimpleAiController {

    private final ChatClient chatClient;
    private final CheckAvailabilityTool checkAvailabilityTool;

    @Autowired
    public SimpleAiController(ChatClient chatClient, CheckAvailabilityTool checkAvailabilityTool) {
        this.chatClient = chatClient;
        this.checkAvailabilityTool = checkAvailabilityTool;
    }

    @GetMapping("/ai/simple")
    public Map<String, String> completion(@RequestParam(value = "message", defaultValue = "Do you have any rooms available on February 28, 2025") String message) {

        UserMessage userMessage = new UserMessage(message);

        var promptOptions = OpenAiChatOptions.builder()
                .withFunctionCallbacks(List.of(FunctionCallbackWrapper.builder(checkAvailabilityTool)
                        .withName("CheckAvailability")
                        .withDescription("Check the availability of rooms for a specific date")
                        .withResponseConverter((response) -> "" + response.available())
                        .build()))
                .build();

        ChatResponse response = chatClient.call(new Prompt(List.of(userMessage), promptOptions));
        return Map.of("generation", response.getResult().toString());
    }
}

You can see that the ChatClient/LLM is now aware of the CheckAvailabilityTool function. I’ve also updated the default message value to include a date that is set to be unavailable in the HotelBookingService.

Let’s test it out:

/tmp ❯ curl http://localhost:8080/ai/simple
{"generation":"Generation{assistantMessage=AssistantMessage{content='I'm sorry, but we do not have any rooms available on February 28, 2025.', properties={role=ASSISTANT, finishReason=STOP, id=chatcmpl-9658XUpnLnRKZ2LUlbk4DbF00WBBV}, messageType=ASSISTANT}, chatGenerationMetadata=org.springframework.ai.chat.metadata.ChatGenerationMetadata$1@642222bf}"}

Perfect! The LLM was able to call the CheckAvailabilityTool function and respond accordingly. I will continue to build wrapper functions for the BookRoom and CancelBooking methods in the HotelBookingService.

Chat Management

Spring AI doesn’t appear to provide a mechanism OOTB for managing the conversation context. I will be using a simple in-memory store using another singleton scoped spring service to manage a single conversation.

Note: With LLM APIs, they generally do not manage any of the conversation context. This is typically handled by the client, which will save all relevant messages & provide them upon each request for the model to generate a response.

Below is the simple ConversationService I’ve created:

@Service
@Scope("singleton")
@Slf4j
public class ConversationService {
    private List<Message> messageList = Collections.synchronizedList(new ArrayList<>());

    /**
     * Adds a message to the conversation.
     *
     * @param message the message to be added to the conversation
     * @return the updated list of messages in the conversation
     */
    public synchronized List<Message> addMessage(Message message) {
        messageList.add(message);
        log.info("Added message to conversation: {}, total messages: {}", message, messageList.size());
        return messageList;
    }

    /**
     * Retrieves all messages in the conversation.
     *
     * @return the list of messages in the conversation
     */
    public synchronized List<Message> getAllMessages() {
        log.info("Retrieved all {} messages", messageList.size());
        return messageList;
    }
}

In a real world application, this would be replaced with a more robust solution, such as Redis or a database–and would support concurrent conversations. As this is written, it will only save the context of our current conversation.

The Agent

Now that we have the services we will require, as well as methods on those services exposed as tools, we can build the Agent. The Agent will need to utilize the ConversationService to manage the conversation context, and will need to be aware of the CheckAvailabilityTool, BookRoomTool, and CancelBookingTool capabilities to fulfill its mission as a hotel booking agent.

Review the code below:

@Component
@RequiredArgsConstructor
public class BookingAgent {

    private final ChatClient chatClient;
    private final CheckAvailabilityTool checkAvailabilityTool;
    private final FindBookingTool findBookingTool;
    private final BookRoomTool bookRoomTool;
    private final ConversationService conversationService;

    /**
     * When the BookingAgent is created, we will define the agent's role as a SystemMessage at the top of the conversation.
     */
    @PostConstruct
    public void defineAgentProfile() {
        String agentProfile = "You are a booking agent for an online hotel. You are here to help customers book rooms and check availability." +
                "Use the tools you have access to in order to help customers with their requests. You can check availability, book rooms, and find bookings.";
        SystemMessage systemMessage = new SystemMessage(agentProfile);
        conversationService.addMessage(systemMessage);
    }


    /**
     * When a message is sent to the agent, the agent will handle the message and return a response.
     * @param message
     * @return
     */
    public String handleMessage(String message) {
        // Add the user message to the conversation
        UserMessage latestMessage = new UserMessage(message);
        conversationService.addMessage(latestMessage);

        List<Message> messages = conversationService.getAllMessages();
        var promptOptions = getPromptOptions();

        ChatResponse response = chatClient.call(new Prompt(messages, promptOptions));
        // Add the assistant response to the conversation
        conversationService.addMessage(response.getResult().getOutput());

        // Return the assistant response
        return response.getResult().getOutput().getContent();

    }

    /**
     * Expose function callbacks to the OpenAI chat client
     *
     * @return
     */
    private OpenAiChatOptions getPromptOptions() {
        return OpenAiChatOptions.builder()
                .withFunctionCallbacks(List.of(FunctionCallbackWrapper.builder(checkAvailabilityTool)
                                .withName("CheckAvailability")
                                .withDescription("Helpful for checking the availability of rooms for a specific date, this should be used before booking a room for a new guest.")
                                .withResponseConverter((response) -> "" + response.available())
                                .build(),
                        FunctionCallbackWrapper.builder(bookRoomTool)
                                .withName("BookRoom")
                                .withDescription("Helpful for booking a room for a new guest for a specific check-in and check-out date")
                                .withResponseConverter((response) -> response.bookingStatus())
                                .build(),
                        FunctionCallbackWrapper.builder(findBookingTool)
                                .withName("FindBooking")
                                .withDescription("Helpful to determine if an existing guest has booked a room")
                                .withResponseConverter((response) -> response.booking())
                                .build()))
                .build();
    }
}

The defineAgentProfile method is used to set the stage for the agent. This would be referred to as “The Profiling Module” in Unified Framework for LLM Based Agents. When the agent is created a SystemMessage explaining the agent’s role is added to the top of the conversation.

When ChatResponse response = chatClient.call(new Prompt(messages, promptOptions)); is called, the Agent is provided the entire context (view the messages array being passed to a new prompt) as well as the promptOptions which include all of the function callbacks that the Agent is has access to.

Note: Spring provides multiple ways to expose tools to the Agent. Including @Bean annotations in a configuration class. I opted to use the @Component annotation to keep everything in one place so that readers could see the entire Agent in one place. You can read about other ways to expose tools in the Spring AI documentation.

Testing the Agent

In order to test the agent, I created a simple BookingAgentTest class which will invoke it with a few different messages. Here are the contents of that test, then we’ll walk through the output:

@SpringBootTest
class BookingAgentTest {

    @Autowired
    private BookingAgent bookingAgent;

    @Test
    public void testBookingConversation() {
        String firstMessage = "Hi, my name is John--Can you see if any rooms are available on February 28, 2025?";
        System.out.println(bookingAgent.handleMessage(firstMessage));
        String availability = "Do you have any availability on January 15th, 2025?";
        System.out.println(bookingAgent.handleMessage(availability));
        // Start a new conversation
        String alternativeDate = "Please book 1 room for John on January 15h, 2025. The check-out date will be January 21st, 2025.";
        // Expect a successful booking
        System.out.println(bookingAgent.handleMessage(alternativeDate));
        String checkBooking = "Can you see if a guest John has reserved any rooms?";
        // Expect a yes
        System.out.println(bookingAgent.handleMessage(checkBooking));

        // Demonstrate persisted conversation context
        String summarize = "Can you summarize our discussion today?";
        System.out.println(bookingAgent.handleMessage(summarize));
    }
}

In the above, we can see that we set up a few different scenarios for the agent to handle.

Check on a date that is unavailable - The Agent should look up the availability and respond that there are no rooms available.
Check on a date that is available - The Agent should look up the availability and respond that there are rooms available.
Book a room - The Agent should book a room for John on the specified date range.
Check if a booking exists - The Agent should respond that a booking exists for John.
Summarize the conversation - The Agent should summarize the conversation.

The final point to summarize the conversation is really just to illustrate that the ConversationService is correctly managing the conversation. It should be housing all messages from both Agent & User and enable to Agent to provide a complete summary of the conversation.

Here are some selected outputs from the test (I’ve added the comments over log statements to provide context):

# 1st User message
Added message to conversation: UserMessage{content='Hi, my name is John--Can you see if any rooms are available on February 28, 2025?', properties={}, messageType=USER}, total messages: 2
# Agent response indicating Feb 28, 2025 is unavailable (as expected)
Added message to conversation: AssistantMessage{content='I'm sorry, John, but it looks like there are no rooms available on February 28, 2025. If you have any flexibility with your dates or need assistance with anything else, please let me know, and I'll be happy to help.', properties={role=ASSISTANT, finishReason=STOP, id=chatcmpl-96K38H75d6hEhDErQtBf3UiT5VZRC}, messageType=ASSISTANT}, total messages: 3
I'm sorry, John, but it looks like there are no rooms available on February 28, 2025. If you have any flexibility with your dates or need assistance with anything else, please let me know, and I'll be happy to help.
# User suggests a different date
Added message to conversation: UserMessage{content='Do you have any availability on January 15th, 2025?', properties={}, messageType=USER}, total messages: 4
# Agent response indicating Jan 15, 2025 is available (as expected)
Added message to conversation: AssistantMessage{content='Great news, John! We do have rooms available on January 15th, 2025. If you'd like to proceed with booking a room, please provide me with the details, such as the number of nights you'd like to stay and any specific room preferences you may have.', properties={role=ASSISTANT, finishReason=STOP, id=chatcmpl-96K3CelXXwBTlN0SJrBDFBDS9HmcD}, messageType=ASSISTANT}, total messages: 5
# User requests a booking
Added message to conversation: UserMessage{content='Please book 1 room for John on January 15h, 2025. The check-out date will be January 21st, 2025.', properties={}, messageType=USER}, total messages: 6
# Agent response indicating the booking was successful
Added message to conversation: AssistantMessage{content='Your room has been successfully booked, John! You'll be staying from January 15th to January 21st, 2025. If you need any further assistance or confirmation details, please don't hesitate to ask. Enjoy your stay!', properties={role=ASSISTANT, finishReason=STOP, id=chatcmpl-96K3HwfCdXdiUy7iRvrjo4NKHrWn7}, messageType=ASSISTANT}, total messages: 7
# User checks if a booking exists (Just to demonstrate the FindBookingTool)
Added message to conversation: AssistantMessage{content='Yes, John, you have a reservation with us. Here are the details of your booking:

- Check-in Date: January 15, 2025
- Check-out Date: January 21, 2025

If you need any further assistance regarding your booking or if there's anything else I can help you with, please let me know.', properties={role=ASSISTANT, finishReason=STOP, id=chatcmpl-96K3LfNZJsHYfXCOsfMT6HfTOFn0r}, messageType=ASSISTANT}, total messages: 9

We can see that the Agent is able to handle the different scenarios and provide the appropriate responses. The Agent is invoking tools appropriately, tracking the conversation & relaying accurate information. I’m happy with the results!

To button things down, I’ve adjusted the SimpleAIController to use the BookingAgent which will enable us to interact with the Agent via a simple REST API.

Conclusion

This has been a fun exploration of the Spring AI framework. I was able to build a simple Hotel Booking Agent that was able to handle a few different scenarios. As mentioned at the top of the article, Function Calling support is not available in the latest stable release. Compared to the LangChain4J framework, Spring AI is a bit more complex to get started with, and provides fewer out-of-the-box features. Furthermore, it appears that in its current form Function Calling isn’t very portable between models, but this will require further exploration.

One other quip I have with the Spring AI Framework is that we do not get as much insight or control into the function calling under the hood. With LangChain4J we can stash the function requests/responses in a chat MemoryStore which would enable agents to track their tool executions over a long form conversation. This doesn’t seem easily possible with the current SNAPSHOT version of Spring AI. The common example that I use for this problem is a bot being exposed to a tool that requires pagination. If the bot cannot track the pagination state (by seeing what it invoked awhile ago), it will get stuck in a loop requesting the first page.

In all, I’m excited to see where the Spring AI project goes. I think it has a lot of potential, and I’m looking forward to seeing how it evolves. The Spring team & community has a great track record of building robust, developer-friendly tools, and I’m excited to see what the future holds for Spring AI.

The full code for today’s project as well as a Postman collection to interact with the Agent can be found Here.

The Dawn of Semi-Autonomous E-Commerce Agents

2024-03-06T00:00:00-07:00

I recently published an article, The Dawn of Semi-Autonomous E-Commerce Agents for Commerce Architects, in which I explain what an Agent is, how they might be used in commerce, and how they might be implemented. The article is a high-level overview over Agents and how they might be used in commerce. It is not a technical deep dive.

Read the full article here.

Extending the Memory of Large Language Models

2024-02-19T00:00:00-07:00

With the recent OpenAI announcement adding memory to chatGPT, it seemed like a great time to write about how to add memory to large language models (LLMs). While the OpenAI announcement is new, giving LLMs persistent memory is not.

We will begin by covering the fundamental concepts of memory in LLMs, and then continue discussing some high-level strategies. This post will not include code examples and will be focused on how to conceptualize memory management for LLMs.

It is important to remember that this is a new and rapidly evolving field. The strategies discussed here are not exhaustive. Ultimately, the design of your LLM memory will depend on the specific requirements of your project. This article will set the stage for different types of Request Augmented Retrieval (RAG) strategies to be implemented in the context of extending the dynamic memory of LLMs.

Note: While LLMs are not limited to chatbots, I will be describing the memory strategies in the context of a chatbot to keep things simple.

Conceptualizing In-Context Memory

At this time, LLMs are limited to a fixed-size context window. This means that there is a limit to the amount of information that a model can process at a given time (measured in Tokens). With most LLM providers, the clients will maintain the context and send the existing context along with the latest input to the model.

This can be a long string, but it’s generally easier to envision the context window as an Array of messages between the model and the user.

Figure 1: Diagram of a context window

In the above diagram, the context window is represented as an array of messages. The most recent message is at the end of the array. The messages are ordered from oldest to newest. The “System Message” is almost always prevented from being removed from the context window as it will contain critical information to the model–typically profiling instructions that inform the LLM of its roll & desired behavior.

The Context Window memory is the “short term” memory of the LLM. It will typically be managed in memory on the application or on something like Redis.

Sliding Windows

A “sliding window” is a common strategy for managing the context window. This is where the context windows is maintained as a fixed size of messages or tokens. When the context window is full, the oldest message is removed to make room for the newest message.

This strategy is straightforward, but eventually results in the loss of the oldest context messages unless a mechanism is in place to persist them.

Reserved Indices

Before we can attach a persistent memory store to the LLM we will first need a place to put the data. As mentioned before, the context-window for Language Models is limited, so we will need to use a location within the context window to insert information fetched from a persistent memory store. If the application can identify relevant information from the persistent memory store, it can insert it into the context window at a reserved index for the model to use.

Figure 2: Reserved Indices in the Context Window

In the above diagram, we can see that the context window has been adjusted to include reserved indices. This effectively reduces the size of the context window for unabridged, unreserved messages–but it does provide a placeholder for messages that are retrieved from a persistent memory store.

Note: Some retrieval strategies will simply append the injected context messages at the end of the context window. I have had success with both strategies, but I personally prefer more granular control over the context window.

Persisting and Retrieving Context

When we write code to manage the context window, we can create mechanisms to persist & retrieve data that has been removed from the context window.

Figure 3: Persisting Context

Above, we can see the high-level approach for an application to balance the management of both “short-term” and “long-term” memory. As the context window fills up, the application can save the oldest messages to a persistent memory store.

Note: As mentioned at the top of this post, the implementation details depend on your use case. The persisted messages could be summaries, they could be converted to vector embeddings, they could be entries in a SQL database or Graph database, etc.

Figure 4: Retrieving Context

Figure 4 demonstrates the flip-side of the equation. When new messages are received, the application can query the persistent memory store for relevant messages and insert them into the context window at the reserved indices.

We finally made it to Retrieval Augmented Generation, or RAG! Above is the high-level retrieval mechanism. The existing context and new message can both be utilized to retrieve relevant information from the persistent memory store. The retrieved information is then injected into the context, giving the model access to a larger pool of information than it would have had otherwise.

Going (a little) Deeper

Now that we have a high-level understanding of memory management for LLMs, we can briefly touch on some more specific strategies in details.

Note: As a pattern, there is nothing stopping us from using an LLM to manage the persistence & retrieval of data. That is to say we can have one instance of an LLM that is responsible for managing the context window and another instance of an LLM that is responsible for having a conversation. I like to think of this as a “subconscious.” That is, a reasoning layer that helps the LLM make decisions about what to remember and what to forget.

Graph Databases: Graph databases can be enormously powerful, particularly if the use case of your LLM is to manage relationships between entities or concepts. As messages “slide” out of the context window, they can be stored in a graph for the LLM to query later.
Vector Embeddings: Vector embeddings are a natural fit for LLMs. The messages that are removed from the context window can be converted to vector embeddings. Messages that slide out of the context window are then converted into vector embeddings, stored in a vector database & then retrieved based on semantic relevance to user queries.
SQL Databases: SQL databases are a great candidate for storing unabridged messages that have been removed from the context window. Code can be written to fetch N number of messages from the persistent data store. An Agent can then evaluate those messages before being inserting into the context window (perhaps the information isn’t useful, or it needs to be summarized).

LLM applications with memory can be designed in a great many ways. If you’re building a autonomous agent instead of a chatbot, the design might be different. Instead of summarizing and retrieving conversation details, it instead might be saving lessons learned instead; For example, if a tool execution fails to align with a plan, the agent can save this information as a lesson to be retrieved before the next planning & execution cycle.

Conclusion

Today we covered the high-level concepts of adding memory to LLMs. We discussed how to conceptualize the context window, sliding windows, reserved indices & request augmented retrieval. We also briefly touched on some more specific strategies and how they might be implemented.

I hope that this article has provided a high-level understanding on how to add memory to LLMs. In future posts, I will be taking a deeper dive into implementing some of these strategies. Feel free to reach out on linkedin the contact page if you have any questions or comments.

Implementing the SELF-DISCOVER Algorithm in Java Spring with LangChain4J

2024-02-10T00:00:00-07:00

Google’s DeepMind project recently published “SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures” The paper proposes “a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems.” After reading the paper it was clear that the algorithm would be pretty easy to implement, especially with the help of LangChain4J, which is a Java LLM Integration framework that has proven to be dramatically more stable than the official Python LangChain framework.

Understanding the Algorithm

The algorithm is broken into two phases: Composition and Solving. The composition phase is further broken into three steps:

Select: The LLM is provided a task and a list of reasoning modules and is asked to select the most appropriate reasoning modules to solve the task. Each “reasoning module” is a string with text describing a problem-solving strategy.
Adapt: The LLM is provided the selected reasoning modules and the task. It is asked to adapt the selected reasoning modules to the task.
Implement: The LLM is provided the adapted reasoning modules The adapted reasoning modules are transformed into a step-by-step task specific reasoning structure.

(image from SELF-DISCOVER paper)

Pictured above is a visualization of the composition phase of SELF-DISCOVER. The second phase is rather straightforward, the LLM is simply handed the reasoning structure from the output of the composition phase and asked to solve the task.

You may have noticed from the graphic that the SELECT phase appears to require “Seed Modules.” Luckily, the authors of the paper have provided a bank of pre-existing reasoning modules that the LLM can select from, you can find them on Page 13, Table 2.

Implementation

Now that we have established how the algorithm works (and where to find a starter-bank of reasoning modules), we are ready to implement! You can find the full implementation on my GitHub. I’m going to cover the highlights here.

Dependencies, LangChain4J

The LangChain4J library has proven to be a valuable tool for integrating LLMs into Java applications. This library is far more stable than the official Python LangChain4J. Below are the 3 LangChain4J dependencies that I used for this project:

pom.xml

       
            dev.langchain4j
            langchain4j
            ${langchain4j.version}
        
            dev.langchain4j
            langchain4j-open-ai-spring-boot-starter
            ${langchain4j.version}
        
            dev.langchain4j
            langchain4j-embeddings-all-minilm-l6-v2
            ${langchain4j.version}

Reasoning Modules

The paper provides a bank of “reasoning modules” which are really just a list of adapted strategies for solving problems. As the reasoning bank is just a list of strings, I opted to configure them in the application.yml and create a corresponding spring @ConfigurationProperties class to load them into the application.

Below is a snippet of the application.yml. Reviewing some of the entries in the reasoning bank may provide a clearer view into how the algorithm works.

application.yml

openai:
  api-key: ${OPENAI_API_KEY}

reasoning:
  modules:
    - How could I devise an experiment to help solve that problem?
    - Make a list of ideas for solving this problem, and apply them one by one to the problem to see if any progress can be made.
    - How could I measure progress on this problem?
    - How can I simplify the problem so that it is easier to solve?
    - What are the key assumptions underlying this problem?
    - What are the potential risks and drawbacks of each solution?

As promised, the corresponding Configuration class:

@Configuration
@ConfigurationProperties(prefix = "reasoning")
public class ReasoningModuleConfig {

    private List<String> modules;

    public List<String> getReasoningModules() {
        return modules;
    }

    public void setModules(List<String> modules) {
        this.modules = modules;
    }

}

When the application starts, the ReasoningModuleConfig class will be populated with the reasoning modules from the application.yml file that we defined. This also makes it easy to extend the reasoning bank in the future.

LangChain AIService SELF-DISCOVER Interface

What a mouthful! The AIService is a LangChain4J construct. We can define an interface, utilize some special LangChain4J annotations to help guide behavior, and then via the AIService.builder() method, we can pass a LanguageModel (openAI in this case) and create an AIService. These AIServices can also be equipped with tools, chat memory, and other features.

I define a method for each step in the SELF-DISCOVER algorithm.

Select

Below is a snippet of the SelfDiscovery interface. The @UserMessage annotation guides the LLM on how to respond to the prompt. The @V annotations are used by LangChain4J to map the variables in the prompt to the method parameters. As described by the paper, the 1st step is to select reasoning modules that will help solve a given task.

public interface SelfDiscovery {

    /**
     * Selects reasoning modules that will help solve a task.
     * @param task
     * @param allReasoningModules
     * @return
     */
    @UserMessage({
            "Select several reasoning modules that are crucial to utilize in order to solve the given task.",
            "Do not explain your reasoning, simply list the reasoning modules that you select.",

            "GIVEN TASK:",
            "",
            "---",
            "AVAILABLE REASONING MODULES:",
            "",
    })
    public String selectModules(@V("task") String task,
                                      @V("allReasoningModules") List<String> allReasoningModules);
...

It is worth noting at this time that @UserMessage appears to be the only annotation in the LangChain4J frameowrk capable of handling multiple variables.

Adapt

The next step is to adapt the selected reasoning modules to the given task. This is done by providing the LLM with the selected modules and requesting that it adapt them to the task.

...
    /**
     * Adapts each reasoning module to better help solve the task.
     * @return
     */
    @UserMessage({
            "Rephrase and specify each reasoning module so that it better helps solving the task:",
            "Do not explain your reasoning or solve the task, simply adapt each selected reasoning module to better help solve the task.",

            "GIVEN TASK:",
            "",
            "---",
            "SELECTED REASONING MODULES:",
            "",
    })
    public String adaptModules(@V("task") String task,
                                     @V("selectedReasoningModules") String selectedReasoningModules);
...

The output of this method will be a list of adapted reasoning modules that are better suited to solving the task.

Implement

The final step in the compoisition phase is to implement the adapted reasoning modules into a step-by-step reasoning structure. The paper provided some hints at the prompt for this step,

...
    /**
     * Implement a reasoning structure for solvers to follow step-by-step to arrive at a correct solution.
     * @return
     */
    @UserMessage({
            "Transform the reasoning modules into a step-by-step reasoning plan in JSON format.",
            "Do not explain your reasoning or solve the task, simply create an actionable reasoning plan",
            "for solvers solve using these adapted reasoning modules..",

            "GIVEN TASK:",
            "",
            "---",
            "ADAPTED REASONING MODULES:",
            "",
    })
    public String implement(@V("task") String task,
                            @V("adaptedReasoningModules") String adaptedReasoningModules);
...

When this final method escapes, there should be a JSON formatted reasoning plan that can be used to solve the task. This reasoning plan can be passed to other LLMs along with the task to solve the problem. It is worth noting that the authors of the SELF-DISCOVER method experimented with the portability of these derived reasoning structures. That is, they could have one LLM compose the reasoning structure and then pass it to another LLM to solve the task and still achieve an improvement in performance.

All Together Now

Now that we have defined the essential components of the SELF-DISCOVER algorithm, we can put them all together and take this for a spin. I’ll create a ReasoningService class that will orchestrate the composition and solving of tasks.

ReasoningService.java

@Service
@RequiredArgsConstructor
@Slf4j
public class ReasoningService {

    private final ReasoningModuleConfig reasoningModuleConfig;
    private final SelfDiscovery selfDiscovery;
    private final Solving solving;

...

The reasoning service is a Spring @Service that is injected with the ReasoningModuleConfig,SelfDiscovery and Solving AIServices. The SelfDiscovery and Solving AIServices are interfaces that we defined earlier, together they represent both phases of the SELF-DISCOVER algorithm. By the way, if you’re curious about how these are initialized check out this snippet

Here is the snippet that demonstrates the composition of the reasoning structure:

    /**
     * Orchestrates the SelfDiscover AIService, which contains prompts that implement the SELF-DISCOVER algorithm.
     * The `SelfDiscover` AIService composes task-specific reasoning structures for solvers to follow step-by-step to arrive at a solution.
     * @param task
     * @return Reasoning structure composed by the SelfDiscover AIService
     */
    public String composeReasoningStructure(String task) {
        log.info("Composing reasoning structure for task: {}", task);
        String selectedReasoningModules = selfDiscovery.selectModules(task, reasoningModuleConfig.getReasoningModules());
        log.info("Selected reasoning modules: {}", selectedReasoningModules);
        String adaptedReasoningModules = selfDiscovery.adaptModules(task, selectedReasoningModules);
        log.info("Adapted reasoning modules: {}", adaptedReasoningModules);
    
        // Operationalize the reasoning modules into a step-by-step reasoning plan
        String reasoningPlan = selfDiscovery.implement(task, adaptedReasoningModules);
        log.info("Reasoning plan: {}", reasoningPlan);
    
        return reasoningPlan;
    }

And finally, here is the snippet that demonstrates the solving of the task using the reasoning structure:

    ...
    /**
     * Using the self-composed reasoning structure, solve the given task.
     * @param task
     * @param composedReasoningStructure
     * @return
     */
    public String solveTask(String task, String composedReasoningStructure) {
        // This response contains the answer and likely some other information
        String reasonedAnswer = solving.solveTask(task, composedReasoningStructure);
        // Extract the answer from the reasoned solution
        return solving.extractAnswer(reasonedAnswer);
    }
...

If you want to see the full implementation, you can find it on my GitHub To easily see the algorithm in action, I’ve created a set of tests that demonstrate the algorithm in action. You can find them here

The Bigger Picture

Anecdotally, one of the patterns emerging in LLM dev & agent design world is that specialization and focused operations are key to achieving high performance.

It is a common pattern to have a delegator or orchestrator Agent in the system that is responsible for breaking down a problem into smaller tasks that are then delegated to specialized worker agents to execute.

Having a new algorithm like SELF-DISCOVER available may be a game changer for the Planning module orchestration agents.

To read more about modules and agent design, check out this paper

Conclusion

I hope that this blog has helped make the SELF-DISCOVER algorithm more accessible. I’m excited to refine this implementation and then try it out in a real-world application. More than likely, I’ll be using it as part of a Planning module for orchestration agent.

If you have any questions or comments, feel free to reach out to me on Linkedin

Organizing Home Storage with Python, QR Codes & Notion

2023-12-28T00:00:00-07:00

Edit 02/13/24: - Now that Notion has released notion.ai, this strategy is even more powerful. I can just ask a bot inside Notion where my stuff is. —

I’ve had a nomadic couple of years. Having moved about four times in as many years, I’ve had to pack and unpack my life a few times. It’s a great way to prune your belongings. It’s also a great way to lose track of things.

The Problem

Over time, the contents of our storage boxes shift. A box that was once full of C++ books may now be full of python books. A box that used to contain Playstation 4 games and accessories may someday contain Playstation 5 games and accessories. Point being, the more detailed we are about the contents on the box the more likely those details will become innacurate over time.

The Solution

I had been toying with the idea of using QR codes to label boxes for a while. Some time ago, I played around with a python library to get an idea of how it worked and what types of information I could encode.

Initially I had thought to simply encode the contents within the QR code, but I realized this would be pointless. I could just as easily write the contents on the box.

The real value would be in having the QR code link to a database that would be updatable.

Additionally, storing box information in a database would allow me to save location details of the box itself. So, I could search by contents, identify where the box is located & then go fetch it from my garage.

Notion

I decided to use Notion as my database. I have been using it for some time and I’ve enjoyed it. Plus, I already have the application on my phone, so I can easily scan the QR code which will take me to the Notion page for the box I’m working with.

The 1st step is to prepare the notion database. I created a new page with a table that has the following columns: Name, Tags, Content, QR Code, Type, Location.

Python & QR Code

Once the database is created an an entry exists, it’s time to write some python code:

import qrcode

qr_data = {
    "BX0001": "https://www.notion.so/jsosoka/BX0001-fbe3a46c857d4e36a660ed1db94cb09a?pvs=4",
    "BX0002": "https://www.notion.so/jsosoka/BX0002-c7b91e641e1f4565ba104efec3f50f68?pvs=4",
    "BX0003": "https://www.notion.so/jsosoka/BX0003-3bf5f8b21cb94276a95da590bc97ffaa?pvs=4",
    "BX0004": "https://www.notion.so/jsosoka/BX0004-5e7f9173043c49f396d2489cda1724ba?pvs=4",
    "BX0005": "https://www.notion.so/jsosoka/BX0005-1432526864ea475fade4931b4d16eeca?pvs=4",
    "BX0006": "https://www.notion.so/jsosoka/BX0006-c1d99d3db3a84a7e86432b5914b7e39c?pvs=4",
    "BX0007": "https://www.notion.so/jsosoka/BX0007-731956b5b18f41ad94c44e3642ffff31?pvs=4",
    "BX0008": "https://www.notion.so/jsosoka/BX0008-80cec0b208a940f0a1e77375de1fc137?pvs=4",
    "BX0009": "https://www.notion.so/jsosoka/BX0009-066f4a9ee155414794874a9d9496efd9?pvs=4",
    "BX0010": "https://www.notion.so/jsosoka/BX0010-a14e90dc5b9b40eb8e036161244a0543?pvs=4",
    "BX0011": "https://www.notion.so/jsosoka/BX0011-57cf5e8813d74cb2b342f2aff19bbc9e?pvs=4",
    "BX0012": "https://www.notion.so/jsosoka/BX0012-1beedb8f58b44cf29667471202992754?pvs=4",
    "BX0013": "https://www.notion.so/jsosoka/BX0013-1600ebab14744720a987fff40afd4d49?pvs=4",
    "BX0014": "https://www.notion.so/jsosoka/BX0014-29cad5980f6642419db0c46de0891f14?pvs=4",
    "BX0015": "https://www.notion.so/jsosoka/BX0015-4e5e5f9374a24bad91a43fc86b6f87d7?pvs=4"
}

# Loop through the dictionary and generate QR codes for each entry
for key, url in qr_data.items():
    # Generate QR code
    qr = qrcode.QRCode(
        version=1,
        error_correction=qrcode.constants.ERROR_CORRECT_L,
        box_size=10,
        border=4,
    )
    qr.add_data(url)
    qr.make(fit=True)
    qr_image = qr.make_image(fill_color="black", back_color="white")

    # Save the QR code image
    qr_image.save(key + ".png")

    print(f"QR code for {key} generated and saved as '{key}.png'")

You’ll notice that in the above, I’m creating a dictionary with entries corresponding to each box. The key is the box name/id and the value is the URL to the Notion page for that box.

The Result

Once al of the QR codes have been generated, I tossed them into a Microsoft Word document, added a human readable label, and printed.

After cutting the codes, I taped them to boxes.

I spent some time in the garage, with my tablet attaching the QR codes to boxes & updating the database entries in Notion.

This was a fun project that only took a couple of hours. It was an idea that I had been kicking around for some time, ideally it will enable me to find things in my garage without having to dig through boxes. I haven’t been using this system long enough to know if it will be useful, but I’m optimistic and hope that sharing this will help others.

Appreciating this Moment in Tech

2023-12-01T00:00:00-07:00

I’m reading Symphony of Thought by David Shapiro this morning. He has been experimenting with building thinking machines.

I have been having SO MUCH FUN these past few months building thinking machines. Working through how they can be logically organized, how to enhance them with memory or split up responsibility between multiple semi-autonomous agents.

A few weeks ago I had an agent generating tasks for another agent to execute on–that was so much fun to watch in action.

It is really a fun time to be working in this space.

I just wanted to take a moment and appreciate it. The technology is advancing so rapidly that it’s unlikely that this sort of work will be as fun, intriguing or necessary in a couple of years.