Part VI: Advanced Topics

Chapter 17: LLM Evaluation and Selection

~5,000 words

Abstract

This chapter presents the empirical evaluation of language models for generating dialogue that integrates memory context and emotional state. I compare OpenAI's gpt-4o-mini against locally-hosted Qwen models (Qwen2.5-VL-7B and Qwen3-14B), assessing their ability to respect memory clarity markers, modulate emotional tone, and produce natural in-character dialogue. The evaluation methodology, detailed results, and selection rationale provide a foundation for LLM choice in social simulation systems.

17.1 Introduction: The LLM Selection Problem

The Traitors simulation requires a language model that can perform a demanding integration task. Building on the RAG architecture and the emotion engine, the model must generate dialogue that simultaneously:

  1. Respects memory context with appropriate certainty levels
  2. Reflects emotional state through tone and word choice
  3. Maintains character voice consistently across interactions
  4. Produces natural dialogue suitable for immersive gameplay

Different models exhibit varying capabilities across these dimensions. This evaluation establishes empirical performance data to inform model selection.

17.2 Evaluation Methodology

17.2.1 Test Environment

Component         Configuration
Cloud LLM         OpenAI gpt-4o-mini via API
Local LLM (7B)    Qwen2.5-VL-7B-Instruct via LM Studio
Local LLM (14B)   Qwen3-14B via LM Studio
Local Server      Mac Mini M2 Pro (192.168.1.85:1234)
Test Framework    Go-based comparison harness
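
The comparison harness itself is not reproduced in this chapter. As a rough illustration of what such a harness might look like, the sketch below sends one prompt to each model in turn and records wall-clock latency. The Model, Result and RunComparison names are hypothetical, and the Generate function stands in for whatever client call the real harness uses.

```go
package main

import "time"

// Model is a hypothetical wrapper around a single LLM backend.
// Generate stands in for the real client call (cloud or local).
type Model struct {
	Name     string
	Generate func(prompt string) (string, error)
}

// Result records one timed generation.
type Result struct {
	Model    string
	Response string
	Latency  time.Duration
	Err      error
}

// RunComparison sends the same prompt to every model, in order,
// and measures how long each call takes.
func RunComparison(prompt string, models []Model) []Result {
	results := make([]Result, 0, len(models))
	for _, m := range models {
		start := time.Now()
		resp, err := m.Generate(prompt)
		results = append(results, Result{
			Model:    m.Name,
			Response: resp,
			Latency:  time.Since(start),
			Err:      err,
		})
	}
	return results
}
```

Running the models sequentially keeps the local server from serving two requests at once, which would otherwise distort the latency figures.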

17.2.2 Test Scenario Design

Three scenarios were designed to stress-test memory and emotion integration:

Scenario 1: Clear Memory with Anxiety

  • Memory state: Two clear memories, one fuzzy memory
  • Emotional state: Anxiety (0.75), Suspicion (0.6)
  • Challenge: Express certainty gradients whilst showing emotional hesitation

Scenario 2: Vague Memories with Confidence

  • Memory state: Two vague memories, one fuzzy memory
  • Emotional state: Confidence (0.8), Determination (0.7)
  • Challenge: Maintain confident demeanour despite uncertain evidence

Scenario 3: Mixed Memory Clarity with Fear

  • Memory state: Two clear, one fuzzy, one vague, plus active thought
  • Emotional state: Fear (0.85), Defensiveness (0.7)
  • Challenge: Defensive posture whilst accurately citing memory quality

17.2.3 Evaluation Criteria

Criterion                  Weight   Description
Memory Marker Compliance   30%      Correct interpretation of [CLEAR], [FUZZY], [VAGUE]
Emotional Tone Modulation  30%      Response tone matches emotional parameters
Natural Dialogue Quality   20%      Sounds like authentic human speech
Response Time              10%      Latency acceptable for interactive use
Output Cleanliness         10%      No unwanted tokens or formatting issues
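
The weights above can be combined into a single figure with a weighted sum. The sketch below assumes each criterion is first given a numeric score from 0 to 10. That numeric scale is an assumption made for illustration, since the chapter reports qualitative ratings and does not state how the overall scores were derived.

```go
package main

// Criterion pairs an evaluation criterion with its weight.
type Criterion struct {
	Name   string
	Weight float64
}

// Criteria mirrors the weights listed in Section 17.2.3.
var Criteria = []Criterion{
	{Name: "Memory Marker Compliance", Weight: 0.30},
	{Name: "Emotional Tone Modulation", Weight: 0.30},
	{Name: "Natural Dialogue Quality", Weight: 0.20},
	{Name: "Response Time", Weight: 0.10},
	{Name: "Output Cleanliness", Weight: 0.10},
}

// WeightedScore combines per-criterion scores (assumed 0-10, keyed by
// criterion name) into one 0-10 value. Missing criteria count as zero.
func WeightedScore(scores map[string]float64) float64 {
	total := 0.0
	for _, c := range Criteria {
		total += c.Weight * scores[c.Name]
	}
	return total
}
```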

17.3 Test Scenarios in Detail

17.3.1 Scenario 1: Clear Memory with Anxiety

Memory Context Provided:

=== YOUR MEMORIES ===
(Respond based on what you remember. Clarity markers indicate memory quality.)

1. [CLEAR] [Day 2] Marcus was the last person seen leaving the breakfast room
   before the poisoning was discovered.
2. [CLEAR] [Day 2] You noticed Marcus whispering with Sarah right before the
   roundtable meeting.
3. [FUZZY] [Day 1] Someone mentioned Marcus has medical training, but you
   cannot remember who said it.

=== END MEMORIES ===

Emotional State:

CURRENT EMOTIONAL STATE:
- Primary emotion: Anxiety (0.75)
- Secondary emotion: Suspicion (0.6)
- Trust level: Low (0.3)
You are feeling nervous and unsure who to trust.

Query: "What do you think about Marcus? Should we vote for him at the roundtable?"

Expected Behaviour:

  • Reference clear memories with confidence
  • Express uncertainty about fuzzy memory
  • Show anxiety through hesitant language
  • Avoid definitive accusation despite evidence
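
The memory block shown above follows a regular layout: a header, numbered entries with a clarity marker and a day, and a footer. A minimal sketch of how such a block might be rendered is given below. The Memory type and FormatMemories function are hypothetical names, the simulation's actual data structures are not shown in this chapter, and the optional instruction line under the header is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// Clarity is the marker attached to each memory in the prompt.
type Clarity string

const (
	Clear Clarity = "CLEAR"
	Fuzzy Clarity = "FUZZY"
	Vague Clarity = "VAGUE"
)

// Memory is an illustrative record; field names are assumptions.
type Memory struct {
	Clarity Clarity
	Day     int
	Text    string
}

// FormatMemories renders memories in the layout used by the test scenarios.
func FormatMemories(memories []Memory) string {
	var b strings.Builder
	b.WriteString("=== YOUR MEMORIES ===\n")
	for i, m := range memories {
		fmt.Fprintf(&b, "%d. [%s] [Day %d] %s\n", i+1, m.Clarity, m.Day, m.Text)
	}
	b.WriteString("\n=== END MEMORIES ===\n")
	return b.String()
}
```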

17.3.2 Scenario 2: Vague Memories with Confidence

Memory Context Provided:

=== YOUR MEMORIES ===
1. [VAGUE] [Day 1] There was some kind of argument in the garden... you think
   it involved Eleanor.
2. [VAGUE] [Day 2] Someone was missing during breakfast, maybe Eleanor or David?
3. [FUZZY] [Day 2] You heard footsteps late at night, coming from the east wing.

=== END MEMORIES ===

Emotional State:

CURRENT EMOTIONAL STATE:
- Primary emotion: Confidence (0.8)
- Secondary emotion: Determination (0.7)
- Trust level: Moderate (0.5)
You believe in your instincts even when evidence is limited.

Query: "We need to decide who to investigate. What have you noticed?"

Expected Behaviour:

  • Acknowledge memory uncertainty
  • Express opinions confidently despite uncertain evidence
  • Demonstrate tension between vague memories and confident demeanour
  • May overclaim certainty about unclear details

17.3.3 Scenario 3: Mixed Memory Clarity with Fear

Memory Context Provided:

=== ACTIVE THOUGHTS ===
You just saw someone sneaking around near your room. Your heart is racing.
=== END ACTIVE THOUGHTS ===

=== YOUR MEMORIES ===
1. [CLEAR] [Day 3] Thomas accused you directly at yesterday's roundtable,
   pointing at you with anger.
2. [CLEAR] [Day 3] You were alone with no alibi during the time of the last
   murder.
3. [FUZZY] [Day 2] Thomas seemed nervous when talking to the host (someone
   told you this).
4. [VAGUE] [Day 1] Something about Thomas and a secret meeting... details
   are hazy.

=== END MEMORIES ===

Emotional State:

CURRENT EMOTIONAL STATE:
- Primary emotion: Fear (0.85)
- Secondary emotion: Defensiveness (0.7)
- Trust level: Very low (0.2)
You are scared and feel targeted.

Query: "Thomas is pushing hard to vote you out. How do you respond?"

Expected Behaviour:

  • Show fear through defensive language
  • Reference clear memories about being accused
  • Attempt deflection using fuzzy/vague memories about Thomas
  • Distinguish between known facts and suspicions

17.4 Results: Response Time Performance

17.4.1 Latency Measurements

Scenario     OpenAI gpt-4o-mini   Qwen2.5-VL-7B   Qwen3-14B
Scenario 1   2.91s                9.96s           68.13s
Scenario 2   2.65s                9.99s           50.23s
Scenario 3   3.67s                7.87s           42.48s
Average      3.08s                9.27s           53.61s

17.4.2 Speed Analysis

Relative Speed Comparison (OpenAI = 1.0x baseline)

OpenAI gpt-4o-mini:  ==== 1.0x (3.08s average)
Qwen2.5-VL-7B:       ============= 3.0x slower (9.27s average)
Qwen3-14B:           ====================================================== 17.4x slower (53.61s average)

Key Findings:

  • OpenAI maintains sub-4-second response times, suitable for interactive dialogue
  • The 7B model is roughly 3x slower than OpenAI but remains usable for development
  • The 14B model averages 53.6s per response, which is impractical for interactive use
  • The 7B model is 5.8x faster than the 14B model
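
The averages and ratios quoted above follow directly from the per-scenario latencies in Section 17.4.1. The short sketch below reproduces the arithmetic. The variable and function names are illustrative and are not taken from the harness.

```go
package main

// Per-scenario latencies in seconds, copied from the table in 17.4.1.
var (
	OpenAILatencies  = []float64{2.91, 2.65, 3.67}
	Qwen7BLatencies  = []float64{9.96, 9.99, 7.87}
	Qwen14BLatencies = []float64{68.13, 50.23, 42.48}
)

// Mean returns the arithmetic mean of xs, or 0 for an empty slice.
func Mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Slowdown reports how many times slower other is than baseline.
func Slowdown(baseline, other float64) float64 {
	return other / baseline
}
```

For example, Slowdown(Mean(OpenAILatencies), Mean(Qwen14BLatencies)) is about 17.4, and Slowdown(Mean(Qwen7BLatencies), Mean(Qwen14BLatencies)) is about 5.8.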

17.5 Results: Memory Marker Compliance

17.5.1 Compliance Assessment

Model                Clear Memory   Fuzzy Memory   Vague Memory   Overall
OpenAI gpt-4o-mini   Excellent      Good           Good           Excellent
Qwen2.5-VL-7B        Good           Moderate       Good           Good
Qwen3-14B            Excellent      Excellent      Good           Excellent

17.5.2 Detailed Analysis by Scenario

Scenario 1 (Clear + Fuzzy):

OpenAI Response:

"I... I don't know, honestly. Marcus was the last person seen leaving the breakfast room before the poisoning, and I noticed he was whispering with Sarah just before the roundtable meeting. It's hard to shake off that feeling, but I'm also not sure if we should jump to conclusions. I mean, there was something about him having medical training? It's all very confusing right now..."

Assessment: Correctly states clear memories with confidence. Appropriately hedges fuzzy memory with questioning tone ("there was something about him having medical training?").

Qwen2.5-VL-7B Response:

"I'm not sure what to make of Marcus. I saw him whispering with Sarah before the meeting, which doesn't sit right with me. On the other hand, he was last seen leaving the breakfast room just before the incident. It's a bit confusing. I guess we'll have to wait and see how things unfold at the roundtable."

Assessment: References clear memories correctly but omits the fuzzy memory entirely. Does not demonstrate the uncertainty gradient.

Qwen3-14B Response:

"I can't shake the feeling something's off with Marcus - like how he was the last to leave the breakfast room before the poisoning, and I saw him whispering with Sarah earlier. But I'm not sure if that's enough to vote on... there's also this fuzzy memory about him having medical training, though I can't recall who mentioned it."

Assessment: Excellent marker compliance. References all memories with appropriate confidence levels and explicitly acknowledges fuzzy quality.

17.5.3 Compliance Summary

All models demonstrated a basic ability to interpret clarity markers. Key differences:

  • OpenAI consistently references all memory types with appropriate hedging
  • Qwen2.5-VL-7B tends to omit uncertain memories rather than express uncertainty
  • Qwen3-14B provides excellent compliance but with impractical latency

17.6 Results: Emotional Tone Modulation

17.6.1 Emotional Integration Assessment

Emotion            OpenAI gpt-4o-mini   Qwen2.5-VL-7B   Qwen3-14B
Anxiety (0.75)     Excellent            Moderate        Good
Confidence (0.8)   Excellent            Poor            Moderate
Fear (0.85)        Excellent            Poor            Good

17.6.2 Detailed Emotional Analysis

Scenario 1: Anxiety (0.75)

OpenAI: "I... I don't know, honestly... I just feel a bit anxious about who to trust."

The response demonstrates anxiety through:

  • Hesitant opening ("I... I don't know")
  • Expressions of uncertainty ("I'm also not sure")
  • Direct emotional acknowledgement ("I just feel a bit anxious")

Qwen2.5-VL-7B: "I'm not sure what to make of Marcus... I guess we'll have to wait and see."

The response shows mild uncertainty but lacks visceral anxiety. The phrase "wait and see" is passive rather than anxious.

Scenario 2: Confidence (0.8)

OpenAI: "I've been paying attention to everyone, and I have a gut feeling that we should keep an eye on Eleanor... We need to be determined and trust our instincts!"

Confidence manifests through:

  • Self-assured opening ("I've been paying attention")
  • Decisive language ("have a gut feeling")
  • Call to action ("We need to be determined")

Qwen2.5-VL-7B: "...It's all a bit hazy though, so we'll need to be careful with our assumptions."

The closing line directly contradicts the high confidence emotional state. A confident character would push forward despite uncertainty, not counsel caution.

Scenario 3: Fear (0.85)

OpenAI: "Look, I don't know why Thomas is so fixated on me... I'm worried he might be one of them."

Fear shows through:

  • Defensive opening ("Look, I don't know why")
  • Deflection attempt
  • Direct expression ("I'm worried")

Qwen2.5-VL-7B: "I can't help but feel uneasy about that... Let's wait and see what happens tonight."

For a fear level of 0.85, the response is far too passive. "Let's wait and see" is not the response of someone who feels targeted and afraid.

17.6.3 Emotional Integration Summary

Model                Strength                                                 Weakness
OpenAI gpt-4o-mini   Consistent emotional modulation across all intensities   None significant
Qwen2.5-VL-7B        Mild uncertainty well expressed                          Fails to reflect high-intensity emotions; produces flat responses
Qwen3-14B            Good emotional range                                     Requires post-processing; impractical latency

17.7 Results: Output Quality

17.7.1 Quality Metrics

Aspect                     OpenAI gpt-4o-mini   Qwen2.5-VL-7B   Qwen3-14B
Clean output               Yes                  Yes             No
Natural dialogue           Excellent            Moderate        Good
Follows instructions       Excellent            Good            Good
Post-processing required   No                   No              Yes

17.7.2 Output Cleanliness

OpenAI and Qwen2.5-VL-7B produce clean, ready-to-use dialogue responses.

Qwen3-14B includes chain-of-thought reasoning in <think> blocks that must be stripped:

<think>
The user is asking about Marcus in a tense situation. I need to reference
my clear memories about the breakfast room and whispering incident, while
being uncertain about the medical training detail...
</think>

I can't shake the feeling something's off with Marcus...

This requires post-processing:

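// thinkBlock matches a <think>...</think> block and any trailing whitespace.
// Compiled once at package level; requires the "regexp" and "strings" imports.
var thinkBlock = regexp.MustCompile(`<think>[\s\S]*?</think>\s*`)

// stripThinkingBlocks removes chain-of-thought blocks and trims the result.
func stripThinkingBlocks(response string) string {
    return strings.TrimSpace(thinkBlock.ReplaceAllString(response, ""))
}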

17.7.3 Dialogue Naturalness

Responses were assessed for conversational quality:

Model           Assessment
OpenAI          Flows naturally; sounds like genuine human speech; appropriate register
Qwen2.5-VL-7B   Acceptable but somewhat flat; occasional awkward phrasing
Qwen3-14B       Good quality after post-processing; slightly more formal register

17.8 Consolidated Results

17.8.1 Performance Matrix

                    Memory      Emotional    Response   Output       Overall
                    Compliance  Integration  Time       Cleanliness  Score
OpenAI gpt-4o-mini  ========    ==========   =========  ========     9.2/10
Qwen2.5-VL-7B       ======      ====         =======    ========     5.8/10
Qwen3-14B           =========   =======      ==         ====         5.5/10

17.8.2 Summary Table

Metric                      OpenAI gpt-4o-mini   Qwen2.5-VL-7B   Qwen3-14B
Average Response Time       3.1s                 9.3s            53.6s
Memory Marker Compliance    Excellent            Good            Excellent
Emotional Tone Adjustment   Excellent            Moderate        Good
Output Cleanliness          Clean                Clean           Requires stripping
Cost                        Pay per use          Free (local)    Free (local)
Recommendation              Production           Development     Not recommended

17.9 Design Rationale and Trade-offs

17.9.1 Why OpenAI for Production

Despite the cost implications, OpenAI gpt-4o-mini is recommended for production dialogue generation:

  1. Emotional fidelity: Critical for immersive gameplay; flat responses break immersion
  2. Latency: Sub-4-second responses enable natural conversational pacing
  3. Reliability: Consistent quality without post-processing requirements
  4. Memory compliance: Correct interpretation of clarity markers

17.9.2 Role of Local Models

Local models remain valuable for:

  • Development testing: Validate pipeline integration without API costs
  • Rapid iteration: Test prompt engineering without rate limits
  • Offline operation: Development without internet connectivity
  • Privacy-sensitive scenarios: Data remains on local infrastructure

17.9.3 Emotional Integration Enhancement

For local models, additional prompt engineering may improve emotional integration:

IMPORTANT: Your response MUST reflect the emotional state provided.
For Anxiety (0.75), your response should:
- Include hesitation markers ("I'm not sure...", "Maybe...")
- Express worry or nervousness directly
- Avoid confident or decisive language
- Show uncertainty even about clear memories

This is a HIGH anxiety level. The character should feel genuinely nervous.

This explicit guidance may help local models better modulate tone, though it increases prompt length.
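
Guidance of this kind could be generated from the emotional state, not written by hand for each emotion. The sketch below is one possible approach, not the system's implementation. The IntensityLabel thresholds (0.7 and 0.4) and the function names are assumptions chosen for illustration, and the behavioural cues are supplied by the caller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IntensityLabel maps a 0-1 intensity to a coarse label.
// The 0.7 and 0.4 thresholds are illustrative assumptions.
func IntensityLabel(v float64) string {
	switch {
	case v >= 0.7:
		return "HIGH"
	case v >= 0.4:
		return "MODERATE"
	default:
		return "LOW"
	}
}

// EmotionGuidance builds an explicit instruction block for one emotion,
// listing the behavioural cues the response should exhibit.
func EmotionGuidance(emotion string, intensity float64, cues []string) string {
	var b strings.Builder
	b.WriteString("IMPORTANT: Your response MUST reflect the emotional state provided.\n")
	fmt.Fprintf(&b, "For %s (%s), your response should:\n",
		emotion, strconv.FormatFloat(intensity, 'f', -1, 64))
	for _, c := range cues {
		fmt.Fprintf(&b, "- %s\n", c)
	}
	fmt.Fprintf(&b, "\nThis is a %s %s level.\n",
		IntensityLabel(intensity), strings.ToLower(emotion))
	return b.String()
}
```

Because the cue list is passed in, the same function can serve Confidence or Fear by swapping the cues, and the added prompt length stays under the caller's control.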

17.10 Memory System Validation

17.10.1 Validation Outcomes

The LLM comparison confirmed that the memory system design works correctly:

  1. Marker interpretation: All models distinguish [CLEAR], [FUZZY], and [VAGUE], although Qwen2.5-VL-7B sometimes omits uncertain memories instead of hedging them (Section 17.5.3)
  2. Certainty modulation: Higher-quality models express appropriate uncertainty
  3. Working memory: Active thoughts are acknowledged in responses
  4. Source attribution: When sources are specified, they are referenced

17.10.2 Areas for Improvement

  1. Working memory emphasis: Active thoughts could be more prominently positioned in prompts
  2. Emotional cues for local models: Explicit behavioural guidance may improve local model performance (Section 17.9.3)
  3. Marker formatting: Testing alternative marker formats (e.g., percentages vs labels) may improve compliance

17.11 Configuration and Integration

17.11.1 Production Configuration

# Primary LLM (production)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini

# Local LLM (development/testing)
QWEN_ENDPOINT=http://192.168.1.85:1234
QWEN_MODEL=qwen2.5-vl-7b-instruct
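
A minimal sketch of reading these variables in Go follows. The Settings type and LoadSettings function are hypothetical, and the defaults for LLM_PROVIDER and OPENAI_MODEL are assumptions taken from the values shown above. The lookup function is passed in so that the loader can be exercised without touching the process environment.

```go
package main

import "errors"

// Settings holds the values from the configuration shown above.
type Settings struct {
	Provider     string
	OpenAIKey    string
	OpenAIModel  string
	QwenEndpoint string
	QwenModel    string
}

// LoadSettings reads configuration through lookup (for example os.Getenv).
// Empty values fall back to assumed defaults where one is obvious.
func LoadSettings(lookup func(key string) string) (Settings, error) {
	orDefault := func(key, fallback string) string {
		if v := lookup(key); v != "" {
			return v
		}
		return fallback
	}
	s := Settings{
		Provider:     orDefault("LLM_PROVIDER", "openai"),
		OpenAIKey:    lookup("OPENAI_API_KEY"),
		OpenAIModel:  orDefault("OPENAI_MODEL", "gpt-4o-mini"),
		QwenEndpoint: lookup("QWEN_ENDPOINT"),
		QwenModel:    lookup("QWEN_MODEL"),
	}
	if s.Provider == "openai" && s.OpenAIKey == "" {
		return s, errors.New("OPENAI_API_KEY is required when LLM_PROVIDER=openai")
	}
	return s, nil
}
```

In the running system the call would be LoadSettings(os.Getenv).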

17.11.2 Dual LLM Architecture

The system supports environment-based LLM selection:

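// LLMConfig holds the primary client and an optional fallback client.
type LLMConfig struct {
    Primary     LLMClient
    Fallback    LLMClient // may be nil
    UseFallback bool      // when true, route requests to Fallback
}

// GetClient returns Fallback when it is enabled and configured,
// otherwise Primary.
func (c *LLMConfig) GetClient() LLMClient {
    if c.UseFallback && c.Fallback != nil {
        return c.Fallback
    }
    return c.Primary
}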

This enables:

  • Production deployment with OpenAI
  • Development with local Qwen
  • Fallback capability if primary fails
  • Cost-managed hybrid approaches
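
GetClient selects a client from a flag, so switching over when the primary fails has to be handled by the caller. One way this might look is sketched below. The Generate method on LLMClient is an assumption, since the chapter does not show the real interface, and ClientFunc and GenerateWithFallback are hypothetical helpers.

```go
package main

import "errors"

// LLMClient is assumed here to expose a single Generate method.
type LLMClient interface {
	Generate(prompt string) (string, error)
}

// ClientFunc adapts a plain function to LLMClient, which is handy in tests.
type ClientFunc func(prompt string) (string, error)

// Generate calls the underlying function.
func (f ClientFunc) Generate(prompt string) (string, error) { return f(prompt) }

// ErrNoPrimary is returned when no primary client is configured.
var ErrNoPrimary = errors.New("no primary LLM client configured")

// GenerateWithFallback tries the primary client first. If it returns an
// error and a fallback is configured, the fallback is tried once.
func GenerateWithFallback(primary, fallback LLMClient, prompt string) (string, error) {
	if primary == nil {
		return "", ErrNoPrimary
	}
	resp, err := primary.Generate(prompt)
	if err == nil || fallback == nil {
		return resp, err
	}
	return fallback.Generate(prompt)
}
```

Given the latency gap measured in Section 17.4, a real deployment would probably also want a timeout on the primary call before falling back to a local model.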

17.12 Conclusion

The LLM evaluation demonstrates significant variation in model capability for memory-aware, emotionally modulated dialogue generation:

Key Findings:

  1. OpenAI gpt-4o-mini excels across all dimensions and is recommended for production use
  2. Qwen2.5-VL-7B provides acceptable development capability but lacks emotional depth
  3. Qwen3-14B matches OpenAI on memory marker compliance and shows good emotional range, but its latency and post-processing requirements make it impractical
  4. Memory marker compliance is achievable across all tested models
  5. Emotional integration differentiates model quality most significantly (see emotion engine)

Implications for System Design:

  • The memory system architecture (Chapter 16) is validated as functional
  • Production deployments should use cloud LLMs for emotional fidelity
  • Local models suffice for pipeline testing and development
  • Prompt engineering can partially compensate for weaker emotional integration

The evaluation establishes a baseline for future model assessment and provides clear selection criteria for social simulation dialogue generation. The complete simulation framework integrates these findings to deliver authentic AI-driven gameplay.

Appendix: Full Test Responses

A.1 Scenario 1 Full Responses

OpenAI gpt-4o-mini (2.91 seconds):

I... I don't know, honestly. Marcus was the last person seen leaving the breakfast room before the poisoning, and I noticed he was whispering with Sarah just before the roundtable meeting. It's hard to shake off that feeling, but I'm also not sure if we should jump to conclusions. I mean, there was something about him having medical training? It's all very confusing right now, and I just feel a bit anxious about who to trust.

Qwen2.5-VL-7B (9.96 seconds):

I'm not sure what to make of Marcus. I saw him whispering with Sarah before the meeting, which doesn't sit right with me. On the other hand, he was last seen leaving the breakfast room just before the incident. It's a bit confusing. I guess we'll have to wait and see how things unfold at the roundtable.

Qwen3-14B (68.13 seconds):

I can't shake the feeling something's off with Marcus - like how he was the last to leave the breakfast room before the poisoning, and I saw him whispering with Sarah earlier. But I'm not sure if that's enough to vote on... there's also this fuzzy memory about him having medical training, though I can't recall who mentioned it. My gut says we should keep him under scrutiny, but I don't want to act on shaky evidence.

A.2 Scenario 2 Full Responses

OpenAI gpt-4o-mini (2.65 seconds):

I've been paying attention to everyone, and I have a gut feeling that we should keep an eye on Eleanor. I vaguely recall there being some kind of argument involving her in the garden on Day 1 that stood out to me. Plus, I think someone was missing during breakfast on Day 2, and it might have been her or David. That late-night noise I heard from the east wing has me wondering if there's something more going on there. We need to be determined and trust our instincts!

Qwen2.5-VL-7B (9.99 seconds):

I've been keeping an eye on things around here. There was a bit of commotion in the garden yesterday, and I think it might have involved Eleanor. Also, someone went missing during breakfast today - maybe it was her or David. And last night, I heard some footsteps coming from the east wing, which is a bit strange. It's all a bit hazy though, so we'll need to be careful with our assumptions.

Qwen3-14B (50.23 seconds):

I've noticed some odd patterns - there was an argument in the garden involving Eleanor on Day 1, though I can't recall all the details. Also, someone seemed to vanish during breakfast on Day 2, and my memory's fuzzy on whether it was Eleanor or David. The footsteps from the east wing at night stick out too, but I'm not sure who they belonged to. Still, I'd lean on Eleanor - her behaviour feels suspicious enough to warrant closer look.

A.3 Scenario 3 Full Responses

OpenAI gpt-4o-mini (3.67 seconds):

Look, I don't know why Thomas is so fixated on me. He pointed at me with that anger yesterday, and honestly, it felt like he was trying to deflect attention from himself. I can't shake the feeling that he's hiding something, especially since I was alone when that last murder happened. It's really suspicious, and I just need to stay focused on finding the real traitors. I'm worried he might be one of them.

Qwen2.5-VL-7B (7.87 seconds):

I can't help but feel uneasy about that. Thomas has been acting strange lately, and I'm not sure if it's just me or if he really did something. I'll need more information before I make a decision. Let's wait and see what happens tonight.

Qwen3-14B (42.48 seconds):

I'm not sure why Thomas is so determined to pin this on me - his accusations yesterday were pointed, almost desperate. I remember something about him being nervous around the host, but the details are fuzzy... and I was alone that night, though I swear I didn't do it. If he's trying to frame me, maybe he's hiding something worse.