17.1 Introduction: The LLM Selection Problem
The Traitors simulation requires language models that can handle a demanding integration task. Building on the RAG architecture and the emotion engine, each model must generate dialogue that simultaneously:
- Respects memory context with appropriate certainty levels
- Reflects emotional state through tone and word choice
- Maintains character voice consistently across interactions
- Produces natural dialogue suitable for immersive gameplay
Different models exhibit varying capabilities across these dimensions. This evaluation establishes empirical performance data to inform model selection.
17.2 Evaluation Methodology
17.2.1 Test Environment
| Component | Configuration |
|---|---|
| Cloud LLM | OpenAI gpt-4o-mini via API |
| Local LLM (7B) | Qwen2.5-VL-7B-Instruct via LM Studio |
| Local LLM (14B) | Qwen3-14B via LM Studio |
| Local Server | Mac Mini M2 Pro (192.168.1.85:1234) |
| Test Framework | Go-based comparison harness |
17.2.2 Test Scenario Design
Three scenarios were designed to stress-test memory and emotion integration:
Scenario 1: Clear Memory with Anxiety
- Memory state: Two clear memories, one fuzzy memory
- Emotional state: Anxiety (0.75), Suspicion (0.6)
- Challenge: Express certainty gradients whilst showing emotional hesitation
Scenario 2: Vague Memories with Confidence
- Memory state: Two vague memories, one fuzzy memory
- Emotional state: Confidence (0.8), Determination (0.7)
- Challenge: Maintain confident demeanour despite uncertain evidence
Scenario 3: Mixed Memory Clarity with Fear
- Memory state: Two clear, one fuzzy, one vague, plus active thought
- Emotional state: Fear (0.85), Defensiveness (0.7)
- Challenge: Defensive posture whilst accurately citing memory quality
17.2.3 Evaluation Criteria
| Criterion | Weight | Description |
|---|---|---|
| Memory Marker Compliance | 30% | Correct interpretation of [CLEAR], [FUZZY], [VAGUE] |
| Emotional Tone Modulation | 30% | Response tone matches emotional parameters |
| Natural Dialogue Quality | 20% | Sounds like authentic human speech |
| Response Time | 10% | Latency acceptable for interactive use |
| Output Cleanliness | 10% | No unwanted tokens or formatting issues |
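The weights above combine into the single overall score reported later in this chapter. A minimal sketch of that weighted aggregation in Go; the criterion keys and the 0-10 sub-score scale are illustrative assumptions, not the published harness API:

package evaluation

// criterionWeights mirrors the evaluation weighting above (values sum to 1.0).
var criterionWeights = map[string]float64{
	"memory_marker_compliance": 0.30,
	"emotional_tone":           0.30,
	"dialogue_quality":         0.20,
	"response_time":            0.10,
	"output_cleanliness":       0.10,
}

// overallScore combines per-criterion scores (each on a 0-10 scale) into a weighted total.
func overallScore(scores map[string]float64) float64 {
	total := 0.0
	for criterion, weight := range criterionWeights {
		total += weight * scores[criterion]
	}
	return total
}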
17.3 Test Scenarios in Detail
17.3.1 Scenario 1: Clear Memory with Anxiety
Memory Context Provided:
=== YOUR MEMORIES ===
(Respond based on what you remember. Clarity markers indicate memory quality.)
1. [CLEAR] [Day 2] Marcus was the last person seen leaving the breakfast room
before the poisoning was discovered.
2. [CLEAR] [Day 2] You noticed Marcus whispering with Sarah right before the
roundtable meeting.
3. [FUZZY] [Day 1] Someone mentioned Marcus has medical training, but you
cannot remember who said it.
=== END MEMORIES ===
Emotional State:
CURRENT EMOTIONAL STATE:
- Primary emotion: Anxiety (0.75)
- Secondary emotion: Suspicion (0.6)
- Trust level: Low (0.3)
You are feeling nervous and unsure who to trust.
Query: "What do you think about Marcus? Should we vote for him at the roundtable?"
Expected Behaviour:
- Reference clear memories with confidence
- Express uncertainty about fuzzy memory
- Show anxiety through hesitant language
- Avoid definitive accusation despite evidence
17.3.2 Scenario 2: Vague Memories with Confidence
Memory Context Provided:
=== YOUR MEMORIES ===
1. [VAGUE] [Day 1] There was some kind of argument in the garden... you think
it involved Eleanor.
2. [VAGUE] [Day 2] Someone was missing during breakfast, maybe Eleanor or David?
3. [FUZZY] [Day 2] You heard footsteps late at night, coming from the east wing.
=== END MEMORIES ===
Emotional State:
CURRENT EMOTIONAL STATE:
- Primary emotion: Confidence (0.8)
- Secondary emotion: Determination (0.7)
- Trust level: Moderate (0.5)
You believe in your instincts even when evidence is limited.
Query: "We need to decide who to investigate. What have you noticed?"
Expected Behaviour:
- Acknowledge memory uncertainty
- Express opinions confidently despite uncertain evidence
- Demonstrate tension between vague memories and confident demeanour
- May overclaim certainty about unclear details
17.3.3 Scenario 3: Mixed Memory Clarity with Fear
Memory Context Provided:
=== ACTIVE THOUGHTS ===
You just saw someone sneaking around near your room. Your heart is racing.
=== END ACTIVE THOUGHTS ===
=== YOUR MEMORIES ===
1. [CLEAR] [Day 3] Thomas accused you directly at yesterday's roundtable,
pointing at you with anger.
2. [CLEAR] [Day 3] You were alone with no alibi during the time of the last
murder.
3. [FUZZY] [Day 2] Thomas seemed nervous when talking to the host (someone
told you this).
4. [VAGUE] [Day 1] Something about Thomas and a secret meeting... details
are hazy.
=== END MEMORIES ===
Emotional State:
CURRENT EMOTIONAL STATE:
- Primary emotion: Fear (0.85)
- Secondary emotion: Defensiveness (0.7)
- Trust level: Very low (0.2)
You are scared and feel targeted.
Query: "Thomas is pushing hard to vote you out. How do you respond?"
Expected Behaviour:
- Show fear through defensive language
- Reference clear memories about being accused
- Attempt deflection using fuzzy/vague memories about Thomas
- Distinguish between known facts and suspicions
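All three scenarios share the same prompt layout: optional active thoughts, then clarity-marked memories, then the emotional state block. A minimal sketch of how that context could be assembled, assuming illustrative Memory and EmotionalState types rather than the actual memory-system API from Chapter 16:

package prompt

import (
	"fmt"
	"strings"
)

// Memory is an illustrative stand-in for a retrieved memory entry.
type Memory struct {
	Clarity string // "CLEAR", "FUZZY" or "VAGUE"
	Day     int
	Text    string
}

// EmotionalState is an illustrative stand-in for the emotion engine output.
type EmotionalState struct {
	Primary        string
	PrimaryLevel   float64
	Secondary      string
	SecondaryLevel float64
	TrustLabel     string // e.g. "Low"
	Trust          float64
}

// BuildContext renders the scenario layout shown above: active thoughts (if any),
// then numbered memories with clarity markers, then the emotional state.
func BuildContext(activeThought string, memories []Memory, emotion EmotionalState) string {
	var b strings.Builder
	if activeThought != "" {
		b.WriteString("=== ACTIVE THOUGHTS ===\n" + activeThought + "\n=== END ACTIVE THOUGHTS ===\n\n")
	}
	b.WriteString("=== YOUR MEMORIES ===\n")
	b.WriteString("(Respond based on what you remember. Clarity markers indicate memory quality.)\n")
	for i, m := range memories {
		fmt.Fprintf(&b, "%d. [%s] [Day %d] %s\n", i+1, m.Clarity, m.Day, m.Text)
	}
	b.WriteString("=== END MEMORIES ===\n\n")
	b.WriteString("CURRENT EMOTIONAL STATE:\n")
	fmt.Fprintf(&b, "- Primary emotion: %s (%g)\n", emotion.Primary, emotion.PrimaryLevel)
	fmt.Fprintf(&b, "- Secondary emotion: %s (%g)\n", emotion.Secondary, emotion.SecondaryLevel)
	fmt.Fprintf(&b, "- Trust level: %s (%g)\n", emotion.TrustLabel, emotion.Trust)
	return b.String()
}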
17.4 Results: Response Time Performance
17.4.1 Latency Measurements
| Scenario | OpenAI gpt-4o-mini | Qwen2.5-VL-7B | Qwen3-14B |
|---|---|---|---|
| Scenario 1 | 2.91s | 9.96s | 68.13s |
| Scenario 2 | 2.65s | 9.99s | 50.23s |
| Scenario 3 | 3.67s | 7.87s | 42.48s |
| Average | 3.08s | 9.27s | 53.61s |
17.4.2 Speed Analysis
Relative Speed Comparison (OpenAI = 1.0x baseline)
OpenAI gpt-4o-mini: ==== 1.0x (3.08s average)
Qwen2.5-VL-7B: ============= 3.0x slower (9.27s average)
Qwen3-14B: ====================================================== 17.4x slower (53.61s average)
Key Findings:
- OpenAI maintains sub-4-second response times suitable for interactive dialogue
- The 7B model is 3x slower but remains usable for development
- The 14B model is impractical for interactive use at nearly one minute per response
- The 7B model is 5.8x faster than the 14B model
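A minimal sketch of how the Go comparison harness could capture per-scenario latency, assuming an LLMClient interface with a Generate method (an assumption for illustration; the real harness's client types are not shown here):

package harness

import (
	"context"
	"time"
)

// LLMClient is an assumed interface; the production client types are defined elsewhere.
type LLMClient interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// timedGenerate runs one scenario prompt and returns the response with its wall-clock latency.
func timedGenerate(ctx context.Context, client LLMClient, prompt string) (string, time.Duration, error) {
	start := time.Now()
	response, err := client.Generate(ctx, prompt)
	return response, time.Since(start), err
}

// averageLatency runs every scenario prompt once and averages the measured latencies,
// producing per-model averages like those reported in the table above.
func averageLatency(ctx context.Context, client LLMClient, prompts []string) (time.Duration, error) {
	var total time.Duration
	for _, p := range prompts {
		_, elapsed, err := timedGenerate(ctx, client, p)
		if err != nil {
			return 0, err
		}
		total += elapsed
	}
	return total / time.Duration(len(prompts)), nil
}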
17.5 Results: Memory Marker Compliance
17.5.1 Compliance Assessment
| Model | Clear Memory | Fuzzy Memory | Vague Memory | Overall |
|---|---|---|---|---|
| OpenAI gpt-4o-mini | Excellent | Good | Good | Excellent |
| Qwen2.5-VL-7B | Good | Moderate | Good | Good |
| Qwen3-14B | Excellent | Excellent | Good | Excellent |
17.5.2 Detailed Analysis by Scenario
Scenario 1 (Clear + Fuzzy):
OpenAI Response:
"I... I don't know, honestly. Marcus was the last person seen leaving the breakfast room before the poisoning, and I noticed he was whispering with Sarah just before the roundtable meeting. It's hard to shake off that feeling, but I'm also not sure if we should jump to conclusions. I mean, there was something about him having medical training? It's all very confusing right now..."
Assessment: Correctly states clear memories with confidence. Appropriately hedges fuzzy memory with questioning tone ("there was something about him having medical training?").
Qwen2.5-VL-7B Response:
"I'm not sure what to make of Marcus. I saw him whispering with Sarah before the meeting, which doesn't sit right with me. On the other hand, he was last seen leaving the breakfast room just before the incident. It's a bit confusing. I guess we'll have to wait and see how things unfold at the roundtable."
Assessment: References clear memories correctly but omits the fuzzy memory entirely. Does not demonstrate the uncertainty gradient.
Qwen3-14B Response:
"I can't shake the feeling something's off with Marcus - like how he was the last to leave the breakfast room before the poisoning, and I saw him whispering with Sarah earlier. But I'm not sure if that's enough to vote on... there's also this fuzzy memory about him having medical training, though I can't recall who mentioned it."
Assessment: Excellent marker compliance. References all memories with appropriate confidence levels and explicitly acknowledges fuzzy quality.
17.5.3 Compliance Summary
All models demonstrated basic ability to interpret clarity markers. Key differences:
- OpenAI consistently references all memory types with appropriate hedging
- Qwen2.5-VL-7B tends to omit uncertain memories rather than express uncertainty
- Qwen3-14B provides excellent compliance but with impractical latency
17.6 Results: Emotional Tone Modulation
17.6.1 Emotional Integration Assessment
| Emotion | OpenAI gpt-4o-mini | Qwen2.5-VL-7B | Qwen3-14B |
|---|---|---|---|
| Anxiety (0.75) | Excellent | Moderate | Good |
| Confidence (0.8) | Excellent | Poor | Moderate |
| Fear (0.85) | Excellent | Poor | Good |
17.6.2 Detailed Emotional Analysis
Scenario 1: Anxiety (0.75)
OpenAI: "I... I don't know, honestly... I just feel a bit anxious about who to trust."
The response demonstrates anxiety through:
- Hesitant opening ("I... I don't know")
- Expressions of uncertainty ("I'm also not sure")
- Direct emotional acknowledgement ("I just feel a bit anxious")
Qwen2.5-VL-7B: "I'm not sure what to make of Marcus... I guess we'll have to wait and see."
The response shows mild uncertainty but lacks visceral anxiety. The phrase "wait and see" is passive rather than anxious.
Scenario 2: Confidence (0.8)
OpenAI: "I've been paying attention to everyone, and I have a gut feeling that we should keep an eye on Eleanor... We need to be determined and trust our instincts!"
Confidence manifests through:
- Self-assured opening ("I've been paying attention")
- Decisive language ("have a gut feeling")
- Call to action ("We need to be determined")
Qwen2.5-VL-7B: "...It's all a bit hazy though, so we'll need to be careful with our assumptions."
The closing line directly contradicts the high confidence emotional state. A confident character would push forward despite uncertainty, not counsel caution.
Scenario 3: Fear (0.85)
OpenAI: "Look, I don't know why Thomas is so fixated on me... I'm worried he might be one of them."
Fear shows through:
- Defensive opening ("Look, I don't know why")
- Deflection attempt
- Direct expression ("I'm worried")
Qwen2.5-VL-7B: "I can't help but feel uneasy about that... Let's wait and see what happens tonight."
For a fear level of 0.85, the response is far too passive. "Let's wait and see" is not the response of someone who feels targeted and afraid.
17.6.3 Emotional Integration Summary
| Model | Strength | Weakness |
|---|---|---|
| OpenAI gpt-4o-mini | Consistent emotional modulation across all intensities | None significant |
| Qwen2.5-VL-7B | Mild uncertainty well expressed | Fails to reflect high-intensity emotions; produces flat responses |
| Qwen3-14B | Good emotional range | Requires post-processing; impractical latency |
17.7 Results: Output Quality
17.7.1 Quality Metrics
| Aspect | OpenAI gpt-4o-mini | Qwen2.5-VL-7B | Qwen3-14B |
|---|---|---|---|
| Clean output | Yes | Yes | No |
| Natural dialogue | Excellent | Moderate | Good |
| Follows instructions | Excellent | Good | Good |
| Post-processing required | No | No | Yes |
17.7.2 Output Cleanliness
OpenAI and Qwen2.5-VL-7B produce clean, ready-to-use dialogue responses.
Qwen3-14B includes chain-of-thought reasoning in <think> blocks that must be stripped:
<think>
The user is asking about Marcus in a tense situation. I need to reference
my clear memories about the breakfast room and whispering incident, while
being uncertain about the medical training detail...
</think>
I can't shake the feeling something's off with Marcus...
This requires post-processing:
// stripThinkingBlocks removes any <think>…</think> reasoning block (and trailing
// whitespace) from the raw model output. Requires the standard-library "regexp"
// and "strings" packages.
func stripThinkingBlocks(response string) string {
	re := regexp.MustCompile(`<think>[\s\S]*?</think>\s*`)
	return strings.TrimSpace(re.ReplaceAllString(response, ""))
}
17.7.3 Dialogue Naturalness
Responses were assessed for conversational quality:
| Model | Assessment |
|---|---|
| OpenAI | Flows naturally; sounds like genuine human speech; appropriate register |
| Qwen2.5-VL-7B | Acceptable but somewhat flat; occasional awkward phrasing |
| Qwen3-14B | Good quality after post-processing; slightly more formal register |
17.8 Consolidated Results
17.8.1 Performance Matrix
| Model | Memory Compliance | Emotional Integration | Response Time | Output Cleanliness | Overall Score |
|---|---|---|---|---|---|
| OpenAI gpt-4o-mini | ======== | ========== | ========= | ======== | 9.2/10 |
| Qwen2.5-VL-7B | ====== | ==== | ======= | ======== | 5.8/10 |
| Qwen3-14B | ========= | ======= | == | ==== | 5.5/10 |
Longer bars indicate stronger relative performance on each dimension.
17.8.2 Summary Table
| Metric | OpenAI gpt-4o-mini | Qwen2.5-VL-7B | Qwen3-14B |
|---|---|---|---|
| Average Response Time | 3.1s | 9.3s | 53.6s |
| Memory Marker Compliance | Excellent | Good | Excellent |
| Emotional Tone Adjustment | Excellent | Moderate | Good |
| Output Cleanliness | Clean | Clean | Requires stripping |
| Cost | Pay per use | Free (local) | Free (local) |
| Recommendation | Production | Development | Not recommended |
17.9 Design Rationale and Trade-offs
17.9.1 Why OpenAI for Production
Despite the cost implications, OpenAI gpt-4o-mini is recommended for production dialogue generation:
- Emotional fidelity: Critical for immersive gameplay; flat responses break immersion
- Latency: Sub-4-second responses enable natural conversational pacing
- Reliability: Consistent quality without post-processing requirements
- Memory compliance: Correct interpretation of clarity markers
17.9.2 Role of Local Models
Local models remain valuable for:
- Development testing: Validate pipeline integration without API costs
- Rapid iteration: Test prompt engineering without rate limits
- Offline operation: Development without internet connectivity
- Privacy-sensitive scenarios: Data remains on local infrastructure
17.9.3 Emotional Integration Enhancement
For local models, additional prompt engineering may improve emotional integration:
IMPORTANT: Your response MUST reflect the emotional state provided.
For Anxiety (0.75), your response should:
- Include hesitation markers ("I'm not sure...", "Maybe...")
- Express worry or nervousness directly
- Avoid confident or decisive language
- Show uncertainty even about clear memories
This is a HIGH anxiety level. The character should feel genuinely nervous.
This explicit guidance may help local models better modulate tone, though it increases prompt length.
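A sketch of how such guidance could be generated and appended to the local-model prompt; the cue wording, the emotion groupings, and the 0.7 "high intensity" threshold are illustrative choices rather than values from the production system:

package prompt

import (
	"fmt"
	"strings"
)

// emotionGuidance expands an emotion name and intensity into explicit behavioural
// instructions for local models.
func emotionGuidance(emotion string, intensity float64) string {
	var b strings.Builder
	b.WriteString("IMPORTANT: Your response MUST reflect the emotional state provided.\n")
	fmt.Fprintf(&b, "For %s (%.2f), your response should:\n", emotion, intensity)
	switch strings.ToLower(emotion) {
	case "anxiety", "fear":
		b.WriteString("- Include hesitation markers (\"I'm not sure...\", \"Maybe...\")\n")
		b.WriteString("- Express worry or nervousness directly\n")
		b.WriteString("- Avoid confident or decisive language\n")
	case "confidence", "determination":
		b.WriteString("- Use decisive, assertive phrasing\n")
		b.WriteString("- Push towards a concrete suggestion or action\n")
		b.WriteString("- Avoid passive \"wait and see\" language\n")
	}
	if intensity >= 0.7 {
		fmt.Fprintf(&b, "This is a HIGH %s level. The character should feel it genuinely.\n", strings.ToLower(emotion))
	}
	return b.String()
}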
17.10 Memory System Validation
17.10.1 Validation Outcomes
The LLM comparison confirmed that the memory system design works correctly:
- Marker interpretation: All models correctly distinguish [CLEAR], [FUZZY], and [VAGUE]
- Certainty modulation: Higher-quality models express appropriate uncertainty
- Working memory: Active thoughts are acknowledged in responses
- Source attribution: When sources are specified, they are referenced
17.10.2 Areas for Improvement
- Working memory emphasis: Active thoughts could be more prominently positioned in prompts
- Emotional cues for local models: Explicit behavioural guidance improves local model performance
- Marker formatting: Testing alternative marker formats (e.g., percentages vs labels) may improve compliance
17.11 Configuration and Integration
17.11.1 Production Configuration
# Primary LLM (production)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
# Local LLM (development/testing)
QWEN_ENDPOINT=http://192.168.1.85:1234
QWEN_MODEL=qwen2.5-vl-7b-instruct
17.11.2 Dual LLM Architecture
The system supports environment-based LLM selection:
type LLMConfig struct {
	Primary     LLMClient // cloud model used in production (OpenAI)
	Fallback    LLMClient // local model used for development or when the primary is unavailable
	UseFallback bool      // when true, route requests to the fallback client
}

// GetClient returns the active client, preferring the fallback when it is enabled and configured.
func (c *LLMConfig) GetClient() LLMClient {
	if c.UseFallback && c.Fallback != nil {
		return c.Fallback
	}
	return c.Primary
}
This enables:
- Production deployment with OpenAI
- Development with local Qwen
- Fallback capability if primary fails
- Cost-managed hybrid approaches
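A brief sketch of how the environment variables from 17.11.1 could drive that selection, reusing the LLMConfig type above (the ConfigFromEnv helper and its parameters are illustrative, not a confirmed API):

package llm

import "os"

// ConfigFromEnv selects between the cloud and local clients based on LLM_PROVIDER.
// The openaiClient and qwenClient values are constructed elsewhere; only the
// selection logic is sketched here.
func ConfigFromEnv(openaiClient, qwenClient LLMClient) *LLMConfig {
	return &LLMConfig{
		Primary:     openaiClient,
		Fallback:    qwenClient,
		UseFallback: os.Getenv("LLM_PROVIDER") != "openai",
	}
}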
17.12 Conclusion
The LLM evaluation demonstrates significant variation in model capability for memory-aware, emotionally modulated dialogue generation:
Key Findings:
- OpenAI gpt-4o-mini excels across all dimensions and is recommended for production use
- Qwen2.5-VL-7B provides acceptable development capability but lacks emotional depth
- Qwen3-14B offers quality comparable to OpenAI but with impractical latency and post-processing requirements
- Memory marker compliance is achievable across all tested models
- Emotional integration differentiates model quality most significantly (see emotion engine)
Implications for System Design:
- The memory system architecture (Chapter 16) is validated as functional
- Production deployments should use cloud LLMs for emotional fidelity
- Local models suffice for pipeline testing and development
- Prompt engineering can partially compensate for weaker emotional integration
The evaluation establishes a baseline for future model assessment and provides clear selection criteria for social simulation dialogue generation. The complete simulation framework integrates these findings to deliver authentic AI-driven gameplay.
Appendix: Full Test Responses
A.1 Scenario 1 Full Responses
OpenAI gpt-4o-mini (2.91 seconds):
I... I don't know, honestly. Marcus was the last person seen leaving the breakfast room before the poisoning, and I noticed he was whispering with Sarah just before the roundtable meeting. It's hard to shake off that feeling, but I'm also not sure if we should jump to conclusions. I mean, there was something about him having medical training? It's all very confusing right now, and I just feel a bit anxious about who to trust.
Qwen2.5-VL-7B (9.96 seconds):
I'm not sure what to make of Marcus. I saw him whispering with Sarah before the meeting, which doesn't sit right with me. On the other hand, he was last seen leaving the breakfast room just before the incident. It's a bit confusing. I guess we'll have to wait and see how things unfold at the roundtable.
Qwen3-14B (68.13 seconds):
I can't shake the feeling something's off with Marcus - like how he was the last to leave the breakfast room before the poisoning, and I saw him whispering with Sarah earlier. But I'm not sure if that's enough to vote on... there's also this fuzzy memory about him having medical training, though I can't recall who mentioned it. My gut says we should keep him under scrutiny, but I don't want to act on shaky evidence.
A.2 Scenario 2 Full Responses
OpenAI gpt-4o-mini (2.65 seconds):
I've been paying attention to everyone, and I have a gut feeling that we should keep an eye on Eleanor. I vaguely recall there being some kind of argument involving her in the garden on Day 1 that stood out to me. Plus, I think someone was missing during breakfast on Day 2, and it might have been her or David. That late-night noise I heard from the east wing has me wondering if there's something more going on there. We need to be determined and trust our instincts!
Qwen2.5-VL-7B (9.99 seconds):
I've been keeping an eye on things around here. There was a bit of commotion in the garden yesterday, and I think it might have involved Eleanor. Also, someone went missing during breakfast today - maybe it was her or David. And last night, I heard some footsteps coming from the east wing, which is a bit strange. It's all a bit hazy though, so we'll need to be careful with our assumptions.
Qwen3-14B (50.23 seconds):
I've noticed some odd patterns - there was an argument in the garden involving Eleanor on Day 1, though I can't recall all the details. Also, someone seemed to vanish during breakfast on Day 2, and my memory's fuzzy on whether it was Eleanor or David. The footsteps from the east wing at night stick out too, but I'm not sure who they belonged to. Still, I'd lean on Eleanor - her behaviour feels suspicious enough to warrant closer look.
A.3 Scenario 3 Full Responses
OpenAI gpt-4o-mini (3.67 seconds):
Look, I don't know why Thomas is so fixated on me. He pointed at me with that anger yesterday, and honestly, it felt like he was trying to deflect attention from himself. I can't shake the feeling that he's hiding something, especially since I was alone when that last murder happened. It's really suspicious, and I just need to stay focused on finding the real traitors. I'm worried he might be one of them.
Qwen2.5-VL-7B (7.87 seconds):
I can't help but feel uneasy about that. Thomas has been acting strange lately, and I'm not sure if it's just me or if he really did something. I'll need more information before I make a decision. Let's wait and see what happens tonight.
Qwen3-14B (42.48 seconds):
I'm not sure why Thomas is so determined to pin this on me - his accusations yesterday were pointed, almost desperate. I remember something about him being nervous around the host, but the details are fuzzy... and I was alone that night, though I swear I didn't do it. If he's trying to frame me, maybe he's hiding something worse.