Tic-Tac-Toe AI (SML)
Reinforcement Learning in Standard ML
- What I built: An AI agent that learns to play Tic-Tac-Toe from scratch using Q-Learning.
- Why it matters: Demonstrates how a learning algorithm that normally depends on mutable state can be implemented within a strict functional programming paradigm.
- Proof: The agent plays 100,000+ games against itself to converge on an optimal strategy before facing a human.
Problem / Goal
This project explores the intersection of Functional Programming and Reinforcement Learning. Typically, RL algorithms rely heavily on mutable value tables or arrays, which conflicts with the emphasis on immutable data in functional languages.
The goal was to build an agent that starts with zero knowledge and progressively builds a strategy, implemented entirely in Standard ML (SML).
My Contribution
I implemented the core learning engine and memory structures:
- Self-Play Training: Engineered a training loop where the agent plays against itself 100,000 times to explore the state space.
- Q-Learning Logic: Implemented the Bellman-style update to refine move values based on win/loss/draw outcomes (a sketch of this update follows the list).
- Functional Memory: Built a custom dictionary structure to map unique board hashes to learned values without relying on global mutable arrays.
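The project's exact update code is not reproduced here; the sketch below shows the shape of the value update described above. The names alpha, gamma, and updated_value are illustrative, not identifiers from the project.

(* A minimal sketch of the value update applied after each game.
   alpha, gamma and updated_value are illustrative names, not taken
   from the project code. *)
val alpha = 0.1   (* learning rate *)
val gamma = 0.9   (* discount factor *)

(* V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s)) *)
fun updated_value (v_s : real, reward : real, v_next : real) : real =
  v_s + alpha * (reward + gamma * v_next - v_s)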
Technical Approach
1. Functional Memory (Dictionary)
A small dictionary structure, written as an association list and traversed with pattern matching, maps each unique board hash to its learned Q-value.
structure Dict =
struct
  (* Memory is an association list of (key, value) pairs *)
  type ('k, 'v) dict = ('k * 'v) list

  val empty : ('k, 'v) dict = []

  (* Linear search using the caller-supplied comparison function *)
  fun lookup (cmp, d, k) =
    case d of
        [] => NONE
      | (key, value) :: rest =>
          if cmp (key, k) = EQUAL then SOME value
          else lookup (cmp, rest, k)
end
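Lookup alone only reads the memory; storing an updated value needs a companion operation in the same association-list style. The insert below is a sketch of what that operation could look like; it is not part of the structure shown above.

(* Sketch of a companion insert (not shown in the original structure):
   replace an existing binding for the key, or add a new one. *)
fun insert (cmp, d, k, v) =
  case d of
      [] => [(k, v)]
    | (key, value) :: rest =>
        if cmp (key, k) = EQUAL then (k, v) :: rest
        else (key, value) :: insert (cmp, rest, k, v)

Each insert returns a new list rather than mutating the old one, which is what keeps the learned memory purely functional.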
2. Decision Making
During gameplay, the AI evaluates each possible move by looking up the hash of its successor state in memory. States it has never seen default to a neutral value of 0.0.
fun best_next_state (mem : memory, (board, who) : Game.state) =
  let
    val moves = Game.possible_moves (board, who)

    (* Score a move by looking up the hash of the state it leads to *)
    fun get_score move =
      let
        val next_s = Game.make_move ((board, who), move)
        val s_hash = Game.hash next_s
      in
        case Dict.lookup (String.compare, mem, s_hash) of
            SOME score => score
          | NONE => 0.0   (* unseen state: neutral value *)
      end

    (* Keep whichever move scores higher *)
    fun better (m, best as (_, best_score)) =
      let val s = get_score m
      in if s > best_score then (SOME m, s) else best end
  in
    (* Select the move whose successor state has the highest score *)
    #1 (List.foldl better (NONE, Real.negInf) moves)
  end
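With move selection in place, the self-play training described under My Contribution threads the immutable memory through repeated games. The sketch below shows only the shape of that loop: play_game stands in for the project's per-game function, which is not shown here, so the loop takes it as a parameter.

(* Shape of the self-play training loop: the memory produced by one game
   becomes the input to the next. play_game is a stand-in parameter for
   the project's per-game function, which is not shown here. *)
fun train play_game (mem, 0) = mem
  | train play_game (mem, n) = train play_game (play_game mem, n - 1)

(* Hypothetical usage: val trained = train play_one_game (Dict.empty, 100000) *)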
Validation / Results
After pre-training on 100,000 self-play games, the agent consistently draws against optimal play and wins against suboptimal human moves. The dictionary structure proved efficient enough to handle the complete state space of Tic-Tac-Toe (approx. 5,478 states).