Tic-Tac-Toe AI (SML)
Reinforcement Learning in Standard ML
- What I built: An AI agent that learns to play Tic-Tac-Toe from scratch using Q-Learning.
- Why it matters: Demonstrates how a learning algorithm that normally depends on mutable state can be implemented within a strict functional programming paradigm.
- Proof: The agent plays 100,000+ games against itself to converge on an optimal strategy before facing a human.
Problem / Goal
This project explores the intersection of Functional Programming and Reinforcement Learning. Typically, RL algorithms rely heavily on mutable value tables or arrays, which conflicts with the emphasis on immutable data in functional languages.
The goal was to build an agent that starts with zero knowledge and progressively builds a strategy, implemented entirely in Standard ML (SML).
My Contribution
I implemented the core learning engine and memory structures:
- Self-Play Training: Engineered a training loop where the agent plays against itself 100,000 times to explore the state space.
- Q-Learning Logic: Implemented the Bellman-style update to refine move values based on win/loss/draw outcomes (a sketch of this update follows the list).
- Functional Memory: Built a custom dictionary structure to map unique board hashes to learned values without relying on global mutable arrays.
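The project's exact update code is not reproduced here; the sketch below shows the shape of the value update described above. The names alpha, gamma, and updated_value are illustrative, not identifiers from the project.

(* A minimal sketch of the value update applied after each game.
   alpha, gamma and updated_value are illustrative names, not taken
   from the project code. *)
val alpha = 0.1   (* learning rate *)
val gamma = 0.9   (* discount factor *)

(* V(s) <- V(s) + alpha * (reward + gamma * V(s') - V(s)) *)
fun updated_value (v_s : real, reward : real, v_next : real) : real =
  v_s + alpha * (reward + gamma * v_next - v_s)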
Technical Approach
1. Functional Memory (Dictionary)
A small dictionary structure, written as an association list and traversed with pattern matching, maps each unique board hash to its learned Q-value.
structure Dict =
struct
  (* Memory is an association list of (key, value) pairs *)
  type ('k, 'v) dict = ('k * 'v) list

  val empty : ('k, 'v) dict = []

  (* Linear search using the caller-supplied comparison function *)
  fun lookup (cmp, d, k) =
    case d of
        [] => NONE
      | (key, value) :: rest =>
          if cmp (key, k) = EQUAL then SOME value
          else lookup (cmp, rest, k)
end
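Lookup alone only reads the memory; storing an updated value needs a companion operation in the same association-list style. The insert below is a sketch of what that operation could look like; it is not part of the structure shown above.

(* Sketch of a companion insert (not shown in the original structure):
   replace an existing binding for the key, or add a new one. *)
fun insert (cmp, d, k, v) =
  case d of
      [] => [(k, v)]
    | (key, value) :: rest =>
        if cmp (key, k) = EQUAL then (k, v) :: rest
        else (key, value) :: insert (cmp, rest, k, v)

Each insert returns a new list rather than mutating the old one, which is what keeps the learned memory purely functional.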
2. Decision Making
During gameplay, the AI evaluates each possible move by looking up the hash of its successor state in memory. States it has never seen default to a neutral value of 0.0.
fun best_next_state (mem : memory, (board, who) : Game.state) =
  let
    val moves = Game.possible_moves (board, who)

    (* Score a move by looking up the hash of the state it leads to *)
    fun get_score move =
      let
        val next_s = Game.make_move ((board, who), move)
        val s_hash = Game.hash next_s
      in
        case Dict.lookup (String.compare, mem, s_hash) of
            SOME score => score
          | NONE => 0.0   (* unseen state: neutral value *)
      end

    (* Keep whichever move scores higher *)
    fun better (m, best as (_, best_score)) =
      let val s = get_score m
      in if s > best_score then (SOME m, s) else best end
  in
    (* Select the move whose successor state has the highest score *)
    #1 (List.foldl better (NONE, Real.negInf) moves)
  end
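With move selection in place, the self-play training described under My Contribution threads the immutable memory through repeated games. The sketch below shows only the shape of that loop: play_game stands in for the project's per-game function, which is not shown here, so the loop takes it as a parameter.

(* Shape of the self-play training loop: the memory produced by one game
   becomes the input to the next. play_game is a stand-in parameter for
   the project's per-game function, which is not shown here. *)
fun train play_game (mem, 0) = mem
  | train play_game (mem, n) = train play_game (play_game mem, n - 1)

(* Hypothetical usage: val trained = train play_one_game (Dict.empty, 100000) *)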
Validation / Results
After pre-training on 100,000 self-play games, the agent consistently draws against optimal play and wins against suboptimal human moves. The dictionary structure proved efficient enough to handle the complete state space of Tic-Tac-Toe (approx. 5,478 states).