SML Text Classification

Functional Framework using Naïve Bayes and MapReduce

What I built: A robust text classification framework from scratch in Standard ML.
Why it matters: Demonstrates how functional programming paradigms (immutability, higher-order functions) apply to data science tasks.
Proof: Implemented a reusable "Extract-Combine" pattern that mimics MapReduce for parallel data processing.

Problem / Goal

Data processing tasks are typically associated with imperative languages (like Python or C++). The goal of this project was to implement a rigorous text classification system using Standard ML, leveraging functional programming paradigms like higher-order functions and immutable data structures.

At its core, the framework needed to provide a functional analogue to MapReduce to efficiently process large document sets and train a Naïve Bayes classifier.

My Contribution

I engineered the core functional abstractions:

Extract-Combine Pattern: Developed a higher-order function that maps inputs to key-value pairs (extract) and reduces them (combine), enabling modular parallelization.
Naïve Bayes Logic: Implemented the statistical classifier to probability distributions of document categories.
Modular Architecture: Designed the system to separate the parallel execution engine from the specific classification logic, allowing plug-and-play components.

Technical Approach

1. The MapReduce Abstraction (Extract-Combine)

The `extractcombine` function is the engine of the framework. It handles the dictionary management and collision resolution entirely through recursion and pattern matching.

fun extractcombine (compare_fn, extractor_fn, combiner_fn, dataset) =
    let
        (* Maps key-values into the dictionary, combining collisions *)
        fun combinePair (dict, (key, value)) =
            case Dict.lookup (compare_fn, dict, key) of
                NONE => Dict.insert (compare_fn, dict, (key, value))
              | SOME existing =>
                Dict.insert (compare_fn, dict, (key, combiner_fn (existing, value)))
    in
        (* ... Implementation details omitted for brevity ... *)
        combined_result
    end

2. Declarative Training Pipeline

Training the classifier becomes a concise series of functional transformations. Here, we count total words per category by mapping each document to a (Category, WordCount) pair and summing them.

(* Calculate the total number of words in each category *)
val countWordsPerCategory =
    ExtractCombine.extractcombine (
        String.compare,
        (* Extract: Map each document to (Category, WordCount) *)
        (fn (category, doc) => Seq.singleton (category, Seq.length doc)),
        (* Combine: Sum WordCounts for the same Category *)
        (fn (x, y) => x + y),
        train)

Validation / Results

The framework successfully classifies text documents with high accuracy on standard datasets. More importantly, the modular design allows exact same `extractcombine` engine to be used for unrelated tasks (like log analysis or word counting) without modification.

Links

View Repository on GitHub