Authors: Behrouz Nik-Nejad-Kazem-Pour, Andrei Corduneanu, Kevin Darren Nguemdjom Tchangang
This project focuses on the development of a GPT-style Transformer trained specifically on the tiny_shakespeare dataset. The goal is to build a decoder-only language model capable of generating coherent Shakespearean text, and to systematically explore how architectural choices — embedding size, number of attention heads, and depth — affect generation quality.
Each token is mapped to a learned 512-dimensional vector, giving the model a continuous space where semantically related words cluster together. Position embeddings are added element-wise to each token vector, making word order explicit to a model that otherwise treats input as a set rather than a sequence. Using learnable rather than fixed sinusoidal positional encodings lets the model adapt position representations to the statistical patterns of the training corpus, at the cost of requiring sufficient training data for those positions to converge.
Vector(Word) + Vector(Position).
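The embedding step above can be sketched in a few lines. This is an illustrative NumPy mock-up, not the project's actual code: the tables are random stand-ins for learned parameters, and the sizes (a 65-symbol vocabulary, an 8-token context, 512 dimensions) are assumptions matching the report.

```python
import numpy as np

# Illustrative sizes: 512-d embeddings as in the report; vocab/context are assumed.
vocab_size, context_len, d_model = 65, 8, 512

rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(vocab_size, d_model))   # learned token table (random stand-in)
pos_emb = rng.normal(size=(context_len, d_model))  # learned position table (random stand-in)

def embed(token_ids):
    """Vector(Word) + Vector(Position), added element-wise per position."""
    ids = np.asarray(token_ids)
    return tok_emb[ids] + pos_emb[np.arange(len(ids))]

x = embed([3, 1, 4, 1])
print(x.shape)  # (4, 512): one 512-d vector per input token
```

Because both tables are trainable, gradients flow into `pos_emb` just as into `tok_emb`, which is what lets learned positional encodings adapt to the corpus.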
A decoder-only Transformer generates tokens autoregressively by attending over all prior positions through causal multi-head attention, then projecting to a vocabulary distribution. No source sequence exists to encode, so an encoder-decoder stack would add parameters and training complexity without benefit. Eight attention heads partition the 512-dimensional representation into parallel subspaces, letting the model track multiple dependency types simultaneously, at the cost of attention complexity scaling quadratically with sequence length.
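The causal masking that makes the decoder autoregressive can be sketched as a single attention head in NumPy (the full model would split the 512 dimensions into eight 64-dimensional heads; this simplified, hypothetical version shows only the mask and softmax):

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask:
    position i may only attend to positions j <= i."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True on future positions
    scores[mask] = -np.inf                            # masked logits -> zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))          # 5 tokens, one 64-d head
out, w = causal_attention(x, x, x)
# The upper triangle of w is exactly zero: no token sees the future.
```

The quadratic cost mentioned above is visible directly: `scores` is a T-by-T matrix, so both memory and compute grow with the square of the sequence length.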
We used a 90/5/5 split (training, validation, test). Compared with a more conventional 70/15/15 split, this maximizes training volume for the small tiny_shakespeare corpus, allowing the model to better capture its linguistic subtleties.
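For a contiguous text corpus like tiny_shakespeare, the split reduces to slicing the data at two offsets. A minimal sketch (the function name and ratios-as-defaults are illustrative, not the project's code):

```python
def split_dataset(data, train_frac=0.90, val_frac=0.05):
    """Contiguous 90/5/5 train/validation/test split of a sequence."""
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

text = "x" * 1000                      # stand-in for the corpus
train, val, test = split_dataset(text)
print(len(train), len(val), len(test))  # 900 50 50
```

Slicing contiguously (rather than shuffling) keeps each partition as coherent running text, which matters when evaluating a language model on held-out passages.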
Configuration 1: No structural stopping mechanism; resulted in run-on sentences.
Configuration 2: Punctuation included in the vocabulary; generation stops at a predicted period (.).
Configuration 3: 4 heads, 512 dimensions; balanced syntax and computation.
Configuration 4: Embedding size increased from 512 to 800, allowing the model to retain more "particularities" and semantic nuances for each token.
Configuration 5: Our most powerful model; stacked blocks allow higher levels of abstraction, improving overall coherence.
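The period-based stopping mechanism introduced in the second configuration amounts to a check inside the sampling loop. A toy sketch, with a seeded random stand-in in place of the trained model (all names here are hypothetical):

```python
import random

def generate(next_token_fn, prompt, stop_token=".", max_new=50):
    """Autoregressive generation that halts as soon as the model
    emits the stop token, instead of running to the length cap."""
    out = list(prompt)
    for _ in range(max_new):
        tok = next_token_fn(out)
        out.append(tok)
        if tok == stop_token:
            break
    return "".join(out)

# Toy stand-in for a trained model: random letters with occasional periods.
rng = random.Random(0)
toy_model = lambda ctx: rng.choice(list("abcde ."))

sample = generate(toy_model, "To be")
print(sample)  # ends at the first predicted '.' (or at the length cap)
```

Without the `stop_token` check, the loop always exhausts `max_new` tokens, which is exactly the run-on behavior the first configuration exhibited.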
Each configuration in this sequence isolates one variable: stopping mechanism, head count, embedding size, or model depth. Comparing output coherence across configurations, rather than tracking a single loss metric, is an appropriate evaluation approach for a generative model where quality is inherently subjective. The coherence improvement from configuration 3 to configuration 5 suggests that stacking transformer layers contributes more to output quality at this scale than increasing head count or embedding dimension alone.