Authors: Behrouz Nik-Nejad-Kazem-Pour, Andrei Corduneanu, Kevin Darren Nguemdjom Tchangang
This project focuses on the development of a GPT-style Transformer trained specifically on the tiny_shakespeare dataset. The goal was to build a generative model from scratch that could predict subsequent tokens in a sequence, effectively mimicking the unique syntax and prose of William Shakespeare.
By experimenting with various configurations—ranging from embedding dimensions to the number of attention heads—we documented the evolution of the model's performance. The final iteration represents a deep learning architecture capable of producing structurally sound and stylistically consistent text, showcasing the power of self-attention mechanisms in natural language processing.
Before training, the raw text undergoes preprocessing. We used a 90/5/5 split (training, validation, test). Compared with a more conservative 70/15/15 split, this maximizes the volume of training data drawn from the small tiny_shakespeare corpus, allowing the model to better capture its linguistic subtleties.
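The split described above can be sketched as follows (a minimal sketch; the function name and the dummy token sequence are illustrative, not the project's actual code):

```python
def split_dataset(data, train_frac=0.90, val_frac=0.05):
    """Split a token sequence into train/validation/test partitions."""
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]  # the remaining ~5%
    return train, val, test

# Example with a dummy token sequence of length 1000.
tokens = list(range(1000))
train, val, test = split_dataset(tokens)
print(len(train), len(val), len(test))  # 900 50 50
```

Because the split is contiguous rather than shuffled, the validation and test partitions are held-out stretches of text, which is the usual choice for language-model corpora.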
The embedding layer converts each integer token index into a vector rich in semantic information, formed as the sum of a learned word embedding and a positional embedding:
Vector(Word) + Vector(Position).
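In code, this summation might look like the following pure-Python sketch (toy dimensions and randomly initialized tables; the names `tok_emb` and `pos_emb` are assumptions for illustration, and the real model uses d_model = 512):

```python
import random

random.seed(0)
vocab_size, block_size, d_model = 65, 8, 4  # toy sizes for demonstration

# Lookup tables: one learned vector per token id and one per position.
tok_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(block_size)]

def embed(token_ids):
    """Return Vector(Word) + Vector(Position) for each token in the sequence."""
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[i])]
        for i, tok in enumerate(token_ids)
    ]

out = embed([3, 1, 4])
print(len(out), len(out[0]))  # 3 4
```

Summing (rather than concatenating) the two vectors keeps the model dimension fixed while still letting the network distinguish the same token at different positions.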
Following "Attention Is All You Need" (Vaswani et al., 2017), we built a decoder-only Transformer. Its core is causal multi-head attention with 8 heads, which splits the 512-dimensional representation into 8 subspaces of 64 dimensions each. This lets the model attend to different linguistic aspects, such as grammar and rhyme, simultaneously.
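The head-splitting and causal masking can be sketched in pure Python (a simplified, unbatched sketch: Q = K = V here, whereas the real model applies learned projections first):

```python
import math, random

random.seed(0)
d_model, n_heads = 512, 8
head_dim = d_model // n_heads  # 512 / 8 = 64 dimensions per head

def split_heads(x):
    """Split each d_model-dim vector into n_heads vectors of head_dim."""
    return [[v[h * head_dim:(h + 1) * head_dim] for v in x] for h in range(n_heads)]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask:
    position i may only attend to positions j <= i."""
    T = len(q)
    out = []
    for i in range(T):
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(head_dim)
                  for j in range(i + 1)]  # causal: only j <= i
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(head_dim)])
    return out

# Toy sequence of 3 token vectors.
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
heads = split_heads(x)
out0 = causal_attention(heads[0], heads[0], heads[0])
print(len(heads), len(out0), len(out0[0]))  # 8 3 64
```

Note that the first position can only attend to itself, so its output equals its value vector exactly; that property is a quick sanity check that the mask is causal.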
We tested various configurations to observe the impact of hyperparameters.
1. No structural stopping mechanism; generation produced run-on sentences.
2. Punctuation included in the vocabulary; generation stops at a predicted period (.).
3. 4 heads, 512 dimensions; balanced syntax quality and computation cost.
4. Increased vector size (from 512 to 800), letting the model retain more "particularities" and semantic nuances for each token.
5. Our most powerful model: stacked blocks allow higher levels of abstraction, improving overall coherence.
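The stop-at-period rule described above can be sketched as a generation loop (the `sample_next` callable and the token ids are hypothetical stand-ins for the trained model and its vocabulary):

```python
PERIOD_ID = 13  # hypothetical id of '.' in the vocabulary

def generate(sample_next, prompt_ids, max_new_tokens=50, stop_id=PERIOD_ID):
    """Autoregressively sample tokens, stopping once a period is predicted
    (or after max_new_tokens, to avoid run-on output)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        nxt = sample_next(ids)  # model call: predict the next token id
        ids.append(nxt)
        if nxt == stop_id:      # structural stop: the model predicted '.'
            break
    return ids

# Stub "model" that emits two tokens and then a period.
script = iter([5, 5, PERIOD_ID, 5])
out = generate(lambda ids: next(script), [1, 2])
print(out)  # [1, 2, 5, 5, 13]
```

The max_new_tokens cap acts as a fallback so generation still terminates even if the model never emits a period.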