Transformer Project: Text Generation

GEI898 - Apprentissage profond avancé | Feb 12, 2026

Authors: Behrouz Nik-Nejad-Kazem-Pour, Andrei Corduneanu, Kevin Darren Nguemdjom Tchangang

Project Overview

This project focuses on the development of a GPT-style Transformer trained specifically on the tiny_shakespeare dataset. The goal was to build a generative model from scratch that could predict subsequent tokens in a sequence, effectively mimicking the unique syntax and prose of William Shakespeare.

By experimenting with various configurations—ranging from embedding dimensions to the number of attention heads—we documented the evolution of the model's performance. The final iteration represents a deep learning architecture capable of producing structurally sound and stylistically consistent text, showcasing the power of self-attention mechanisms in natural language processing.

1. Text to Token Conversion

Before training, the raw data undergoes preprocessing. We used a 90/5/5 split (training, validation, test); compared with the more common 70/15/15 split, this maximizes training volume for the small tiny_shakespeare dataset, letting the model better capture its linguistic subtleties.

  • Preparing the Text: The process begins by loading the entire Shakespeare dataset as one continuous string. To clean the data, the text is converted to lowercase and punctuation is isolated, so that marks like periods and commas are treated as separate tokens; this ensures that "king" and "king," are seen as the same fundamental word. The string is then split into a list of individual words and punctuation marks.
  • Sequencing: Fixed-length windows ($N=64$) with overlap to increase example density.
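The preprocessing steps above can be sketched as follows. Function names (`tokenize`, `make_windows`), the regex, and the stride value are illustrative assumptions, not the project's actual identifiers:

```python
import re

def tokenize(text):
    # Lowercase and isolate punctuation so "king," becomes ["king", ","].
    text = text.lower()
    return re.findall(r"[a-z']+|[.,!?;:]", text)

def make_windows(ids, n=64, stride=32):
    # Fixed-length sequences with overlap to increase example density.
    return [ids[i:i + n] for i in range(0, len(ids) - n + 1, stride)]

tokens = tokenize("The KING, and the king.")
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = [vocab[w] for w in tokens]

# 90/5/5 split on the token stream (training / validation / test).
n = len(ids)
train = ids[:int(0.90 * n)]
val = ids[int(0.90 * n):int(0.95 * n)]
test = ids[int(0.95 * n):]
```

Splitting on the contiguous token stream (rather than shuffled examples) keeps each partition internally coherent text.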

2. Embedding Generation

The model converts integer indices into vectors rich in semantic information:

  • Token Embedding: Vocabulary projected into $D=512$ space.
  • Position Embedding: Learnable positional vectors are added to indicate word order, as Transformers lack inherent recurrence.
  • Result: Input = Vector(Word) + Vector(Position).
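A minimal sketch of Input = Vector(Word) + Vector(Position), using random NumPy tables in place of the learned `nn.Embedding` weights; sizes follow the text ($D=512$, $N=64$), while the vocabulary size is an assumed placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, N, D = 1000, 64, 512  # vocab_size is a stand-in value

tok_emb = rng.standard_normal((vocab_size, D)) * 0.02  # token table
pos_emb = rng.standard_normal((N, D)) * 0.02           # learnable positional table

token_ids = rng.integers(0, vocab_size, size=N)        # one input sequence
x = tok_emb[token_ids] + pos_emb                       # (N, D) model input
```

Because both tables are trained, the model learns jointly what a word means and how its position in the window matters.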

3. Architecture & Attention

Based on "Attention Is All You Need," we built a Decoder-Only Transformer. The core is the Causal Multi-Head Attention (8 heads), which splits the 512-dimension space into 8 subspaces of 64. This allows the model to focus on various linguistic aspects like grammar and rhyme simultaneously.

Causal Mask: positions above the diagonal of the attention-score matrix are set to $-\infty$ (a lower-triangular mask), preventing the model from "seeing the future" during training.
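The attention mechanism can be sketched in NumPy as below. For brevity the same input stands in for the learned Q, K, V projections; the real model applies separate weight matrices before the head split:

```python
import numpy as np

def causal_mha(x, n_heads=8):
    # D=512 split into 8 heads working in subspaces of d = 512 // 8 = 64.
    N, D = x.shape
    d = D // n_heads
    q = k = v = x.reshape(N, n_heads, d).transpose(1, 0, 2)  # (heads, N, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)           # (heads, N, N)
    mask = np.triu(np.ones((N, N)), k=1).astype(bool)        # above the diagonal
    scores = np.where(mask, -np.inf, scores)                 # hide future tokens
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                         # row-wise softmax
    out = (w @ v).transpose(1, 0, 2).reshape(N, D)           # merge the heads
    return out, w

x = np.random.default_rng(0).standard_normal((64, 512))
out, w = causal_mha(x)
```

After masking, every row of attention weights is a distribution over the current and past positions only.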

4. Model Evolution & Text Generation Results

We tested various configurations to observe the impact of hyperparameters.

1. Unconstrained Generation

No structural stopping mechanism; resulted in run-on sentences.

2. Structured Generation

Punctuation was included in the vocabulary; generation stops when the model predicts a period (.).
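This stopping rule can be sketched as the loop below. `model` is a stand-in callable returning logits (the project uses its trained Transformer), and greedy decoding is an assumption made for simplicity:

```python
import numpy as np

def generate(model, prompt_ids, period_id, max_len=100):
    ids = list(prompt_ids)
    for _ in range(max_len):
        logits = model(ids)           # scores over the vocabulary
        nxt = int(np.argmax(logits))  # greedy choice of the next token
        ids.append(nxt)
        if nxt == period_id:          # stop at the predicted "."
            break
    return ids

# Toy stand-in model whose highest logit is always the period token (id 1).
out = generate(lambda ids: np.array([0.0, 5.0]), [0], period_id=1)
```

The `max_len` cap guards against the unconstrained run-on behaviour seen in the first configuration, should the model never emit a period.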

3. Standard Architecture

4 heads, 512 dimensions. Balanced syntax and computation.

4. High Capacity (32 Heads / 800 Dims)

Increased vector size (from 512 to 800) allows the model to retain more "particularities" and semantic nuances for each token.

5. Deep Model (5 Transformer Layers)

Our most powerful model. Stacked blocks allow for higher levels of abstraction, improving overall coherence.
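The effect of depth can be sketched as below: each of the 5 stacked blocks refines the previous layer's representation through a residual connection. Random linear maps stand in for the full attention + feed-forward sublayers, so only the stacking pattern is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, n_layers = 64, 512, 5

# Stand-in blocks: small random linear maps in place of attention + MLP sublayers.
blocks = [rng.standard_normal((D, D)) * 0.01 for _ in range(n_layers)]

x = rng.standard_normal((N, D))  # embedded input sequence
for W in blocks:
    x = x + x @ W  # residual connection: each layer adds a refinement
```

Residual connections keep the shape fixed at (N, D) through the stack, which is what allows blocks to be stacked to arbitrary depth.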