Transformer Project: Text Generation

GEI898 - Advanced Deep Learning | Feb 12, 2026

Authors: Behrouz Nik-Nejad-Kazem-Pour, Andrei Corduneanu, Kevin Darren Nguemdjom Tchangang

1. Context & Objective

This project focuses on the development of a GPT-style Transformer trained specifically on the tiny_shakespeare dataset. The goal is to build a decoder-only language model capable of generating coherent Shakespearean text, and to systematically explore how architectural choices — embedding size, number of attention heads, and depth — affect generation quality.

2. Architecture & Models

Embedding Generation

Each token is mapped to a learned 512-dimensional vector, giving the model a continuous space where semantically related words cluster together. Position embeddings are added element-wise to each token vector, making word order explicit to a model that otherwise treats input as a set rather than a sequence. Using learnable rather than fixed sinusoidal positional encodings lets the model adapt position representations to the statistical patterns of the training corpus, at the cost of requiring sufficient training data for those positions to converge.

  • Token Embedding: The vocabulary is projected into a $D=512$-dimensional space.
  • Position Embedding: Learnable positional vectors are added to indicate word order, as Transformers lack inherent recurrence.
  • Result: Input = Vector(Word) + Vector(Position).
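
A minimal sketch of this input construction (PyTorch assumed; the vocabulary size is a placeholder and the variable names are illustrative):

```python
import torch
import torch.nn as nn

D = 512        # embedding dimension (as above)
N = 64         # sequence length (Section 3)
VOCAB = 12000  # placeholder; the real size depends on the tokenizer

tok_emb = nn.Embedding(VOCAB, D)  # Vector(Word), learned
pos_emb = nn.Embedding(N, D)      # Vector(Position), learned (not sinusoidal)

idx = torch.randint(0, VOCAB, (1, N))        # a batch of token ids, shape (B, N)
x = tok_emb(idx) + pos_emb(torch.arange(N))  # Input = Vector(Word) + Vector(Position)
# x has shape (1, N, 512) and feeds the first Transformer block
```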

Architecture & Attention

A decoder-only Transformer generates tokens autoregressively by attending over all prior positions through causal multi-head attention, then projecting to a vocabulary distribution. No source sequence exists to encode, so an encoder-decoder stack would add parameters and training complexity without benefit. Eight attention heads partition the 512-dimensional representation into parallel subspaces, letting the model track multiple dependency types simultaneously, at the cost of attention complexity scaling quadratically with sequence length.

Causal Mask: Attention scores above the diagonal are set to $-\infty$ before the softmax (equivalently, only the lower triangle of the attention matrix is kept), preventing the model from "seeing the future" during training.
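
A minimal sketch of this mask inside scaled dot-product attention (PyTorch assumed; the 8-way head split is omitted, so the toy scale uses the full dimension $D$ rather than the per-head dimension):

```python
import torch
import torch.nn.functional as F

N, D = 64, 512
q = k = v = torch.randn(1, N, D)  # single-head toy tensors, shape (B, N, D)

scores = q @ k.transpose(-2, -1) / D ** 0.5              # (B, N, N) raw scores
causal = torch.tril(torch.ones(N, N, dtype=torch.bool))  # True at/below diagonal
scores = scores.masked_fill(~causal, float("-inf"))      # future positions -> -inf
weights = F.softmax(scores, dim=-1)                      # -inf entries become 0
out = weights @ v  # each position mixes only current and past values
```

After the softmax, the $-\infty$ entries are exactly zero, so position $t$ attends only to positions $\le t$.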

3. Implementation & Key Optimizations

Text to Token Conversion

We used a 90/5/5 split (training, validation, test). Compared with a more standard 70/15/15 split, this maximizes training volume on the small tiny_shakespeare corpus, letting the model capture more of its linguistic subtleties.

  • Preparing the Text (String and Words): The text is converted to lowercase and punctuation is isolated into separate tokens, so "king" and "king," both reduce to the single token "king" (the comma becomes its own token). This prevents the vocabulary from inflating with punctuation variants of the same word.
  • Sequencing: Fixed length ($N=64$) with overlap to increase example density.
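
A sketch of this preprocessing pipeline (the regex, file path, and stride value are illustrative assumptions, not the project's exact code):

```python
import re

def tokenize(text):
    # Lowercase, then isolate punctuation so "king," -> ["king", ","]
    text = text.lower()
    return re.findall(r"[a-z']+|[.,!?;:]", text)

tokens = tokenize(open("tiny_shakespeare.txt").read())

# 90/5/5 split: maximize training volume on a small corpus
n = len(tokens)
train = tokens[: int(0.90 * n)]
val = tokens[int(0.90 * n) : int(0.95 * n)]
test = tokens[int(0.95 * n) :]

# Overlapping windows of fixed length N=64 to increase example density
N, stride = 64, 32  # stride < N produces the overlap; 32 is an assumed value
examples = [train[i : i + N + 1] for i in range(0, len(train) - N, stride)]
# each window supplies inputs example[:-1] and next-token targets example[1:]
```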

4. Results

1. Unconstrained Generation

No structural stopping mechanism; generation produced run-on sentences.

2. Structured Generation

Included punctuation in vocabulary; stops at predicted period (.).
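
A sketch of this stopping rule, assuming a trained `model` that returns next-token logits and a token-to-id map `stoi` (both names are illustrative); greedy decoding is shown for simplicity:

```python
import torch

def generate(model, idx, stoi, max_new_tokens=100):
    """Autoregressively extend idx (shape (1, T)), halting at a predicted period."""
    period = stoi["."]  # '.' exists because punctuation is kept in the vocabulary
    for _ in range(max_new_tokens):
        logits = model(idx[:, -64:])  # condition on at most N=64 recent tokens
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        idx = torch.cat([idx, next_id], dim=1)
        if next_id.item() == period:  # structural stop at end of sentence
            break
    return idx
```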

3. Standard Architecture

4 heads, 512 dimensions. A balance between syntactic quality and computational cost.

4. High Capacity (32 Heads / 800 Dims)

Increasing the vector size from 512 to 800 lets each token embedding retain finer-grained distinctions and more semantic nuance.

5. Deep Model (5 Transformer Layers)

Our most powerful model. Stacked blocks allow for higher levels of abstraction, improving overall coherence.
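
Configurations 3 through 5 can be summarized as dictionaries that vary capacity or depth around the standard model (the `TransformerLM` constructor and the layer counts of configurations 3 and 4 are illustrative assumptions):

```python
configs = {
    "standard":      dict(n_heads=4,  d_model=512, n_layers=1),  # baseline (layer count assumed)
    "high_capacity": dict(n_heads=32, d_model=800, n_layers=1),  # wider heads and embeddings
    "deep":          dict(n_heads=4,  d_model=512, n_layers=5),  # five stacked blocks
}
# models = {name: TransformerLM(**cfg) for name, cfg in configs.items()}
```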

5. Conclusion & Future Perspectives

Each configuration in this sequence changes one axis at a time: stopping mechanism, capacity (head count and embedding width), or model depth. Comparing output coherence across configurations, rather than tracking a single loss metric, is the appropriate evaluation approach for a generative model where quality is inherently subjective. The coherence improvement from configuration 3 to 5 suggests that, at this scale, stacking transformer layers contributes more to output quality than increasing head count or embedding dimension alone.