Building an AI-powered Course Generator POC [First Thoughts]


In this post, I share some quick insights from building an AI-powered course generator using OpenAI’s o4-mini and 4o-mini models. I touch on prompt engineering, efficient workflows, and practical lessons learned.

Introduction

I developed an AI-powered course generator as a practical exploration of OpenAI’s APIs and best practices, focusing on quickly creating personalized, structured educational content. The current prototype produces a comprehensive, customized course on a given topic in under three minutes, at a cost of less than $0.25 per course.

My motivation was simple: to deepen my understanding of building AI applications and to improve the way I learn new information.

Project Overview

For this proof of concept, I chose Fastify and OpenAI’s SDK for a quick and straightforward setup. The current version is intentionally minimal, with no schema validation, authentication, or deployment strategy, so I could prioritize rapid local experimentation and iteration.

Modular Course Generation Workflow

Initially, I tried generating entire courses with a single prompt, but that resulted in superficial and insufficiently detailed output. The outlines always sounded good, but the actual lesson content lacked depth. Thus, the architecture evolved into a sequential, modular flow with distinct steps:

  1. Metadata Generation: Create a clear course title and description from a user-supplied topic (e.g., “I want to learn React”) and learning goals (e.g., “I want to be able to build web apps”).
  2. Course Outline: Structure high-level outlines for the course modules, each designed to progress steadily toward the user’s ultimate learning goals.
  3. Detailed Module Expansion: Iterate through each module and outline structured lesson plans with more focused outcomes.
  4. Content Population: Populate lessons with dynamic content types like slide presentations, case studies, and quizzes.
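The four steps above can be sketched as a simple orchestration skeleton. The step functions here are hypothetical placeholders; in the real flow each one wraps a model call:

```javascript
// Sketch of the sequential, modular pipeline. Each step function stands in
// for a prompt + model call; the return shapes are illustrative only.
async function generateMetadata(topic, goals) {
  return { title: `Intro to ${topic}`, description: `A course working toward: ${goals}` };
}

async function generateOutline(metadata) {
  return { modules: [{ name: "Module 1" }, { name: "Module 2" }] };
}

async function expandModule(mod) {
  return { ...mod, lessons: [{ title: "Lesson 1" }] };
}

async function populateLesson(lesson) {
  return { ...lesson, content: [{ type: "slides", body: "..." }] };
}

async function generateCourse(topic, goals) {
  const metadata = await generateMetadata(topic, goals);   // step 1
  const outline = await generateOutline(metadata);         // step 2
  const modules = [];
  for (const mod of outline.modules) {
    const expanded = await expandModule(mod);              // step 3
    expanded.lessons = await Promise.all(expanded.lessons.map(populateLesson)); // step 4
    modules.push(expanded);
  }
  return { ...metadata, modules };
}
```

Keeping each step behind its own function boundary is what makes it easy to swap prompts, models, or caching per step later.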

This approach not only improved course quality but also made it easier to handle scalability and to integrate techniques like retrieval-augmented generation (RAG) for grounded course content.

Ensuring Determinism

Achieving predictable, high-quality outputs required careful prompt construction and rigorous schema enforcement. A few things I made sure to use:

  • OpenAI’s structured response format and explicitly defined JSON schemas for a predictable response structure.
  • Detailed instructions within prompts, including explicit “TASK”, “OUTPUT”, and “QUALITY GATE” guidelines.
  • Silent self-validation protocols within the prompts. These instruct the model to internally verify and correct its output before responding, ensuring each generated lesson consistently meets the criteria defined in its prompt.

For instance, prompts contained self-validation steps like:

SELF-VALIDATION (perform silently)

- Verify slides array has at least 5 entries.
- Check that each slide type conforms to predefined categories.
- Confirm all key points are adequately addressed.
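The JSON schemas mentioned above can be sketched roughly like this. The field names, slide types, and counts here are illustrative, not the exact ones from the project:

```javascript
// Illustrative schema for a lesson, in the shape OpenAI's structured response
// format ("json_schema") expects. Note: strict structured outputs support a
// limited JSON Schema keyword set, so a constraint like "at least 5 slides"
// may still need to be enforced via the prompt's quality gate.
const lessonSchema = {
  name: "lesson",
  strict: true,
  schema: {
    type: "object",
    properties: {
      title: { type: "string" },
      slides: {
        type: "array",
        items: {
          type: "object",
          properties: {
            type: { type: "string", enum: ["concept", "example", "case_study", "quiz"] },
            body: { type: "string" },
          },
          required: ["type", "body"],
          additionalProperties: false,
        },
      },
    },
    required: ["title", "slides"],
    additionalProperties: false,
  },
};

// Passed to the SDK roughly like so (requires the `openai` package and an API key):
// const completion = await client.chat.completions.create({
//   model: "gpt-4o-mini",
//   messages,
//   response_format: { type: "json_schema", json_schema: lessonSchema },
// });
```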

SDK calls were encapsulated within a helper function, incorporating retries and a local caching mechanism to reduce API instability and save tokens while working on the later stages of the flow.
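A minimal sketch of such a helper, assuming an in-memory cache and exponential backoff (the real implementation may differ; `callModel` stands in for the actual SDK call):

```javascript
import crypto from "node:crypto";

// Hypothetical wrapper: retries a model call with backoff and caches results
// keyed by prompt ID + serialized input, as described above.
const cache = new Map();

async function withRetriesAndCache(requestConfig, input, callModel, maxRetries = 3) {
  const key = crypto
    .createHash("md5")
    .update(requestConfig.promptId + JSON.stringify(input))
    .digest("hex");

  if (cache.has(key)) return cache.get(key); // cache hit: no tokens spent

  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await callModel(requestConfig, input);
      cache.set(key, result);
      return result;
    } catch (err) {
      lastError = err;
      // simple exponential backoff between attempts
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }
  throw lastError;
}
```

Because every step goes through this one helper, retries and caching come for free while iterating on downstream prompts.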

Early Optimizations

Caching: Mitigating Token Costs

The per-request costs for these prompts are very affordable, but costs started to climb quickly during development because of how fast I was iterating on prompt styles. To reduce them, I implemented a simple caching strategy that takes the variable input of each step and hashes it together with the prompt ID.

Example:

import crypto from "node:crypto";

const hash = crypto
  .createHash("md5")
  .update(requestConfig.promptId + JSON.stringify(input))
  .digest("hex");

This hashing strategy reduces costs, but small variations in input like “learn React” vs. “learning React.js” result in unnecessary cache misses. A semantic normalization layer will be essential to improve hit rates. OpenAI offers a platform [solution](https://platform.openai.com/docs/api-reference/graders/text-similarity) for this that I intend to test out.
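Even before reaching for semantic similarity, a cheap syntactic normalization pass could recover some of those misses. A minimal sketch (hypothetical, not yet in the POC):

```javascript
// Normalize user input before hashing: lowercase, strip punctuation, and
// collapse whitespace, so trivially different inputs produce the same key.
function normalizeInput(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation, keep letters/digits
    .replace(/\s+/g, " ")
    .trim();
}
```

This only handles surface variation; “learn” vs. “learning” still miss, which is exactly the gap a text-similarity grader would close.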

Multi-model Pipeline

Not all models are created equal. Each has its own strengths and weaknesses, and recognizing the right time to use a particular model has a profound effect on your system. For this project I stuck with two models: one for reasoning and one general-purpose.

export const MODEL_IDS = {
  GENERAL_MODEL: "gpt-4o-mini",
  REASONING_MODEL: "o4-mini-2025-04-16",
};

The general-purpose model handled straightforward, structured tasks, while I used the reasoning model for planning tasks that were nuanced, like generating course outlines relevant to the user’s learning goals.
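One way to make that split explicit is a small routing helper. This is a hypothetical sketch; the step names are illustrative, but the model IDs match the ones above:

```javascript
// Route each pipeline step to the appropriate model. Steps that require
// planning and nuance get the reasoning model; the rest use the cheaper,
// faster general-purpose model.
const MODEL_IDS = {
  GENERAL_MODEL: "gpt-4o-mini",
  REASONING_MODEL: "o4-mini-2025-04-16",
};

const REASONING_STEPS = new Set(["courseOutline", "moduleExpansion"]);

function modelForStep(step) {
  return REASONING_STEPS.has(step) ? MODEL_IDS.REASONING_MODEL : MODEL_IDS.GENERAL_MODEL;
}
```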

Parallelization

Using Promise.all to batch fan-out requests significantly sped up overall processing time. The process is still longer than I’d like, between two and three minutes, but that may be something I have to live with, as more powerful models will run slower than the mini models I’ve been using for development.
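The fan-out pattern can be sketched with a concurrency cap, so a large course doesn’t trip rate limits. This is an illustrative helper, not the POC’s exact code:

```javascript
// Run an async function over a list with at most `limit` calls in flight.
// A plain Promise.all fires everything at once; this bounds the fan-out.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // synchronous claim, safe on the single JS thread
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Used like `mapWithLimit(modules, 3, expandModule)`, this keeps the speedup of parallelization while leaving headroom for token and rate limits.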

Challenges & Lessons Learned

  • Prompt Design: Finding the right balance between comprehensive instructions and brevity took a lot of trial and error. Overly verbose prompts caused confusion, while insufficient instructions led to unreliable outputs, failed retries, and increased cost.
  • Caching Efficiency: Initial caching strategies based solely on serialization and hashing proved inadequate for handling input variations. Implementing semantic normalization is the next approach to explore.
  • Parallelization: While parallelization enhanced performance, there are practical limits due to token usage and model rate limits. This is a problem I’ll spend more time on in the future.

Insights & Key Takeaways

This experiment underscored several valuable insights:

  • Explicit, structured prompting with built-in self-validation drastically improved model reliability.
  • Modularizing tasks into discrete steps improved scalability, flexibility, and overall output quality.
  • Proper context management proved critical: excessive detail confused the models, while insufficient context reduced output quality. A careful balance was essential.

Future Direction

Next steps for evolving this POC into a robust, scalable platform include:

  • Experimenting with model parameters, like temperature, to balance creativity and accuracy, and fine-tuning specific models for targeted tasks.
  • Implementing rigorous schema validation (using Zod) and robust authentication for a secure, scalable API for the frontend.
  • Extending the system’s capabilities by diversifying lesson types and content richness.
  • Implementing RAG for lesson generation to prevent hallucinations and maintain high factual accuracy of course content.
  • A more sophisticated caching strategy; I hope to leverage a text-similarity grader to increase cache hits.

Are you tackling similar prompt optimization or caching challenges? I’d love to exchange experiences. Feel free to reach out!

© 2025 Jordon McKoy