AI Engineering

The Thousand-Page Problem

Building a Gemini-powered system to turn massive document collections into polished presentations. Here's the architecture that actually works.

[Hero image: documents converging into presentation slides through streams of data]
01

Two Million Tokens Changed Everything

The math used to be brutal. A typical enterprise PDF clocks in around 175 tokens per page. Your 1,000-page annual report? That's 175,000 tokens minimum. Two years ago, that meant building elaborate chunking pipelines and vector databases, then hoping the retrieval system surfaced the right passages.

Then Google DeepMind shipped Gemini 2.0 with a 2-million-token context window. Suddenly, you can feed the model your entire document collection and ask it to synthesize a coherent presentation. No chunking. No retrieval. Just raw understanding.

[Figure: bar chart of context-window growth from GPT-3 (23 pages) to Gemini 2.0 (11,400+ pages)]
Context windows have grown 500x in five years. Gemini 2.0 can theoretically "read" over 11,000 PDF pages in a single prompt.

The catch? Cost scales linearly with input tokens. Naively shoving a million tokens into every API call will bankrupt your startup before lunch. The real engineering challenge isn't capacity anymore—it's economics.

02

The Hybrid Approach That Actually Ships

Forget the "RAG vs. Long Context" debates. In production, you need both. The winning architecture uses context caching for the expensive stuff and targeted retrieval for the rest.

Here's the pattern: Upload your 1,000 pages once through Gemini's Files API and cache them. Pay once to write the cache, plus a storage fee (~$4.50 per million tokens per hour) while it lives. Then hammer that cached context with cheap follow-up queries to generate your slides. The first query costs dollars; subsequent queries cost fractions of a cent.
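
In the google-genai Python SDK, the whole pattern is a handful of calls. A minimal sketch, with the caveat that the file name, prompts, and TTL are illustrative and caching requires pinning an explicit model version:

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Upload once: large PDFs go through the Files API, not inline in the prompt.
report = client.files.upload(file="annual_report.pdf")

# Write the cache once; the TTL covers an interactive editing session.
cache = client.caches.create(
    model="gemini-2.0-flash-001",
    config=types.CreateCachedContentConfig(
        contents=[report],
        system_instruction="You are a presentation analyst.",
        ttl="3600s",
    ),
)

# Subsequent calls reference the cache instead of re-sending ~175K tokens.
response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="List the five most presentation-worthy findings in this report.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)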

The key insight: Google's "File Search" tool handles chunking, embedding, and retrieval automatically. You're not building RAG infrastructure—you're calling an API that does it for you.

For presentation generation specifically, the hybrid workflow looks like this: Use the full context to generate a coherent outline (this is where global understanding matters). Then use targeted retrieval to populate each slide with specific data, quotes, and supporting evidence. The outline ensures narrative flow; the retrieval ensures accuracy.

03

The Production Pipeline

Let's get concrete. A real system processing 1,000 PDF pages into a 10-slide deck follows four stages: Upload, Index, Generate, Export.

[Diagram: pipeline from PDF upload via the Files API, through File Search indexing and thinking-model outline generation, to Gemini 2.0 slide JSON output and Google Slides API export]
Recommended pipeline: ~3 minutes total, ~$0.15 per generation, 85% accuracy on factual content.

The upload phase handles files up to 2GB through Gemini's Files API. Processing takes 10-20 seconds for a thousand pages. The indexing phase (File Search) automatically chunks and embeds your content—budget 30 seconds.

Generation is where Gemini 2.0 Flash earns its name. The thinking variant (gemini-2.0-flash-thinking) drafts the presentation outline in seconds; parallel calls to the standard model then populate each slide with JSON-structured content. Total generation time: ~2 minutes.
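
A sketch of that two-phase split, again with the google-genai SDK. Model IDs and prompts are placeholders; the async client lives under client.aio:

import asyncio

from google import genai
from google.genai import types

client = genai.Client()

def draft_outline(report_file) -> str:
    # Phase 1: full document context plus the thinking model, where reasoning matters.
    response = client.models.generate_content(
        model="gemini-2.0-flash-thinking-exp",
        contents=[report_file, "Draft a 10-slide outline for an executive briefing."],
    )
    return response.text

async def fill_slides(cache_name: str, topics: list[str]) -> list[str]:
    # Phase 2: one cheap parallel call per slide against the cached context.
    async def fill(topic: str) -> str:
        response = await client.aio.models.generate_content(
            model="gemini-2.0-flash-001",
            contents=f"Produce JSON slide content for: {topic}",
            config=types.GenerateContentConfig(
                cached_content=cache_name,
                response_mime_type="application/json",
            ),
        )
        return response.text

    return await asyncio.gather(*(fill(t) for t in topics))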

Export is the easy part. Your structured JSON maps directly to Google Slides API calls or python-pptx templates. Under 5 seconds.
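
Assuming the Slide schema from section 05, a python-pptx exporter is mechanical. A sketch using the default template's layouts; a branded template would use different indices:

from pptx import Presentation

def export_deck(slides: list[Slide], path: str = "deck.pptx") -> None:
    prs = Presentation()
    layout = prs.slide_layouts[1]  # "Title and Content" in the default template

    for spec in slides:
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = spec.title
        body = slide.placeholders[1].text_frame
        for element in spec.elements:
            if element.type in ("heading", "bullet_list"):
                body.add_paragraph().text = element.content
        # Accessing notes_slide creates the notes page if it doesn't exist yet.
        slide.notes_slide.notes_text_frame.text = spec.speaker_notes

    prs.save(path)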

04

Making the Economics Work

This is where most AI projects die. You build a beautiful prototype, demo it to stakeholders, then realize it costs $5 per presentation to run. Your CFO does the math on 10,000 monthly users and your project becomes a "learning experience."

[Figure: horizontal bar chart of cost per presentation: naive long context ($1.25), cached context ($0.12), RAG pipeline ($0.08), hybrid recommended ($0.15)]
Context caching reduces per-generation costs by 90%. The hybrid approach is slightly more expensive than pure RAG but significantly more accurate.

Three techniques matter:

Context Caching is non-negotiable. You're paying to "teach" Gemini your documents once, then running cheap queries against that knowledge. The cache persists for an hour by default—plenty of time for interactive editing sessions.

Batch API cuts costs 50% for non-interactive workloads. If your users upload documents and come back later for results, batch processing is free money.

Model selection matters more than you think. Use gemini-2.0-flash-thinking only for the outline generation where reasoning quality matters. Use standard gemini-2.0-flash for the slide content where speed trumps depth.

05

The Schema That Prevents Chaos

The hardest-won lesson in AI engineering: never trust unstructured output. Gemini 2.0's native JSON mode enforces your schema at the token level—not through post-processing, not through "please return valid JSON" prompting, but through actual constrained decoding.

Your slide schema becomes the contract between your AI and your UI:

from typing import List, Literal, Optional

from pydantic import BaseModel

class SlideElement(BaseModel):
    type: Literal["heading", "bullet_list", "image", "chart"]
    content: str
    layout_position: Optional[str] = None  # semantic hint: "north", "west", "center"

class Slide(BaseModel):
    title: str
    speaker_notes: str
    elements: List[SlideElement]
    background_theme: str

Notice what's missing: pixel coordinates, font sizes, absolute positioning. You're asking the model for semantic layout ("image left, text right") and letting your frontend handle the rendering. This is the Gamma.app insight—separate content generation from layout rendering completely.

When the model outputs layout_position: "west", your React component knows what that means. When it outputs x: 142, y: 387, you've lost the game.
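
Enforcing the contract is a single config change: the google-genai SDK accepts a Pydantic model as the response schema. A sketch, with model ID and prompt as placeholders:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Generate the slide covering Q3 revenue drivers.",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Slide,  # the Pydantic model above, enforced while decoding
    ),
)
slide = Slide.model_validate_json(response.text)  # parses cleanly or raises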

06

What Will Break First

I've watched this pattern enough times to predict the failure modes. Here's what will bite you:

The "one-shot" temptation. You'll want to prompt "Generate a 10-slide deck about Q3 earnings." The model will produce generic, repetitive slides that miss the most important insights. The fix: Chain of Thought. Generate outline first, get human approval, then generate content slide-by-slide.

Image hallucination. Gemini can read charts in your PDFs but cannot reproduce them. If you ask it to "recreate the revenue chart," you'll get a plausible-looking fake. Use PyMuPDF to extract actual images and embed them directly.
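
A PyMuPDF sketch for pulling the real artwork out of a PDF; the output directory and naming scheme are arbitrary:

import os

import pymupdf  # PyMuPDF

def extract_images(pdf_path: str, out_dir: str = "assets") -> list[str]:
    # Save every embedded image so slides can reuse the original chart
    # instead of asking the model to redraw one it can only approximate.
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    doc = pymupdf.open(pdf_path)
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference number of the image object
            info = doc.extract_image(xref)
            path = f"{out_dir}/page{page.number}_img{xref}.{info['ext']}"
            with open(path, "wb") as f:
                f.write(info["image"])
            paths.append(path)
    return paths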

Rate limit surprise. Streaming complex JSON burns through output tokens fast. Set max_output_tokens to 8K+ for full deck generation, or generate slide-by-slide to avoid mid-JSON cutoffs.
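
The guard is one parameter, assuming the google-genai SDK's field name:

from google.genai import types

config = types.GenerateContentConfig(
    max_output_tokens=8192,  # headroom so full-deck JSON isn't cut off mid-object
    response_mime_type="application/json",
)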

Coherence drift. Generating slides in parallel is fast but risks inconsistent terminology and tone. Generate the entire outline in a single pass using the full context window, then populate slides in parallel. The outline maintains coherence; parallel generation maintains speed.

The Adjacent Possible

The 1,000-page-to-presentation pipeline isn't hypothetical anymore. The components exist. The economics work (barely). What's missing is the taste—knowing when to summarize and when to quote directly, when to visualize and when to bullet point. That's still, for now, a human job. But the heavy lifting? That's increasingly machine territory.