Mukul Pathak
2 min read · Oct 30, 2023


Lost in Translation: The Disconnect Between User Prompts and DALL-E3 Outputs

Overview

DALL-E3 passes each user prompt through a ChatGPT-like model that rewrites it before image generation. This mediator has two goals: keeping generated images within ethical guardrails and improving “understandability” for DALL-E3. Despite these goals, the middle layer causes several problems: mismatched expectations about the output, inaccurate images, and hallucinations that build up as users retry to get the result they wanted.

[Figure: How DALL-E3 processes user input]

Benefits/What It Aims to Do

  1. Guardrails: Ensures that the images generated align with organizational and ethical guidelines.
  2. Enhanced Understandability: Refines the prompt so that DALL-E3 can interpret it more reliably.

What’s Really Happening

  1. Mismatched Expectations: The User Prompt is not equivalent to the GPT-Enhanced Prompt, so the user expects one image while the model is asked for another.
  2. Wrong Output: The Enhanced Prompt may drop or modify important details, causing DALL-E3 to generate unintended images.
  3. Hallucination: As frustrated users retry, errors compound across attempts, resulting in hallucinated or irrelevant outputs.

[Figure: Examples of outputs that stray from the prompt]
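
One way to make the "Wrong Output" problem visible is to compare the two prompt layers directly. The sketch below is a rough illustration and not how DALL-E3 works internally: it assumes you can get hold of the enhanced prompt (as far as I know, the OpenAI Images API returns it as `revised_prompt` for DALL-E 3 requests), and it flags words from the user prompt that no longer appear after the rewrite. The function names, the stopword list, and both example prompts are made up for illustration.

```python
import re

# A tiny, hand-picked stopword list (an assumption; tune it for real use).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "with", "and", "at", "to"}


def keywords(prompt: str) -> set[str]:
    """Lowercased content words from a prompt, minus the stopwords above."""
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    return {w for w in words if w not in STOPWORDS}


def dropped_details(user_prompt: str, enhanced_prompt: str) -> set[str]:
    """Keywords the user asked for that are missing from the enhanced prompt."""
    return keywords(user_prompt) - keywords(enhanced_prompt)


# Both prompts are invented examples, not real DALL-E3 output.
user_prompt = "a red bicycle leaning on a brick wall at night"
enhanced_prompt = (
    "A vintage bicycle resting against a weathered brick wall, "
    "illuminated by warm street lamps"
)

print(sorted(dropped_details(user_prompt, enhanced_prompt)))
# prints ['leaning', 'night', 'red']
```

Exact word matching is crude: "resting" arguably covers "leaning", so treat the result as a list of details worth checking rather than proof that something was lost.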

The Problem of Multiple Layers of Interpretation

Every time a user retries, three distinct prompts are in play: the User Prompt, the Enhanced Prompt, and what DALL-E3 actually understands. This layering gets in the way of the user’s goal of a straightforward output and adds a level of indirection that can frustrate and confuse.

[Figure: Each change to the prompt complicates the request]
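
To keep the layers straight across retries, it can help to log each attempt with both prompts side by side. The following is a minimal sketch under a few assumptions: the `Attempt` class and `most_faithful` helper are hypothetical names, the example prompts are invented, and plain text similarity from Python's `difflib` stands in for "how close the rewrite stayed to what I typed". The third layer, what DALL-E3 actually understands, is not observable, so it is left out.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Attempt:
    user_prompt: str      # what the user typed
    enhanced_prompt: str  # what the mediator passed on to the image model

    def similarity(self) -> float:
        """Rough 0..1 textual similarity between the two prompt layers."""
        return SequenceMatcher(
            None, self.user_prompt.lower(), self.enhanced_prompt.lower()
        ).ratio()


def most_faithful(attempts: list[Attempt]) -> Attempt:
    """Return the attempt whose enhanced prompt stayed closest to the user's text."""
    return max(attempts, key=lambda a: a.similarity())


# Invented retry history for illustration only.
attempts = [
    Attempt(
        "red bicycle at night",
        "A vintage bicycle beside a sunlit brick wall in a quiet street",
    ),
    Attempt(
        "red bicycle at night, dark sky",
        "A red bicycle at night under a dark sky",
    ),
]

for number, attempt in enumerate(attempts, start=1):
    print(f"attempt {number}: similarity {attempt.similarity():.2f}")

print("closest rewrite:", most_faithful(attempts).enhanced_prompt)
```

Character-level similarity is only a proxy, since a rewrite can reuse the same words and still change the meaning. Even so, a log like this shows at a glance whether retries are converging on the request or drifting away from it.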

The Complexities

  1. Psychological Complexity: Cognitive dissonance occurs when there’s a disconnect between the user’s expectation and the model’s output.
  2. Computational Complexity: The…
