The author is learning diffusion and flow matching models from scratch. The initial target was audio generation with mel-spectrograms, but the author pivoted to images after hitting two blockers: generated spectrograms are hard to assess by eye, so evaluation is subjective, and mel-to-audio conversion adds unnecessary complexity. The plan is to use celebrity image datasets to master diffusion mechanics before returning to audio. Key insight: choosing a problem domain with objective evaluation criteria, rather than one that needs an extra conversion step, makes it easier to isolate and master the fundamental concepts.