The author, learning diffusion and flow matching from scratch, pivoted from audio mel-spectrogram generation to the image domain because of evaluation difficulty and conversion complexity. The author is using a celebrity image dataset to master the core generation mechanics in a simpler domain before tackling audio. Key insight: a deliberate learning progression, mastering fundamentals in an accessible domain (images) before applying them to harder, domain-specific challenges (mel-to-audio conversion).