Author is learning deep learning by building diffusion and flow matching models from scratch. Initially targeted audio generation using mel-spectrograms, but faced two obstacles: mel-spectrogram quality is hard to judge by eye, and converting mel-spectrograms back to audio adds its own complexity. Key pivot: shifted to the image domain, using celebrity datasets, to master the core diffusion and flow matching mechanics before returning to audio. The strategy is to learn the fundamental generation principles thoroughly first, then tackle domain-specific complexity.
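For readers unfamiliar with the "core mechanics" mentioned above, here is a minimal sketch of how one flow matching training batch could be built. It assumes the common straight-line path between noise and data; the summary does not say which formulation the author used, so the function names (`flow_matching_batch`, `flow_matching_loss`) and the random stand-in images are purely illustrative.

```python
import numpy as np

def flow_matching_batch(x1, rng):
    """Build inputs and regression targets for one flow matching step.

    Assumes a straight-line path x_t = (1 - t) * x0 + t * x1 from noise x0
    to data x1, whose velocity is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise, same shape as the data
    # One time value per sample, shaped to broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1        # point on the path at time t
    target = x1 - x0                    # velocity the model should predict
    return xt, t, target

def flow_matching_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target) ** 2))

rng = np.random.default_rng(0)
# Hypothetical stand-in for a batch of 4 RGB images of size 8x8 in [-1, 1].
images = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))
xt, t, target = flow_matching_batch(images, rng)

# A real model would map (xt, t) to a velocity; zeros serve as a placeholder.
loss = flow_matching_loss(np.zeros_like(xt), target)
```

The same batch construction applies to a mel-spectrogram treated as a 2D array, which is one reason practicing on images first can carry over to audio later.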