Use a [pipeline] for audio, vision, and multimodal tasks.
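
As a rough illustration of that sentence, the sketch below pairs each modality with one task identifier and builds a pipeline for it. It assumes the `transformers` package is installed. The task strings are existing pipeline task names, but the `EXAMPLE_TASKS` table and the `make_pipeline` helper are hypothetical names introduced here only for illustration.

```python
# Hypothetical lookup table: one example task per modality.
# The task strings are pipeline task identifiers; the table itself is an assumption.
EXAMPLE_TASKS = {
    "audio": "automatic-speech-recognition",
    "vision": "image-classification",
    "multimodal": "visual-question-answering",
}


def make_pipeline(modality, model=None):
    """Build a pipeline for one of the modalities listed in EXAMPLE_TASKS."""
    # Imported lazily so the table can be inspected without loading the library.
    from transformers import pipeline

    task = EXAMPLE_TASKS[modality]
    # With model=None the library falls back to its default checkpoint for the task.
    return pipeline(task=task, model=model)
```

Calling `make_pipeline("vision")` would download a default checkpoint on first use, so network access is assumed. The returned object can then be called on an input such as an image path or URL. Passing an explicit `model` name is usually preferable to relying on the default.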