Image processors preprocess vision inputs, feature extractors preprocess audio inputs, and a processor handles multimodal inputs.
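
The rule above can be sketched as a small lookup. This is only an illustration, not part of any library: `pick_preprocessor` and the modality labels `"vision"` and `"audio"` are hypothetical names chosen here, and the returned strings assume the Transformers Auto classes `AutoImageProcessor`, `AutoFeatureExtractor`, and `AutoProcessor` are the loaders in question.

```python
# Hypothetical helper (not a library API): maps input modalities to the name
# of the preprocessing class family described in the sentence above.
def pick_preprocessor(modalities):
    kinds = set(modalities)
    if not kinds:
        raise ValueError("at least one modality is required")
    # More than one modality means multimodal input, so a processor is needed.
    if len(kinds) > 1:
        return "AutoProcessor"
    (kind,) = kinds
    if kind == "vision":
        return "AutoImageProcessor"
    if kind == "audio":
        return "AutoFeatureExtractor"
    # Modalities the sentence above does not cover are rejected, not guessed.
    raise ValueError(f"unsupported modality: {kind!r}")


print(pick_preprocessor(["vision"]))
print(pick_preprocessor(["audio"]))
print(pick_preprocessor(["vision", "audio"]))
```

The helper returns class names as strings so the sketch runs without installing anything; in practice the corresponding class would be loaded from a checkpoint.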