Am I missing something basic here?
I always thought omnimodal just meant the AI can take in multiple types of input (like text, audio, and video) at the same time and process them together. Is that wrong? Because people keep talking about it like it is this massive architectural revolution, but haven't we been uploading images alongside text prompts to ChatGPT for a while now?