From the course: Tech Trends

GPT-4o, multimodal AI, and more

- OpenAI held their spring update event on May 13th, 2024, where they released their latest model, GPT-4o (the "o" stands for omni), and a bunch of other things. Here's the breakdown and what you need to know. The main update is the release of GPT-4o, the first model that integrates text, image, and audio and can combine all three modalities in both the input and the output phase. In effect, this means what used to require multiple operations (turn speech into a transcription, run the transcription through GPT, then turn the response back into speech) is now a single step and takes significantly less time. The new GPT-4o model is described as two times faster than GPT-4 Turbo, and the GPT-4o API is two times faster, 50% cheaper, and has five times higher rate limits than the GPT-4 Turbo API. Long story short, GPT-4o replaces GPT-4 Turbo as the new benchmark model from OpenAI today. With this in mind, here are the four things you need to know.

Number one, ChatGPT with GPT-4o is now free for everyone. With the release of GPT-4o, OpenAI also opens the gate for everyone to use the latest and most powerful model in ChatGPT, even those without an account. All users, including free users, now have full access to GPTs from the GPT Store, vision, web browsing, memory, and advanced data analysis. Premium, Team, and Enterprise users also get better performance, higher usage limits (80 messages every three hours), and earlier access to new features. When free and premium users exceed their usage limits, ChatGPT reverts to GPT-3.5 Turbo, as before.

Number two, multimodal is rapidly becoming the default. Our science fiction dream of an omniscient, voice-controlled AI assistant gets ever closer. Over the next few weeks, OpenAI will roll out full live voice and vision capabilities for ChatGPT, meaning you can have more fluid conversations with the app, show it things through the device camera, and get responses in real time. New to the audio model are significantly reduced lag (leading to more natural conversations), the ability to interrupt the model mid-sentence and mid-reasoning, and the model attempting to identify your emotional state and respond with a similar and appropriate emotional tone. The voice model is also coupled with an improved live vision model, so you can ask ChatGPT about what the camera is seeing in real time and get voice feedback. In the launch demo, the team had ChatGPT lead them through a basic math problem, commenting in real time as the user wrote the math out on a piece of paper. This hints at an immediate future where AI assistants not only respond to prompts, but can be set up to actively take part in solving tasks.

Number three, to help users take advantage of these multimodal features, OpenAI is releasing a ChatGPT desktop app with new integration features. The app is macOS-only as of this recording and ships with the standard ChatGPT voice mode and image upload; GPT-4o's new voice and video features will come to the app later, according to OpenAI. Using the desktop app, you can talk to ChatGPT directly without opening your browser, and you can ask it to look at screenshots from your desktop to do things like help solve a coding problem, provide feedback on an image, or analyze a graph.
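
To make the multimodal shift concrete, here's a minimal sketch of what a single GPT-4o request mixing text and an image looks like through the OpenAI API. It assumes the official OpenAI Python SDK and an OPENAI_API_KEY environment variable; the prompt and image URL are placeholders rather than anything shown in the demo, and at launch the API exposed GPT-4o's text and image inputs, with the new audio capabilities still to come.

```python
# Minimal sketch: one GPT-4o call that combines text and an image in a single request.
# Assumes the official OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY
# environment variable. The prompt and image URL below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image go in as parts of the same message,
                # instead of separate calls to separate models.
                {"type": "text", "text": "What's wrong with the code in this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
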
And all of this brings us to number four: new security implications. Broader availability, new input and output modalities, and ChatGPT as a native desktop app highlight existing security issues and introduce new ones, especially for the enterprise. As ChatGPT becomes easier to use through its voice and vision capabilities, users are likely to perceive the app more and more as a natural collaboration partner, and once they have it as a native desktop app they can talk to and share their screen with, policies and guardrails around when and how to use AI assistants in your work become paramount. Put in plain English, there's a huge UX difference between copying and pasting or taking a picture of code or information to be processed by a third-party app, and just clicking a button in an app and asking, "What's wrong with this code?" or "Help me understand this spreadsheet." Bottom line: with the new capabilities of GPT-4o comes a heightened urgency for robust policies, practices, and oversight when it comes to AI use in any privacy- and security-oriented environment.

OpenAI has been the de facto leader in the generative AI space since the release of ChatGPT in November 2022. This release of GPT-4o, combined with the unlocking of ChatGPT, shows the company pushing itself and the entire leading pack toward a future where multimodal conversational AI with voice and vision input is front and center. This is in line with what all the AI companies are doing right now, and I expect we'll see this type of multimodal interface become the new standard for our AI interactions very soon. The future is here today, and it's multimodal.