1-4 How to Use Multimodal Features to Get Things Done
April 19, 2025
In the previous article, we covered Cursor's basic setup and essential keyboard shortcuts. Now let's explore how to use multimodal features to work more efficiently.
"Multimodal" is a fancy way of saying the AI can handle different types of information – not just text, but also images, videos, and voice. Think of it like having a conversation with someone who can see what you're pointing at, not just hear your words.
Remember the early days of ChatGPT in late 2022 and early 2023? Back then, it was purely text-based – you typed in text, and it gave you text back. Now ChatGPT can handle images, generate pictures, and even respond with voice. That's the evolution from single-modal to multimodal.
Cursor brings some of this multimodal capability to coding, most notably the ability to include images in your prompts. This is particularly useful for frontend work, where you need to recreate specific designs or UI components. Describing a design in words is like trying to explain a sunset to someone who's never seen one: possible, but a picture makes it instantly clear.
How to Upload Images in Cursor
Let's say you want Cursor to build a component based on a design you found. Here's how to do it:
Take a screenshot of the design you want to recreate. This gives Cursor a visual reference for what you're trying to build.
Paste the image into Cursor's chat input. When you do this, you'll see an "image" tag appear. Hover over it to preview the image and make sure it's the right one.
Add your prompt along with the image. Cursor will analyze both the image and your text instructions to generate the code. Review the generated code, then click "Apply" to write it into your file.

This is especially useful for frontend development. Instead of writing lengthy descriptions like "I want a button that's blue with rounded corners and a subtle shadow," you can simply show Cursor exactly what you want.
Using Voice Input with Cursor
Now let's talk about voice input, which is literally coding by talking. This requires a separate tool, but it's worth the setup. Addy Osmani wrote about this in his article "Speech-to-Code: Vibe Coding with Voice," where he recommends a tool called superwhisper.
superwhisper really does let you "code with your mouth." It's surprisingly accurate and feels natural once you get used to it.

How to Set Up superwhisper with Cursor
Getting voice input working with Cursor is straightforward:
Download superwhisper: Go to https://superwhisper.com/ and click "Download Now." Install it like any other app.
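Use it with Cursor: Open both Cursor and superwhisper, click into Cursor's chat input, and trigger superwhisper's recording shortcut (you can configure it in superwhisper's settings). Your speech is transcribed and inserted where the cursor is, so you can speak your instructions instead of typing them.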

Voice input is particularly useful when you're thinking through a problem out loud or when your hands are busy with something else. It's not something you'll use every day, but when you need it, it's incredibly convenient.
Why This Matters
These multimodal features give you new ways to communicate with your AI assistant:
- Image upload: Perfect for recreating UI designs quickly and accurately
- Voice input: Great for when you want to think out loud or when typing isn't convenient
You might not use these features every day, but it's worth experimenting with them. The more ways you can communicate with your AI tools, the more effective they become.
Support ExplainThis
If you found this content valuable, please consider supporting our work with a one-time donation of whatever amount feels right to you through Buy Me a Coffee.
Creating in-depth technical content takes significant time. Your support helps us continue producing high-quality educational content accessible to everyone.