Vision is the central technological element in some of the best works of science fiction, like I, Robot, 2001: A Space Odyssey, and The Matrix. Without significant advancements in computer vision, these works would have no plot! The robots in I, Robot interact so fluidly with their world because they ingest and process vast amounts of visual information. What is HAL from 2001 if not a giant red camera, an eye, that represents the powerful awareness of AI? The Matrix is itself an unbelievably robust, computer-generated visual world. While we’re more glass-half-full than “robots are taking over”, we agree with the thrust of these works that computer vision is the foundation for awe-inspiring technological advancements.
If you watched these movies growing up, it was probably hard to picture any of it becoming reality. However, while some innovations are still on the horizon, others are just around the corner. CVPR is a great opportunity to get a glimpse of what will soon be viewed as real-world science, no longer fiction. Below are four key areas we are particularly interested in:
Visual Reasoning: Interacting with the World like Never Before
We take for granted our ability to look at the world around us and rapidly analyze the actions of everyone and everything in the scene to predict what’s about to happen. But if you’ve ever played an unfamiliar level in a video game, you’ve likely experienced the frustration of starting over and over again as new traps and enemies appear as you progress (cc: Elden Ring). For so long, that’s how computer vision models viewed the world: a structured process of familiar pattern recognition that would break if presented with anomalies and edge cases.
Advancements in visual reasoning capabilities like the ones discussed in this award-nominated paper allow for better predictions of complex interactions between multiple agents. At the conference, we’re eager to attend workshops like End-to-End Autonomous Driving and Open-Domain Reasoning Under Multi-Modal Settings, alongside speakers from DeepMind, Waymo, and other industry leaders.
Multi-Modality: The Rise of the AI Agent
When was the last time you only interacted with text for a full day? Probably never! We absorb information through sounds, visuals, text, code, smells, etc. It’s what makes our world so rich, and so complex for computers to process. Multi-modality is all about fusing different types of data—such as images, text, audio, and video—into a comprehensive model of the world.
One of the papers at CVPR introduces an innovative audio-video generation model. This model appears to be capable of delivering an enhanced viewing experience, but that’s just the start. In the future, the authors hope to add more capabilities, such as editing, and to expand accessibility with a user-friendly UI.
Efficiency: Pocket-Sized CV
Unless we’re going to start carrying Graphics Processing Units (GPUs) in our backpacks, we believe the power of CV advancements will only reach scale in the real world if the models can work robustly and efficiently on the edge. Researchers are consistently striving to improve the speed and resource utilization of AI models, making them more widely applicable to everyday life. Methods like distillation and compression reduce model size and complexity with little loss in performance, while parallelization and distributed training make better use of the available hardware.
One paper explores a new method to tackle resource-intensive sampling in diffusion models. The authors demonstrate that a new distillation approach can deliver an order-of-magnitude improvement in efficiency, lowering compute costs for high-resolution images.
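For readers newer to distillation, here is a minimal sketch of the idea in its classic classification form: a small "student" model is trained to match the softened output distribution of a larger "teacher". This is not the diffusion-specific method from the paper above, just the textbook version. The function names, the temperature of 2.0, and the toy logits are our own assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    The temperature**2 factor is the conventional scaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return (temperature ** 2) * kl


# Toy logits (made up): one student roughly agrees with the teacher, one does not.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

print("close student loss:", distillation_loss(close_student, teacher))
print("far student loss:", distillation_loss(far_student, teacher))
```

The student that roughly agrees with the teacher gets a much smaller loss than the one that disagrees, and minimizing this loss during training is what pulls a compact model toward the behavior of a larger one. In practice it is usually combined with a standard loss on ground-truth labels.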
Generative AI: Expanding the Content Horizon
How many times have you taken 15 photos to get the perfect shot? Generative AI gives you the tools to edit photographs, generate novel views, or produce entirely new worlds and creatures with natural language prompts. We expect these developments to propel generative AI towards more enterprise use cases, from video game and film production to architecture and product design, even extending to implications in fashion, art, and marketing.
The papers presented at the conference cover a wide range of topics, from generating new perspectives of moving scenes to predicting the articulation structure of 3D objects. The AI for Content Creation workshop adds to the excitement, as industry and academic leaders will come together to discuss what it will take to thoughtfully incorporate AI into content creation processes.
We believe that addressing everyday pain points is key to catalyzing the transition to commercialization. With that in mind, below are the end uses that are top of mind for us.
- Smart Manufacturing and Quality Control: The cost of poor-quality production can be as high as 20% of revenue, reinforcing the need for more effective quality control technology. Industrial inspection is a highly repetitive and extremely nuanced task, but AI models can process vast amounts of data from production lines and detect deviations or anomalies in real time to minimize production errors.
- Retail and E-commerce Personalization: Rapidly changing consumer preferences are making it even harder for retailers to make merchandising decisions, leading to inventory misjudgments that can cost up to 12% of sales per year. Generative AI and computer vision can help retailers reduce this cost with personalized recommendations, virtual try-ons, and predictive analytics. We’re looking forward to discussing this at the RetailVision workshop.
- Automated Video Production and Editing: Studios spend hundreds of millions on visual effects (VFX) each year even as VFX studios struggle to survive increasingly tight margins. Further, video-first social media like TikTok and Instagram have turned almost everyone into potential creators. Advancements in generative AI offer a toolkit that both studios and amateur creatives could use to dream up new worlds with far fewer resources. To us, it’s not a question of whether but when and how AI will be more deeply incorporated into creative processes.
If you’re excited about these use cases or the themes highlighted above, shoot us an email or find us at the conference – we’d be happy to discuss interesting papers or a novel application of research in the real world.