This is a research demo of the LENS (LLMs ENhanced to See) system, which leverages a set of vision modules to describe images to an LLM. You can talk to LENS by uploading images and asking questions about them.
LENS holds its own against strong competitors like Flamingo and BLIP-2 on academic benchmarks, which is surprising because we don't tune the LENS LLM to understand images at all. That said, it's important to acknowledge that LENS still fails often, and you should not use it for anything high-stakes. The failures are often amusing and part of the fun 😆, but sometimes they are not.
The models that make up LENS are trained on samples from the internet, which may contain harmful content, and LENS may propagate that content.

Note that any data you provide may be logged. You can help us make the models better by upvoting helpful responses, downvoting unhelpful or incorrect responses, or flagging harmful content.
See the blog post and the paper for more information.
For this demo of LENS, the vision modules we use are CLIP and BLIP, and the LLMs we use are Flan-T5, OpenAssistant, and ChatGPT.
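To make the pipeline concrete, here is a minimal sketch of the idea using off-the-shelf Hugging Face pipelines: the vision modules turn the image into text (tags from CLIP, a caption from BLIP), and a frozen LLM answers questions over that text alone. The model checkpoints, candidate tag list, and prompt wording below are illustrative assumptions, not the exact ones LENS uses.

```python
from PIL import Image
from transformers import pipeline

# Vision modules: CLIP for zero-shot tagging, BLIP for captioning.
# Checkpoints here are common public ones, chosen for illustration.
tagger = pipeline("zero-shot-image-classification",
                  model="openai/clip-vit-base-patch32")
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
# A frozen LLM; it is never fine-tuned on images.
llm = pipeline("text2text-generation", model="google/flan-t5-base")

def answer(image_path: str, question: str) -> str:
    image = Image.open(image_path)

    # CLIP scores the image against a vocabulary of candidate tags
    # (in practice this vocabulary would be much larger).
    candidate_tags = ["dog", "cat", "car", "tree", "person", "food"]
    tags = [r["label"] for r in tagger(image, candidate_labels=candidate_tags)[:3]]

    # BLIP produces a free-form caption of the image.
    caption = captioner(image)[0]["generated_text"]

    # The LLM never sees pixels -- only the textual description.
    prompt = (f"Image tags: {', '.join(tags)}\n"
              f"Image caption: {caption}\n"
              f"Question: {question}\nAnswer:")
    return llm(prompt, max_new_tokens=32)[0]["generated_text"]

print(answer("photo.jpg", "What animal is in the picture?"))
```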