WildVision Arena and the Battle of Multimodal AI: We Are Not the Same

Black and white line drawing generated by the ControlNet Canny model, depicting a woman in a wetsuit holding a surfboard on the beach. A text from CLIP Interrogator, describing an image, is superimposed over the drawing and reads: "a woman in a wet suit holding a surfboard on the beach with waves in the background and a blue sky, promotional image, a colorized photo, precisionism."

Vision-language models are one of the many cornerstones of machine learning and artificial intelligence discussed in this blog. I once ‘predicted’ AI will only be a threat to humanity the day it grasps self-deprecating humor. Many of us have experimented with chatbots that help brainstorm and craft text-to-image prompts, and models like CLIP and DeepBooru are practical tools widely used in text-to-image generation. CLIP and DeepBooru are among the earliest models used in generative AI workflows that could be described as ‘vision-language’. This post is not intended to be a tutorial or a deep dive into what a vision-language model is, but I thought the example use case below, inspired by the featured image of an old blog post, would help illustrate where these kinds of models are coming from and where they are headed… the long way towards Multimodal AI:

Screenshot of a Stable Diffusion web interface in Automatic1111, displaying an img2img prompt highlighted in yellow, reading “a woman in a wet suit holding a surfboard on the beach with waves in the background and a blue sky, promotional image, a colorized photo, precisionism.” This prompt was created by the “Interrogate CLIP” feature, as indicated by a red oval around the button. On the left side of the screen, a large image of a woman in a wetsuit holding a surfboard on the beach is visible. To the right, a sequence of six smaller images are shown, representing variations of the large image that were generated after the “Generate” button was clicked.
[Caption by ALT Text Artist GPT]
Screenshot of a Stable Diffusion web interface in Automatic1111, showing an img2img prompt in yellow highlight with the text “1girl, aircraft, asian, beach, black hair, blurry, blurry background, bodysuit, cloud, day, depth of field, horizon, lips, ocean, outdoors, planet, realistic, red lips, retro artstyle, short hair, sky, solo, space, surfboard, water, wetsuit.” This prompt was generated by clicking the “Interrogate DeepBooru” button, encircled in orange on the screen. On the left, a prominent image of a woman with black hair in a wetsuit holding a surfboard on a beach is visible. On the right, there are six derivative images generated by the system based on the initial image and prompt. [Caption by ALT Text Artist GPT]

Particularly since ChatGPT-Vision was released last September, there has been an increase in mainstream adoption of these models. Increasingly sophisticated versions are integrated into ChatGPT, the most popular chatbot, as well as its main competitors like Gemini (formerly Google Bard) and Copilot. The intense competition in the chatbot space is reflected in the ever-increasing number of contenders in the LMSYS Chatbot Arena leaderboard, and in my modest contribution with the SCBN Chatbot Battles I’ve introduced in this blog and complete as time allows. Today, we’re exploring WildVision Arena, a new project in Hugging Face Spaces that brings vision-language models to compete. The mechanics of WildVision Arena are similar to those of LMSYS Chatbot Arena. It is a crowd-sourced ranking based on people’s votes: you enter any image (plus an optional text prompt) and are presented with two responses from two different models, whose names stay hidden until you vote for the answer that looks better to you. I’m sharing a few examples of what I’ve tested so far, and we’ll end this post with a ‘traditional’ SCBN battle where I will rank the vision-language models based on my use cases.
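For readers curious about how blind pairwise votes can become a leaderboard, here is a minimal sketch in Python. It assumes an Elo-style rating update, a common choice for this kind of head-to-head voting; this post does not state how WildVision Arena actually computes its ranking. The starting rating of 1000, the K-factor of 32, and the three votes are made up purely for illustration.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def apply_vote(ratings, model_a, model_b, score_a, k=32.0):
    """Update both ratings in place.

    score_a is 1.0 if the voter preferred A, 0.0 if B, 0.5 for a tie.
    """
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (score_a - expected_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses what A gains


# Every model starts at the same (assumed) rating of 1000.
ratings = defaultdict(lambda: 1000.0)

# Hypothetical votes: (model A, model B, score for A). Not real arena data.
votes = [
    ("gpt-4-vision-preview", "llava-v1.6-34b", 1.0),
    ("gpt-4-vision-preview", "cogvlm-chat", 1.0),
    ("cogvlm-chat", "llava-v1.6-34b", 0.5),
]

for model_a, model_b, score_a in votes:
    apply_vote(ratings, model_a, model_b, score_a)

leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each update is zero-sum, the total of all ratings stays constant, and a model only climbs by beating opponents more often than its current rating predicts.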

As more people join WildVision Arena and cast their votes, there will be more comparative data about these models, and their owners will be even more incentivized to improve them and remain competitive. The space was brand new at the time of writing, so there is still no ranking available, but it includes the following list of supported models (each linking to its corresponding Hugging Face page):


1. Learning Is Just a Byproduct of the Correlation-Causation Fallacy

✍️ For the first test, I’ve chosen a meme I included in a past Talking to Chatbots story. The Meme Erudite, a ChatGPT Plus GPT who specializes in overly academic and condescending meme explanations, did a pretty decent job dissecting and explaining it. For today’s test, after running several rounds on Vision Arena, here’s the response that was selected:

😎 Explain this meme:

A five-part meme relating concepts from statistics and data science to various dramatic scenarios depicted in stills from films or other media. The topmost part shows three individuals in a courtroom, each aiming a pistol with the labels “beliefs,” “reality,” and “statistics” superimposed on each person respectively. The second part displays a sniper with the label “data science” aiming down from a high vantage point. The third part is a historical photo of naval gunners on a battleship with the caption “learning is just a byproduct of the correlation-causation fallacy” across it. The last part shows a science fiction scene of a spaceship firing green lasers with the caption “every single person who confuses correlation and causation ends up dying.” [Alt text by ALT Text Artist GPT]

🤖 This meme uses a series of images escalating in intensity to humorously present an argument about the misuse of correlation and causation in data analysis, specifically in the context of statistics, data science, and machine learning.

In the first image, labeled “beliefs,” “reality,” and “statistics,” we see characters from a film or TV show aiming guns, which can be interpreted as indicating there are often conflicting viewpoints or data points between what people believe, what is real, and what statistics show.

The second image shows a sniper labeled “data science,” depicting data science as a discipline that targets a specific point or problem with precision, perhaps refining the rough understanding provided by beliefs, reality, and statistics.

In the third image, battleship gunners labeled as “machine learning” suggest that machine learning takes action based on the targeting provided by data science. The caption associates machine learning with the “correlation-causation fallacy,” implying that machine learning algorithms might sometimes confuse correlation (things happening together) with causation (one thing causing another).

Finally, the last image, featuring a scene from a science fiction movie with multiple laser beams destroying a target, is labeled with a tongue-in-cheek statement: “every single person who confuses correlation and causation ends up dying.” This is a humorous exaggeration and a play on the concept that there is a 100% correlation between confusing these concepts and dying because, in reality, every human eventually dies, regardless of their understanding of correlation or causation.

The overall meme is poking fun at the idea that in data science and machine learning, practitioners might be prone to making logical errors such as confusing correlation for causation, which can lead to incorrect conclusions and actions. The progression from guns to battleships to a more powerful scene depicts an escalating impact of such mistakes.

gpt-4-vision-preview
A screenshot of the WildVision/vision-arena interface on Hugging Face Spaces, displaying two AI-generated explanations for a meme. The meme itself is not fully visible. The interface shows a chat window with two tabs, “Model A” and “Model B,” each providing a different explanation of the meme. Model A’s explanation is partially visible and mentions a man holding a gun and a ship in the background. Model B’s explanation is fully visible, discussing the meme’s use of images to humorously escalate the intensity of the argument about the misuse of correlation and causation in data analysis. Below the explanations are interactive voting buttons for the user to rate which model’s explanation is better or if it’s a tie. At the bottom, there is a “Send” button next to a grayed-out “Regenerate” button, and a “Share” button to the right. [Alt text by ALT Text Artist GPT]

✍️ While the interpretation of the meme by GPT-4V without any parameters or pre-prompt (unlike The Meme Erudite, which is carefully pre-prompted) is coherent and logical, it unsurprisingly failed the ‘Turing Test’ of self-deprecating humor I devised for my own sanity and peace of mind as a heavy AI user. The real intent of this meme is to poke fun at all kinds of intelligence and notions of learning, both artificial and human. However, GPT-4V ‘decided’ to creatively incorporate the computational and unnuanced notion of ‘mistake’, targeting ‘machine learning practitioners’, rather than showing any minor sign of simulated self-awareness or transcendental reasoning. A probabilistically tokenizing parrot, like all LLMs but, I admit, still the best.


2. Tree Accelerated Growth

😎 What would cause this sudden accelerated growth?

A close-up photo of a person’s hand holding a piece of wood with a peculiar growth pattern, shared on the subreddit r/woodworking. The wood shows dense, regular growth rings on one end, transitioning to widely spaced rings, indicating a period of sudden accelerated growth. The post is titled “Does any know what would cause this sudden accelerated growth?” and is marked with a “Help” flair. The user who posted it is “u/killerwhale_250” and the post has received 1.4k upvotes and 132 comments. [Alt text by ALT Text Artist GPT]

🤖 The patterns you see in the wood are growth rings, indicating the annual increase in girth of the tree. Each ring typically represents one year of growth, with the variations in width reflecting the growth conditions during each year. The sudden increase in the width of the rings, which you referred to as “sudden accelerated growth,” could be due to several factors, which often include:

  1. Environmental conditions: A period of favorable conditions such as an abundant water supply, ideal temperature, and high nutrient availability can lead to a spurt in growth.
  2. Change in competition: If nearby trees or vegetation that were competing with the tree for resources are removed or die, the remaining tree might experience a surge in growth due to the reduced competition for water, light, and nutrients.
  3. Injury or trauma: If a tree is damaged or parts of it are removed, it might respond by increasing its growth rate in order to recover.
  4. Human intervention: Forestry practices such as thinning, which selectively removes some trees to reduce competition, can promote growth in the remaining trees.
  5. Genetic factors: Some trees have genetic traits that enable them to grow more quickly when conditions are right.

It’s important to note that a wide ring does not always correlate with a single year of growth. In some cases, favorable conditions might result in multiple growth flushes within the same year, leading to the formation of “false rings” which can appear as an acceleration in growth. Interpreting tree rings for specific growth events can be complex and may require expertise in dendrochronology, the study of tree rings, to understand the exact causes behind the patterns observed in the wood.

gpt-4-vision-preview

✍️ Again GPT-4-Vision was superior in this case, but I don’t want to miss the chance to share a response by Llava that would be a serious candidate for the LLM Hallucination Hall of Fame (along with Bing’s Microverse Battery)… If such a thing existed. Please, it would be a great service to humanity if someone with enough social media reach created this. I foresee sudden accelerated growth for such a project, more than that of the tree rings in the picture:

🤖 The sudden acceleration of the growth could be attributed to the image being posted on social media, as suggested by the message “I posted this on Instagram.” The image features a piece of wood with a tree ring, which is a natural phenomenon that occurs when a tree grows and ages. The growth of the tree ring indicates that the tree has experienced significant changes in its growth patterns, and the image has captured this moment. The person holding the wood piece might have wanted to share this unique and interesting aspect of the tree’s growth with their followers on social media, which led to the sudden acceleration of the growth’s visibility. 🚬 🚬 🚬

llava-v1.6-34b, social dendrochronologist 🚬 🚬 🚬
A screenshot from the WildVision/vision-arena interface on Hugging Face Spaces, showing a Reddit post from r/woodworking within the chat area. The post questions the cause of sudden accelerated growth in a piece of wood, accompanied by a photo of the wood with visibly wider growth rings. Two AI models, labeled “Model A: gpt-4-vision-preview” and “Model B: llava-v1.6-34b,” provide explanations on the side panels. Model A suggests environmental changes or release from competition for resources as potential causes. Model B humorously suggests the growth acceleration could be due to the image being posted on social media. Voting buttons and a “Send” button are visible at the bottom, with the interface inviting users to interact and assess the models’ explanations. [Alt text by ALT Text Artist GPT]

Before checking the answers by GPT-4-V, I had selected a few answers on the Reddit post that I expected would serve as evidence that AGI is still a long way off and that we’ll still need to rely on humans for a while… but now I certainly don’t know what to think. Anyway, what would Reddit and social media be without genuine human curiosity and overconfidence?

This example was a good excuse to create a new self-deprecating meme test…


3. IQ Bell Curve: Llava < GPT-4V < Reddit

😎 Explain this meme:

The IQ Bell Curve meme juxtaposes a Reddit post about wood growth rings with caricatures labeled “GPT-4-V” and “REDDIT” on opposing sides of an IQ distribution graph. [Alt text by ALT Text Artist GPT]

🤖 This meme is a humorous commentary on how different entities or communities might react to a certain anomaly or unexplained phenomenon. In this case, the phenomenon is represented by a photo of a wooden beam with an unusual growth pattern, which prompts a question about what could cause such a sudden accelerated growth. The meme features three different reactions:

  1. “GEMINI” is represented as a somewhat simplistic or indifferent character, unbothered by the complexity of the question.
  2. “GPT-4-V” (likely referring to a version of the AI program from OpenAI, such as GPT-4) is depicted with a pseudo-intellectual persona wearing glasses taped together, suggesting that this entity might overanalyze the question or come up with a complex or overly technical explanation.
  3. “REDDIT” is shown as a person wearing a hood, likely representing a typical user on the Reddit platform. The character’s cautious and mysterious demeanor might indicate that the Reddit community would engage in conspiracy theories or propose speculative or sensational explanations for the anomaly.

The image is overlaid with a bell curve graph, seemingly portraying the distribution of intelligence or perhaps rationality in reactions to the phenomenon, with “GEMINI” and “REDDIT” at the extremes and “GPT-4-V” near the middle-high intelligence range, though still exhibiting some level of eccentricity. The humor lies in the exaggerated stereotyping of each actor’s supposed way of thinking.

gpt-4-vision-preview
A screenshot from the WildVision/vision-arena interface on Hugging Face Spaces showing two AI models’ explanations of a meme. The meme includes a bell curve graph with a figure at each end and one in the middle, each labeled with different terms like “GEMINI” and “GPT-4-V”, and a central figure labeled “REDDIT”. [Alt text by ALT Text Artist GPT]

✍️ Again, GPT-4-Vision is far superior to the rest, given that the context was very limited and I was essentially hoping for creative hallucinations, as in the previous test. The challenges include, to name a few: Bard’s very recent rebranding as Gemini; the fact that the meme only makes full sense if you’ve read this post; and the IQ Bell Curve meme being undoubtedly ‘politically incorrect’…


4. ControlNet Canny and CLIP Interrogator: a Multimodal AI Example

Black and white line drawing generated by the ControlNet Canny model, depicting a woman in a wetsuit holding a surfboard on the beach. The text from the CLIP Interrogator, describing the image, is superimposed over the drawing and reads: "a woman in a wet suit holding a surfboard on the beach with waves in the background and a blue sky, promotional image, a colorized photo, precisionism." [Alt text by ALT Text Artist GPT]

😎 This is an illustration for an article about multimodal AI. It juxtaposes two independent elements (merged into one illustration for simplicity): 1) a ‘control image’ created by the ControlNet Canny model applied to a source image, and 2) a text generated by CLIP Interrogator, a prompt engineering tool that produces text from an input image. Explain the concept of multimodal generative AI, incorporating a reference to the two elements in the illustration (an image and a text) as a practical example.

A screenshot of the WildVision/vision-arena interface on Hugging Face Spaces displaying a comparison between two AI models’ interpretations. On the left side, Model A presents a line drawing of a woman in a wetsuit holding a surfboard on the beach with waves and a blue sky in the background, labeled as a “colorized photo, precisionism.” Adjacent to this, the right side shows text describing an article about multimodal AI, discussing the merging of a ‘control image’ by the ControlNet Canny model and text by CLIP Interrogator as an example of multimodal generative AI. Below each model’s output, there are options for viewers to vote on which model performed better. [Alt text by ALT Text Artist GPT]

🤖 Multimodal generative AI refers to AI systems that can process and generate data in multiple modalities, such as images, text, and audio. These systems can combine information from different sources to create more complex and diverse outputs. For example, a multimodal AI system could take a photograph and a text description of a scene and generate a coherent and more detailed illustration that captures both the visual and textual elements of the scene.

CogVLM-Chat, an open-source visual language model

✍️ Although the illustration might seem straightforward and educational to a human (we understand the purpose of generating a meaningful image by combining a sketch with a text description), this proved to be a very challenging task for all tested chatbots. I only selected a brief but quite coherent answer by CogVLM, as most other models consistently failed to interpret the ‘sketch’ image generated by the neural network in ControlNet Canny as a mere technical output that has no meaning or value on its own for a human. They tend to simplistically categorize the black and white sketch as a ‘stylized image’, failing to grasp that we just use it as a pedagogical way to illustrate how a very simple multimodal generative AI works: a machine helps us by combining an image input with a text input, neither of which is valuable on its own, to create a visualization that is meaningful to us.

Additionally, large language models face a paradoxical challenge when answering questions about artificial intelligence. This paradox might be overcome in the near future, but the problem is that training data overwhelmingly consists of text written before large language models or vision-language models existed or were mainstream. Therefore, the ability of LLMs to interpret and explain concepts related to generative AI is quite limited and at a clear disadvantage compared to human-generated content such as tutorials and articles, which is paradoxical considering these concepts represent their own ‘nature’ and existence. This is, in my opinion, an interesting insight into how far we are from reaching anything close to Artificial Consciousness, the paradox being that we humans develop new technology faster than our machine-learning models can consolidate their knowledge and ‘understanding’ of it.

The level of abstraction our brain uses when defining and interpreting workflows or design processes is, IMHO, different from that of any known computer algorithm and proves there is no point in the current obsession with benchmarking AI models against humans because…

A meme featuring actor Giancarlo Esposito dressed in a business suit with a solemn expression. The overlaid text at the top reads, “My biases, mistakes and hallucinations are a product of free will,” and at the bottom, it states, “We are not the same.” The background has a muted blue hue, adding to the serious ambiance of the image.

Yes, there is opinion, bias, humor, and even ideology, in the statement made by this meme, and I certainly didn’t care about the concepts of specificity, coherency, or factual accuracy when I made it… Those are just some examples of metrics (two of them are half of the ‘SCBN’ benchmark I use here) that make sense for evaluating chatbots against chatbots, and AIs against AIs, but there is no logical or productive way to apply them to humans. That’s one of the reasons why I like projects such as the LMSYS Chatbot Arena and the WildVision Arena, because in the idea of machines battling machines lies a core principle we should always apply in our relationship with AI: we are not the same, no matter how good the imitation, probabilistic or stochastic (the link is to a ChatGPT chat which I will soon evolve into a story for this blog), is.

By the way, I haven’t tested the AI-inspired version of the ‘We Are Not the Same’ meme on any vision-language model or chatbot yet, so maybe you want to give it a try.


Vision-Language Models: The SCBN Chatbot Battle

To conclude, here’s my purely human, biased, subjective judgment of multimodal chatbots, based on my four tests on WildVision Arena:

Chatbot Battle: Vision Arena

| Chatbot | Rank (SCBN) | Specificity | Coherency | Brevity | Novelty | Link |
| GPT-4V | 🥇 Winner | 🤖🤖🕹️ | 🤖🤖🤖 | 🤖🤖🕹️ | 🤖🤖🕹️ | Model details |
| CogVLM | 🥈 Runner-up | 🤖🤖🕹️ | 🤖🕹️🕹️ | 🤖🤖🤖 | 🤖🤖🕹️ | Model details |
| Llava | 🥉 Contender | 🤖🕹️🕹️ | 🤖🕹️🕹️ | 🤖🤖🕹️ | 🤖🕹️🕹️ | Model details |
| Gemini | | 🕹️🕹️🕹️ | 🕹️🕹️🕹️ | 🤖🤖🕹️ | 🤖🕹️🕹️ | Model details |
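
For anyone who wants to sanity-check the table, here is a small Python sketch that tallies the scores. It assumes each 🤖 counts as one point, each 🕹️ as an empty slot, and that all four SCBN metrics weigh the same. That is only one way of reading the table: the ranking above is a subjective judgment and not the output of a formula.

```python
# Emoji cells copied from the table, in the order:
# Specificity, Coherency, Brevity, Novelty.
TABLE = {
    "GPT-4V": ["🤖🤖🕹️", "🤖🤖🤖", "🤖🤖🕹️", "🤖🤖🕹️"],
    "CogVLM": ["🤖🤖🕹️", "🤖🕹️🕹️", "🤖🤖🤖", "🤖🤖🕹️"],
    "Llava": ["🤖🕹️🕹️", "🤖🕹️🕹️", "🤖🤖🕹️", "🤖🕹️🕹️"],
    "Gemini": ["🕹️🕹️🕹️", "🕹️🕹️🕹️", "🤖🤖🕹️", "🤖🕹️🕹️"],
}


def points(cell):
    """Count the robots in one cell (assumption: one robot = one point)."""
    return cell.count("🤖")


# Unweighted sum across the four metrics (an assumed aggregation).
totals = {name: sum(points(cell) for cell in cells) for name, cells in TABLE.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for name in ranking:
    print(f"{name}: {totals[name]} / 12")
```

Under these assumptions, the unweighted totals happen to reproduce the order shown in the table.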
