Dimensions for thinking about AI model applicability
I was listening to the 2nd episode of the Latent Space podcast, and really loved the way Varun Mohan from Codeium broke down some dimensions for thinking about LLMs, or models in general, and how they need to work in different domains.
In particular, he laid out the dimensions of latency, quality, and correctability.
These dimensions form axes whose importance varies by domain. By weighing them, we can tailor our approach to the specific needs of a given situation. Interestingly, latency and quality tend to trade off directly in the model creation process. If you can't get latency low enough while keeping the quality you need, you might need to institute some latency-hiding techniques. Correctability, on the other hand, is linked to how the model is meant to be used and how it's integrated into a user experience.
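To make the framework concrete, here's a minimal sketch of one way you might profile a domain along these axes. The structure and the ratings are my own illustrative judgments (drawn from the analysis below), not measurements:

```python
from dataclasses import dataclass

@dataclass
class DomainProfile:
    """How much each dimension matters for an AI application.

    Ratings run from 1 (barely matters) to 5 (critical); the values
    below are illustrative judgments, not measurements.
    """
    name: str
    latency: int         # how costly is a slow response?
    quality: int         # how good does the raw output need to be?
    correctability: int  # how easily can a human fix the output?

profiles = [
    DomainProfile("code generation", latency=5, quality=3, correctability=4),
    DomainProfile("blog drafting",   latency=2, quality=3, correctability=5),
    DomainProfile("learning",        latency=3, quality=5, correctability=2),
]
```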
To think about this, let's flesh out how these axes might play out in a few of today's hot AI domains.
Code Generation
In the realm of code generation, particularly when integrated into an editor, the priority is low latency. You need to avoid disrupting the user's flow, much as a fast compile/reload/test cycle prevents the loss of mental context while waiting for new code to run. The same principle applies to codegen: a swift prompt/output cycle allows a higher level of cognitive engagement with the functions being written.
Quality is of moderate importance here. If getting generated code to compile takes more effort than writing the code myself, the tool is not adding value. However, it can still provide significant value even if it's only effective for relatively boilerplate cases.
Correctability feels like a lever you can play with: the higher the quality, the less correctability you'll need. Correctability in code is high when the model suggests small changes directly in your editor. The larger the blocks generated at once, and the more steps that are automated (e.g. creating full pull requests), the harder the output is to correct.
Writing Blog Posts
When it comes to blog post writing, latency is only moderately important. If it takes five minutes to generate a draft post, that's still significantly faster than my current writing pace.
Quality is important to me, but it's a sliding scale. If the output quality is too low, the model is not useful. But above a certain threshold, the higher the quality, the less editing I have to do.
Correctability is the key factor in this domain. I wouldn't want to use a model that pushes content directly from generation to publication. Instead, I prefer a model that generates a draft I can review and edit.
I used ChatGPT to help draft this post, going from bullet points to a first draft, but it required substantial editing before I was happy with it.
Learning
In learning scenarios, latency is important up to a point. A real-time question-answer cycle can create an exploratory flow, similar to the flow in coding.
Quality is paramount here. We want to ensure the information we're learning is accurate.
Correctability can be challenging in this domain. How do we know if what we've learned needs to be corrected?
I think the learning use case introduces the need for an additional dimension: "traceability" or "verifiability".
Verifiability
When using models to learn new concepts or facts, it's crucial to validate against independent sources to ensure we're learning something factual, not confabulated by the machine. For this use case (and probably for others), we need some way to understand how the model generated its output, and a way to validate it.
Using "Cite your sources" in the prompt with GPT-4 seems to work reasonably well for this, at least in some topic areas. It worked remarkably well for me when I was learning about foundation models, as it pointed me to actual papers that substantiated the summaries provided by chatGPT. However, it fell short when I was trying to explore mental health outcome data—the model made up some numbers, and then made up some sources. The sources did not actually exist. However this at least let me know not to trust the info it had given me.
Working through some examples
For example, we can use this framework to evaluate search as a domain. We've seen initial stumbles here (Bing Chat comes to mind), and one way to think about it is that search demands low latency, high quality, and verifiability, especially when we're searching for factual things. But current language models are poor at delivering low latency and high quality at the same time, and they are not particularly good at verifiability. This suggests that applying them directly to replace search will be difficult.
Similarly, take an example from the OpenAI plugins: ordering plane tickets online using a chatbot. In this case latency may not be super important, but quality and correctability are key. Correctability may even supersede quality: when I'm spending money on plane tickets, even the best possible answer may come at a price beyond what I'm willing to spend, and I need to be able to correct it. There are also multiple dimensions of "quality" (price, time, connections) that I may want to optimize for, and which dimension I choose may vary as I see the possibilities.
Does this imply we shouldn’t use a chatbot for ordering plane tickets? Not necessarily, but we’ll need to build in lots of affordances (whether directly through chat or as an additional step) that allow for correcting the generated response before spending money.
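As a minimal sketch of what such an affordance could look like, here's the shape of a confirm-before-commit flow; the helper functions are hypothetical placeholders, not any real booking API:

```python
def generate_itinerary(request: str) -> str:
    # Hypothetical placeholder for a model + flight-search call.
    return f"Cheapest option matching {request!r}"

def purchase(itinerary: str) -> None:
    # Hypothetical placeholder for the irreversible, money-spending step.
    print(f"Booked: {itinerary}")

def book_with_confirmation(request: str) -> None:
    """Insert an explicit human confirmation step between the generated
    response and the purchase, so mistakes can be corrected before any
    money is spent."""
    proposal = generate_itinerary(request)
    print(f"Proposed: {proposal}")
    answer = input("Book this? (yes, or describe what to change): ")
    if answer.strip().lower() == "yes":
        purchase(proposal)
    else:
        # Fold the user's correction back into the request and retry.
        book_with_confirmation(f"{request}\nCorrection: {answer}")
```

The key design choice is that the money-spending step is never reachable without an explicit human "yes", and anything else the user types becomes a correction rather than an error.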
Moving forward
Right now one of the biggest challenges in the AI space is understanding where the edges are. What will these machines be good at, and what won't they be good at?
We have some fundamentally new capabilities available to us; how can we use them to create new products and services that will make our lives better?
I don't think we've got that anywhere close to mapped out, but these dimensions provide a lens for thinking about new possible domains for AI. As we consider them, we can use latency, quality, correctability, and possibly verifiability to determine what constraints we have on what we build, and whether the tools we have available will be a good fit at all.