<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Few Shot Learning in Public]]></title><description><![CDATA[This is where I am posting my learning in public about Machine Learning, LLMs, and the rest of this wild new wave of AI systems.]]></description><link>https://www.fewshotlearning.co</link><image><url>https://substackcdn.com/image/fetch/$s_!C45b!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc756204-5170-438c-9656-91fb1496891c_500x500.jpeg</url><title>Few Shot Learning in Public</title><link>https://www.fewshotlearning.co</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:54:39 GMT</lastBuildDate><atom:link href="https://www.fewshotlearning.co/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[KBall]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[fewshotlearning@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[fewshotlearning@substack.com]]></itunes:email><itunes:name><![CDATA[KBall]]></itunes:name></itunes:owner><itunes:author><![CDATA[KBall]]></itunes:author><googleplay:owner><![CDATA[fewshotlearning@substack.com]]></googleplay:owner><googleplay:email><![CDATA[fewshotlearning@substack.com]]></googleplay:email><googleplay:author><![CDATA[KBall]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Shifting from models to applications]]></title><description><![CDATA[I&#8217;ve neglected publishing here for the last year or so, but during that time the focus of my learning about machine learning has shifted away from understanding models themselves, in particular large language models, and towards how to build useful applications using those models.]]></description><link>https://www.fewshotlearning.co/p/shifting-from-models-to-applications</link><guid isPermaLink="false">https://www.fewshotlearning.co/p/shifting-from-models-to-applications</guid><dc:creator><![CDATA[KBall]]></dc:creator><pubDate>Sun, 03 Mar 2024 18:59:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rVQQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7c7f3a-51e9-437e-bc6a-2996b4aacb01_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve neglected publishing here for the last year or so, but during that time the focus of my learning about machine learning has shifted away from understanding models themselves, in particular large language models, and towards how to build useful applications using those models.<br><br>This distinction is one that is kept very deliberately blurry in a lot of the marketing about AI. People talk about ChatGPT, Claude, or Gemini and speak as if each of these is a single model. 
But really the model is a component in a larger architecture - perhaps a central component, but a component nonetheless.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!rVQQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b7c7f3a-51e9-437e-bc6a-2996b4aacb01_1024x1024.webp" width="1024" height="1024" alt=""><figcaption class="image-caption">DALL-E generated image for a diagram with an LLM in the center</figcaption></figure></div><p>Some of the other elements that go into an LLM-based application include:</p><ol><li><p>Some sort of structured user interface and organization for interaction. In the chat-focused applications above this is usually the concept of a thread of messages.</p></li><li><p>A set of (sometimes interacting) prompts and pipes between different pieces of information. In <a href="https://www.youtube.com/watch?v=v5tRc_5-8G4">a YouTube video about Gemini reasoning about user intent</a>, they show how responding to a single user interaction involves an entire workflow with several decision steps. Each of those steps is either an algorithmic process or an interaction with a model, resulting in some structured or unstructured information that gets passed to the next step.</p></li><li><p>Often some sort of structured/known data source that can be searched and fed in as context.</p></li><li><p>Often some sort of conceptual &#8220;memory&#8221; and way to keep track of what has already happened.</p></li></ol><h2>Emerging AI Architectures</h2><p>Fleshing these out a little bit, there are two big architectural patterns that have emerged for building applications around LLMs: Retrieval Augmented Generation (RAG) and Agents.</p><p>A few points about each:</p><h2>Retrieval Augmented Generation</h2><ul><li><p>Core architecture for layering proprietary/domain knowledge into a chat interaction.</p></li><li><p>Essential mental model:</p><ul><li><p>Use the query/prompt to have the LLM generate a search</p></li><li><p>Load relevant documents from the search</p></li><li><p>Use the documents + initial prompt as context for the LLM to generate a response</p></li></ul></li><li><p>Stereotypical example: &#8220;Chat with this project&#8217;s documentation&#8221;</p></li></ul><p>There&#8217;s a <a href="https://blog.langchain.dev/deconstructing-rag/">great article on the langchain blog</a> that goes deeper into how RAG works and its different elements.</p>
<p>Some of the benefits of using RAG:</p><ul><li><p>It&#8217;s a relatively straightforward way to create a unique and valuable chat/textual interaction</p></li><li><p>Can work &#8220;globally&#8221; (e.g. all customers search the same documents) or &#8220;locally&#8221; (e.g. we search content from your specific documents)</p></li><li><p>It has &#8220;better&#8221; factuality / reduced hallucinations</p></li><li><p>There&#8217;s a reduced dependency on training data (relative to fine tuning or other mechanisms of customization)</p></li><li><p>Can link to / directly reference source material.</p></li></ul><p>That last one is key for building traceability and the ability to verify factuality into an AI application. Because the model has no concept of &#8220;truth&#8221; and (at least in today&#8217;s generations) no way to trace back the source material for any particular thing that was generated, connecting to the original source outside the LLM is the only way I have found to create true traceability &amp; the ability to verify factuality.</p><h2>Generative Agents</h2><ul><li><p>Architecture for creating an evolving system.</p></li><li><p>Core components:</p><ul><li><p>Memory stream</p></li><li><p>Reflection/summarization</p></li><li><p>Planning</p></li></ul></li><li><p>Stereotypical example: AI-based game characters that learn and evolve</p></li></ul><p>One of the classic examples of using a generative agent is in <a href="https://arxiv.org/pdf/2304.03442.pdf">this academic research paper</a> that explores an entire game filled with characters that learn, evolve, and interact.</p><p>Variations on agentic approaches are being attempted all over, but most of the well-publicized examples (e.g. <a href="https://github.com/muellerberndt/mini-agi">Mini-AGI</a>) demo well but break down quickly in production.</p><p>Some of the benefits of Generative Agents are that they:</p><ul><li><p>Can learn and improve behavior over time (even without model changes)</p></li><li><p>Maintain a history of previous interactions</p></li></ul>
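<p>To show how those core components might fit together, here&#8217;s a rough sketch of a single agent step. This is my own toy rendering of the memory stream / reflection / planning loop, not the architecture from the paper; every name here is illustrative, and <code>llm.generate</code> is a placeholder for a model call.</p><pre><code># Toy generative-agent step: memory stream + reflection + planning.
# All names are illustrative; llm.generate is a hypothetical model call.
class Agent:
    def __init__(self, llm):
        self.llm = llm
        self.memory = []  # the "memory stream": a running log of observations

    def step(self, observation):
        self.memory.append(observation)

        # Reflection: periodically compress raw memories into higher-level notes.
        if len(self.memory) % 10 == 0:
            summary = self.llm.generate(
                "Summarize the key takeaways from: " + "; ".join(self.memory[-10:])
            )
            self.memory.append("REFLECTION: " + summary)

        # Planning: decide the next action in light of relevant memories.
        relevant = self.memory[-5:]  # stand-in for a real relevance search
        return self.llm.generate(
            "Given these memories: " + "; ".join(relevant) +
            " and this new observation: " + observation +
            ", what should this character do next?"
        )
</code></pre>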
<h2>The Application Layer is the frontier for LLMs</h2><p>At this point the power of the big LLM foundation models is very well established, but what feels much more unproven is how to build actually useful and valuable applications with them.</p><p>The most successful examples so far have been coding assistants (which follow a variation on RAG &#8212; super interesting conversation around this in <a href="https://www.latent.space/p/sourcegraph">this podcast</a>) and summarization assistants for meetings.</p><p>Outside of those, there have been a few niche consumer successes like <a href="https://www.chatpdf.com/">ChatPDF</a> or <a href="https://photoai.com/">PhotoAI</a>, but much of the world still feels like it is grappling with how to use LLMs effectively, with as many <a href="https://arstechnica.com/tech-policy/2024/02/air-canada-must-honor-refund-policy-invented-by-airlines-chatbot/">epic failures</a> as successes.</p><p>There&#8217;s a lot to learn and figure out here&#8230; if you&#8217;re reading this and curious, one last thing you might be interested in is joining the AI in Action discussion group that happens weekly, organized as a part of <a href="https://www.latent.space/">Latent Space</a>. You can find that (and other events like ML-focused paper clubs) at the<a href="https://lu.ma/ls"> Latent Space Luma</a>.</p><p>Thanks for reading Few Shot Learning in Public! Subscribe for free to receive new posts and support my work.</p>]]></content:encoded></item><item><title><![CDATA[Thinking about Latent Space]]></title><description><![CDATA[I think one of the most powerful ways to think about what an LLM is actually doing is this concept of a "Latent Space". What is a latent space? 
Here's...]]></description><link>https://www.fewshotlearning.co/p/thinking-about-latent-space</link><guid isPermaLink="false">https://www.fewshotlearning.co/p/thinking-about-latent-space</guid><dc:creator><![CDATA[KBall]]></dc:creator><pubDate>Wed, 10 May 2023 13:31:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WVj1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd514d54-9fe4-4f83-b49b-eaa4f500d3c1_768x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I think one of the most powerful ways to think about what an LLM is actually doing is this concept of a "Latent Space".</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!WVj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd514d54-9fe4-4f83-b49b-eaa4f500d3c1_768x768.jpeg" width="768" height="768" alt=""><figcaption class="image-caption">Image generated by stable diffusion with the prompt &#8220;3d visualization of &#8216;latent space&#8217;&#8221;</figcaption></figure></div><p>What is a latent space? Here's a description from Chat-GPT (that, yes, checks out with what I've learned):</p><blockquote><p>Latent space refers to an abstract space that represents a compressed or hidden representation of complex data, often high-dimensional data such as images, videos, or sound. In machine learning, the term is frequently used in the context of generative models, such as autoencoders and variational autoencoders, where a latent space is learned by mapping the high-dimensional input data to a lower-dimensional latent space.</p><p>In this lower-dimensional latent space, the data is represented in a more compact and meaningful way, allowing for easier manipulation, exploration, and generation of new data. The latent space is often designed to have desirable properties, such as continuity and smoothness, that make it possible to interpolate between different points in the space, and generate novel data that follows the underlying distribution of the original data.</p></blockquote><p><em>*Interestingly, this was one example where GPT-3.5 did better than GPT-4. The GPT-4 answer got bogged down in too many details and examples and didn't do a great job of explaining the core concept.</em></p><h3>Describing an LLM using the concept of Latent Space</h3><p>Conceptually, an LLM is doing a few things. 
Breaking it down using a transformer model (as I explored in <a href="https://www.fewshotlearning.co/p/trying-to-understand-transformer">https://www.fewshotlearning.co/p/trying-to-understand-transformer</a>), the LLM does the following:</p><ol><li><p>Transform a set of language tokens into a token-based "embedding" or latent space of word concepts.</p></li><li><p>Use this word-based representation plus a positional vector as input to the encoder side of the transformer, which transforms the chunk of token-based embeddings into a location in a different (language concept based? phrase based?) latent space. I'll tentatively call the vectors that describe a position in this latent space "concept vectors".</p></li><li><p>The decoder then essentially applies a mapping function, trained over a large amount of data, to move from the position in latent space described by the current concept vector to the <em>next most likely</em> position in latent space, as described by our current position plus a single token. This mapping function is part of what has been trained over all of the millions of documents of training data.</p></li><li><p>The token is output, and the decoder is run again with it and all previous output tokens added to the input to generate the next token. Steps 3 and 4 are applied iteratively until you reach a stop point.</p></li></ol>
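<p>That iterative loop is simple enough to show in code. Here&#8217;s a minimal sketch, where <code>decode_step</code> is a stand-in for the entire model pass described in steps 1-3 (it maps the current token sequence to the next most likely token); this illustrates the loop itself, not any particular model's API.</p><pre><code># Autoregressive generation loop. decode_step is a hypothetical stand-in
# for the full encode/decode pass described above.
def generate(prompt_tokens, decode_step, stop_token, max_len=256):
    tokens = list(prompt_tokens)
    while len(tokens) &lt; max_len:
        next_token = decode_step(tokens)  # next most likely position/token
        if next_token == stop_token:      # the "stop point"
            break
        tokens.append(next_token)         # feed everything back in and repeat
    return tokens
</code></pre>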
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c82c3a9-1f11-4a1e-b75f-f91aa094cc1c_1258x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:850,&quot;width&quot;:1258,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:775417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RGgh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c82c3a9-1f11-4a1e-b75f-f91aa094cc1c_1258x850.png 424w, https://substackcdn.com/image/fetch/$s_!RGgh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c82c3a9-1f11-4a1e-b75f-f91aa094cc1c_1258x850.png 848w, https://substackcdn.com/image/fetch/$s_!RGgh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c82c3a9-1f11-4a1e-b75f-f91aa094cc1c_1258x850.png 1272w, https://substackcdn.com/image/fetch/$s_!RGgh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c82c3a9-1f11-4a1e-b75f-f91aa094cc1c_1258x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A visualization of a function mapping from one place to another in an abstract space. Such as might happen by prompting an LLM to &#8220;Think step by step&#8221;.</figcaption></figure></div><h3>Implications for what LLMs are and aren&#8217;t good at</h3><p>The latent space encodes both <em>linguistic patterns</em> and <em>knowledge</em>, captured by the training data. 
<h3>Implications for what LLMs are and aren&#8217;t good at</h3><p>The latent space encodes both <em>linguistic patterns</em> and <em>knowledge</em>, captured from the training data. This is what allows an LLM like GPT-4 to not only handle language tasks, but to share and explore knowledge about the world.</p><p>However, it is key to understand the core model: the LLM is mapping over a space that is purely derived from language. When we see LLMs reproducing what we might describe as higher-order reasoning, they're not doing it the same way we might do so. We use language as an <em>interface</em> to other types of mental models, but for LLMs, language is all there is.</p><p>It's hard to reason about what this means in the abstract, so let's look at an example.</p><h3>An example using logical reasoning</h3><p>One of the interesting emergent properties evident in the latest LLMs is what appears to be logical reasoning. Especially when we prompt the LLM with something like "<a href="https://arxiv.org/abs/2201.11903">chain of thought prompting</a>", the LLM appears to be able to reason logically.</p><p><strong>I don't believe that there is an embodied concept of reasoning</strong>, but rather that there is an area of latent space along some dimensions of the concept vector that captures the parts of the training data that look like logical reasoning. If you are able to move the model into that part of the space, it will reproduce what looks like logical reasoning.</p><p>The fidelity of that logical reasoning is still somewhat suspect. Sometimes it works, sometimes it doesn't.</p><p>GPT-4 is way better than prior models at it, but still gets things wrong pretty frequently.</p><p>And I think that is related to this difference in what is happening. The LLM doesn't have an underlying representation of a logical model or logical reasoning that it can somehow "check its work" against. It has instead a high-dimensional vector space with some number of dimensions that represent "logical-like" arguments, and a mapping function that attempts to reproduce those.</p><h3>Implications for LLMs as general purpose agents</h3><p>I think this comes down to a core limitation of LLMs as "general purpose" agents.</p><p>Humans use language as an interface to communicate about other underlying models. Using the logic example, there is a relatively straightforward model for how logic works that can be described as a relatively simple set of rules. When we talk about logical problems, we are mapping between particular situations (described using language) and that underlying logical model.</p><p>LLMs have no such underlying logical model. They have a <em>linguistic</em> model. By training with very large numbers of logical examples, an LLM can get to pretty high fidelity at reproducing what looks like logic, but it is <em>inefficient</em> in that representation relative to our simpler rule-based model, and <strong>it will tend to fail in places where small linguistic changes imply large logical changes</strong>.</p><h3>Does this imply LLMs <em>cannot</em> derive these underlying models?</h3><p>I'm not sure that I'm willing to go that far. There is definitely <a href="https://openreview.net/forum?id=DeG07_TcZvT">research showing the abilities of these models to represent underlying rules and state</a> when properly trained.</p><p>However, it seems pretty clear that the current generation of LLMs has <em>not</em> managed to derive an underlying logical model, or they wouldn't fall victim to the types of mistakes they do now. 
And OpenAI's Sam Altman <a href="https://techcrunch.com/2023/04/14/sam-altman-size-of-llms-wont-matter-as-much-moving-forward/">seems to be indicating that we've reached the end of the gains</a> to be made simply by scaling up models.</p><p>It's possible that we'll be able to train <a href="https://www.latent.space/p/multimodal-gpt4">multi-modal models</a> that address this by training on many different types of data. Apparently GPT-4 can do this to some extent with images &amp; text (but that has not been released to the public generally, leading me to believe it's got a lot of edge cases and issues and needs to be pretty carefully constrained).</p><p>Instead, I think it points to a future that looks much more like the latter half of that article: multiple models wrapped up inside of applications, where large numbers of domain-specific models are integrated together with some sort of interface layer.</p><p>In other words, a lot like our existing AI world. Except what LLMs <em>do</em> provide is an <em>extremely</em> powerful interface layer, where we can ask for what we want using natural language, and the system can interpret that natural language to understand which model is likely to be the best at answering our question.</p><h3>Looking towards the future</h3><p>I'm using this mental framework for a few different things.</p><p>First, to try to better understand what LLMs are and, more importantly, <em>are not</em> going to be good at themselves. I'll flesh this out in future posts, but a broad way of thinking about this is that the better a domain is modeled by language, the better an LLM will do at it. And the more small linguistic changes mean big domain changes, the worse an LLM will do.</p><p>A quick example is around asking for statistics - when asking an LLM about the world (say, what percentage of people have anxiety disorders), the difference between 21% and 42% is extremely small linguistically, but makes a massive difference in our model of the world.</p><p>Second, to think about how to integrate LLMs into applications. I'm looking closely at things I do using text, and trying to figure out applications that take advantage of the LLM's strong understanding of text to make my processes better.</p><p>Third, to try to understand what our risks are of some sort of "True AGI", with either utopian or dystopian outcomes. There are a lot of very smart people who are concerned here, but based on what we've seen I think the LLM advancements are not a massive acceleration in the danger curve.</p><p>They have provided a step-function increase in our ability to parse and do things with natural language, which is an extremely powerful general-purpose technology, and now there is a tremendous rush of people figuring out new ways to apply this technology. That creates massive excitement, lots of new people and money jumping into AI, and probably <em>does</em> increase the likelihood and speed at which we'll arrive at something that looks like AGI.</p><p>But I see no evidence that LLMs themselves have put us near to that, and the projections of "AGI in the next 5 years" are IMO pure hyperbole.</p>
<p>Thanks for reading Few Shot Learning in Public! Subscribe for free to receive new posts and support my work.</p>]]></content:encoded></item><item><title><![CDATA[Trying to understand Transformer Models]]></title><description><![CDATA[The fundamental breakthrough that appears to have led to the current &#8220;Cambrian Explosion&#8221; around language models was the invention of the Transformer architecture.]]></description><link>https://www.fewshotlearning.co/p/trying-to-understand-transformer</link><guid isPermaLink="false">https://www.fewshotlearning.co/p/trying-to-understand-transformer</guid><dc:creator><![CDATA[KBall]]></dc:creator><pubDate>Sat, 06 May 2023 23:50:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!adZI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e25aaf-6a37-4421-a149-c0fee933d8a2_512x512.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The fundamental breakthrough that appears to have led to the current &#8220;<a href="https://twitter.com/karpathy/status/1654892810590650376">Cambrian Explosion</a>&#8221; around language models was the invention of the Transformer architecture. If I&#8217;m understanding it properly, this new way of arranging neural networks dramatically simplified the way we represent contextual information about how a word fits in a sentence, allowing us to encode that context in vectors that can be passed along. This in turn allows these models to take what was once a serial process and parallelize it, processing many tokens at once.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!adZI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2e25aaf-6a37-4421-a149-c0fee933d8a2_512x512.jpeg" width="512" height="512" alt=""><figcaption class="image-caption">Stable Diffusion generated image for &#8216;Transformers inside of a computer&#8217;</figcaption></figure></div><p>This huge efficiency gain allowed much larger models to be trained much more rapidly, and as model size has gone up, there have been <a href="https://swizec.com/blog/eight-things-to-know-about-llms/">both predictable improvements </a><em><a href="https://swizec.com/blog/eight-things-to-know-about-llms/">and</a></em><a href="https://swizec.com/blog/eight-things-to-know-about-llms/"> surprising emergent capabilities</a>.</p><p>So now 
I&#8217;m trying to understand what these transformer models are and how they work. Here&#8217;s what I&#8217;ve got so far; if you&#8217;re reading this and anything doesn&#8217;t sound right please let me know.</p><h3>Transformer Models</h3><p>The <a href="https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">original paper</a> that introduced the approach is pretty dense, and I found myself reading it multiple times to try to understand what different parts mean. I've found <a href="https://jalammar.github.io/illustrated-transformer/">https://jalammar.github.io/illustrated-transformer/</a> to be the most useful resource on transformer models so far. It is phenomenal; go and read it, it will likely do better than I will at explaining this.</p><p>Here's how I'm understanding it: Transformer models have two core pieces, an encoder and a decoder. The encoder starts with an input vector representing a series of tokens and passes it through a series of steps to attempt to "encode" the relationships between those tokens. These encoded values are stored in three vectors (called the 'query', 'key', and 'value' vectors). The decoder then uses these vector representations to predict new symbols, one at a time.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!FFA_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28005176-1d47-4a6e-b1e0-763a42a97e2f_1218x793.png" width="1218" height="793" alt=""><figcaption class="image-caption">Image from https://jalammar.github.io/illustrated-transformer/ showing a simplified explanation of how transformers work</figcaption></figure></div><h3>Encoders</h3><p>The encoder is a stack of identical layers. In the original introduction of the approach, the stack size was 6 layers, though that doesn't seem to be a "magic" number and it's entirely possible different models might use a different number of layers.</p><p>Each layer of the encoder is a feed-forward neural network combined with a 'self-attention layer'. 
For the first layer it takes in the embedding of your text (essentially a mapping of the original words into a numerical vector space) as the input, while each subsequent layer uses the output of the previous layer as its input.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5TEd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddd38a7-8c8c-4e3a-9e2b-939ee781b3b6_792x411.png" width="792" height="411" alt=""><figcaption class="image-caption">Image from https://jalammar.github.io/illustrated-transformer/ showing a simplified explanation of an encoder layer</figcaption></figure></div><p>The self-attention layer takes the vector input and transforms it in three ways, using 3 matrices to create 3 new vectors. The values of those matrices are learned during training. For those familiar with other neural-network-based models, one of these vectors is the direct output of this network layer (called the 'query' in articles about the transformer architecture) while the other two represent hidden state from the neural network (and are called the 'key' and 'value' vectors). There's a small numeric sketch of this at the end of this section.</p><p>These vectors give the model a way to understand, as it analyzes each symbol (probably a word), the importance of any of the other symbols in the sentence.</p><p>For example, if you have a sentence that looks like "I petted my cat and she was very happy", this allows the model to represent the relevance of "my cat" when analyzing the word "she".</p><p>One of the key advances of the transformer model was creating these (relatively) simple representations of how information moves forward from one step to the next through the model, which allows relatively fast computation of a large number of words in parallel. This encoder step essentially turns into a series of wide vector multiplications, which is why GPUs (and TPUs) are the core underlying processing technology driving these models forward.</p><p>However, because words are run in parallel, the model needs a mechanism to capture their position in the input sentence. This is done using a 'positional' encoding, which is a vector that is computed based on the position of the symbol/word and added to the word embedding vector before it is processed by the self-attention layer. Several positional functions have been tried in different approaches. The important characteristics of such a function are that it generates unique values for each position, and that it makes "relative" positioning easy to recover (i.e. the function varies continuously based on position), so that during training the model can learn to incorporate these values.</p>
During training, the model will then be able to incorporate these values in its training.</p><h3>Decoders</h3><p>The decoder phase is pretty similar to the encoder phase, consisting of a set of stacked layers with attention layers and feed-forward neural networks. The difference is that in a decoder, there are two attention layers before the neural network. The first attention layer is fed any symbols already generated by the transformer. In the first time point, this may be an empty vector, but as it generates symbols these become the "prompt" for each additional step. The second attention layer takes the output of this empty vector and incorporates the "key" and "value" vectors generated by the encoder phases. This allows the model to incorporate all context from the original prompt.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!taEG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!taEG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 424w, https://substackcdn.com/image/fetch/$s_!taEG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 848w, https://substackcdn.com/image/fetch/$s_!taEG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 1272w, https://substackcdn.com/image/fetch/$s_!taEG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!taEG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png" width="416" height="261" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50dc9820-053a-40ff-a907-f50851615dc8_416x261.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:416,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20635,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!taEG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 424w, https://substackcdn.com/image/fetch/$s_!taEG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50dc9820-053a-40ff-a907-f50851615dc8_416x261.png 848w, 
<p>Several of these decoder layers are then stacked. At each layer beyond the first, the input is the output of the previous layer, but the "key" and "value" vectors from the encoder are applied in the same way, allowing every layer of the model to incorporate the encoded prompt.</p><p>At the top of the stack, the vector output is run through a process called 'output projection', which consists of a very wide linear layer that maps back from the fixed vector size of the transformer into a vector the size of the token vocabulary (in English this might be every possible word). This new vector is normalized into a probability distribution, and a single predicted symbol is output (either the most probable, or one chosen through some other sampling methodology). Now that there is a new token in the output, the entire decoder stack is run again, with that token (and all previously predicted tokens) as the new input.</p><h3>Creating your own models</h3><p>If you're creating your own model, these core "encoder" and "decoder" abstractions are available in standard machine learning libraries such as pytorch and tensorflow. As much as I wanted to know what they were doing internally, to use them we can treat them as black boxes.</p>
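<p>As a sketch of what treating them as black boxes might look like, here is a toy (untrained, hypothetical) encoder/decoder setup using pytorch's built-in transformer modules, together with a greedy version of the output projection and decoding loop described above. The hyperparameters and the start-of-sequence token id are placeholders, and a real training setup would also need a causal mask in the decoder.</p><pre><code class="language-python"># A toy (untrained) encoder/decoder built from pytorch's standard modules,
# treated as black boxes, plus a greedy decoding loop that applies the
# output projection described above at each step.
import torch

d_model, nhead, num_layers, vocab_size = 64, 4, 2, 1000

embed = torch.nn.Embedding(vocab_size, d_model)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
    num_layers=num_layers,
)
decoder = torch.nn.TransformerDecoder(
    torch.nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
    num_layers=num_layers,
)
# Output projection: a wide linear layer mapping the model's fixed
# vector size back to one score per token in the vocabulary.
to_vocab = torch.nn.Linear(d_model, vocab_size)

BOS = 0  # placeholder start-of-sequence token id
prompt = torch.randint(0, vocab_size, (1, 8))  # (batch, sequence) of token ids

with torch.no_grad():
    memory = encoder(embed(prompt))            # encoded prompt; every decoder
    generated = torch.tensor([[BOS]])          # layer will attend to `memory`
    for _ in range(20):
        h = decoder(embed(generated), memory)  # re-run the stack each step
        probs = torch.softmax(to_vocab(h[:, -1]), dim=-1)  # distribution over tokens
        next_token = probs.argmax(dim=-1, keepdim=True)    # greedy: most probable
        generated = torch.cat([generated, next_token], dim=1)
</code></pre>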
<h3>Embedding</h3><p>There is one more key concept needed to understand how these things are working under the hood.</p><p>Before and after these phases, we need to do some sort of translation between the original form of the input data (human language, for a large language model) and the vector representation these models understand. This is done via 'embedding', which is a term used to describe the mapping of any form of input into a vector space. In the case of words, this maps from free text inputs into a vector of continuously varying numeric values.</p><p>There are a variety of embedding algorithms out there, but some of the goals of a good embedding algorithm are to capture semantic similarity (two words with similar meanings should result in similar values in the embedding space) and to have lower dimensionality than raw text (in fact, as low a dimension as possible, so we can pack the most meaning into a single vector operation).</p><p>These embedding algorithms are themselves often machine learning models that have been trained on a wide range of data.</p><p>OpenAI exposes embedding directly as an API. There are also a variety of other options - <a href="https://python.langchain.com/en/latest/modules/models/text_embedding.html">Langchain includes models for interacting with 13 different embedding approaches</a> as of this writing. If I'm understanding things correctly, for large language models these typically consist of a word/token-level embedding added to a positional embedding (to let the model understand where a token is relative to the tokens around it).</p><p>Embeddings are the "translation" layer between human language and the deep learning models that are the engines of LLMs like GPT-3 and GPT-4.</p>
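<p>As one example of what using an embedding API might look like, here is a sketch using OpenAI's Python client (the v1+ interface); the model name is one of OpenAI's embedding models, and the cosine-similarity check is my own illustration of the "semantic similarity" property rather than part of any particular library.</p><pre><code class="language-python"># A sketch of the embedding "translation" step using OpenAI's Python
# client (v1+ interface). The cosine check illustrates the semantic
# similarity property: similar meanings should land close together.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_text(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, kitten, gdp = map(embed_text, ["cat", "kitten", "quarterly GDP report"])
print(cosine_similarity(cat, kitten))  # similar meanings -> higher score
print(cosine_similarity(cat, gdp))     # unrelated meanings -> lower score
</code></pre>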
<h3>Next steps</h3><p>I&#8217;m not sure I need to dig any deeper on transformers themselves; I like having an understanding of how things work, but I&#8217;m more interested in applications than in building new models. Next week I&#8217;ll be digging into building my first LLM-based application using LangChain.<br><br>As always, if something in this article doesn&#8217;t match your knowledge or understanding, please let me know! And send me any recommended reading or listening to help me learn faster.</p>]]></content:encoded></item><item><title><![CDATA[Dimensions for thinking about AI Model applicability]]></title><description><![CDATA[I was listening to the 2nd episode of the Latent Space podcast, and really loved the way Varun Mohan from Codeium broke down some dimensions for thinking about LLMs, or models in general, and how they need to work in different domains.]]></description><link>https://www.fewshotlearning.co/p/dimensions-for-thinking-about-ai</link><guid isPermaLink="false">https://www.fewshotlearning.co/p/dimensions-for-thinking-about-ai</guid><dc:creator><![CDATA[KBall]]></dc:creator><pubDate>Fri, 05 May 2023 16:42:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RL-O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347cb490-158e-4cf4-822d-fccfc831bb4f_512x512.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was listening to the <a href="https://www.latent.space/p/varun-mohan">2nd episode of the Latent Space podcast</a>, and really loved the way Varun Mohan from Codeium broke down some dimensions for thinking about LLMs, or models in general, and how they need to work in different domains.</p><p>In particular, he laid out the dimensions of latency, quality, and correctability.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RL-O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F347cb490-158e-4cf4-822d-fccfc831bb4f_512x512.jpeg" width="512" height="512" alt="A cube graph with what looks like spiderwebs running through it"><figcaption class="image-caption">Blog image generated using Stable Diffusion with prompt &#8220;three dimensional cube graph with an artificial intelligence in the background&#8221;</figcaption></figure></div>
<p>These dimensions form axes that may vary in importance based on the domain.
By manipulating them, we can tailor our approach to meet the specific needs of a given situation. Interestingly, latency and quality tend to trade off directly against each other in the model creation process. If you can't lower the latency enough while keeping the quality you need, you may need to bring in latency-hiding techniques. Correctability, on the other hand, is linked to the model's intended use and how it's integrated into a user experience.</p><p>To think about this, let&#8217;s flesh out how these axes might play out in a few different current hot AI domains.</p><h3><strong>Code Generation</strong></h3><p>In the realm of code generation, particularly when integrated into an editor, the priority is to maintain low latency. You need to avoid disrupting the user's flow, similar to how a fast compile/reload/test cycle prevents the loss of mental context while waiting for newly written code to be tested. The same principle applies to codegen, where a swift prompt/output cycle allows for a higher level of cognitive engagement with the functions being written.</p><p>Quality is of moderate importance here. If getting generated code to work is more effort than writing the code myself, the tool is not adding value. However, it can still provide significant value even if it's only effective for relatively boilerplate cases.</p><p>Correctability feels like a lever you can play with: the more the quality goes up, the less correctability you'll need. Correctability in code is high if the tool is suggesting small changes directly in your editor. The larger the blocks that are generated at once, and the more steps that are automated (e.g. creating full pull requests), the harder the output is to correct.</p><h3><strong>Writing Blog Posts</strong></h3><p>When it comes to blog post writing, latency is only moderately important. If it takes five minutes to generate a draft post, that's still significantly faster than my current writing pace.</p><p>Quality is important to me, but it's a sliding scale. If the output quality is too low, the model is not useful. But above a certain threshold, the higher the quality, the less editing I have to do.</p><p>Correctability is the key factor in this domain. I wouldn't want to use a model that pushes content directly from generation to publication. Instead, I prefer a model that generates a draft I can review and edit.</p><p>I used ChatGPT to help me with drafting this post, going from bullet points to a first draft, but it required substantial editing before I was happy with it. I would be terrified of an approach that pushed articles directly from a model to publication.</p><h3><strong>Learning</strong></h3><p>In learning scenarios, latency is important up to a point. A real-time question-answer cycle can create an exploratory flow, similar to the flow in coding.</p><p>Quality is paramount here. We want to ensure the information we're learning is accurate.</p><p>Correctability can be challenging in this domain. How do we know if what we've learned needs to be corrected?</p><p>I think the learning use case introduces the need for an additional dimension: 'traceability' or 'verifiability'.</p><h3>Verifiability</h3><p>When using models to learn new concepts or facts, it's crucial to validate against independent sources to ensure we're learning something factual, not something confabulated by the machine.
For this use case (and probably for others), we need some way to understand how the model generated its output, and a way to validate it.</p><p>Using "Cite your sources" in the prompt with GPT-4 seems to work reasonably well for this, at least in some topic areas. It worked remarkably well for me when I was learning about foundation models: it pointed me to actual papers that substantiated the summaries ChatGPT provided. However, it fell short when I was trying to explore mental health outcome data: the model made up some numbers, and then made up some sources that did not actually exist. Even so, this at least let me know not to trust the information it had given me.</p>
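<p>For reference, here is roughly how that pattern might look in code, using OpenAI's Python client (v1+ interface); the instruction wording is just what I would try, and as noted above, any citations the model returns still need to be checked against the actual sources.</p><pre><code class="language-python"># A sketch of the "cite your sources" prompting pattern described above,
# using OpenAI's Python client (v1+ interface). The instruction wording is
# illustrative; returned citations must still be independently verified.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Answer the question, and cite your sources: include "
                    "the title and a link for each factual claim."},
        {"role": "user", "content": "What are foundation models?"},
    ],
)
print(response.choices[0].message.content)  # check that each source exists!
</code></pre>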
<h3>Working through some examples</h3><p>For example, we can use this framework to evaluate search as a domain. We&#8217;ve seen initial stumbles here (Bing Chat comes to mind), and one way we can think of this is that in search we need low latency, correctness, and verifiability, especially when we&#8217;re searching for factual things. But current language models are poor at delivering low latency and high correctness at the same time, and they are not particularly good at verifiability. This indicates that applying them directly to replace search will be difficult.</p><p>Similarly, take an example from the OpenAI plugins: ordering plane tickets online using a chatbot. In this case latency may not be super important, but quality and correctability are key. And correctability may even supersede quality, because when I'm spending money on plane tickets, even if I get the best possible answer, it may be that the price is beyond what I'm willing to spend, and I need to correct it. There are also multiple dimensions of "quality" (price, time, connections) that I may want to optimize for, and which dimension I want to choose may vary as I see the possibilities.</p><p>Does this imply we shouldn&#8217;t use a chatbot for ordering plane tickets? Not necessarily, but we&#8217;ll need to build in lots of affordances (whether directly through chat or as an additional step) that allow for correcting the generated response before spending money.</p><h3>Moving forward</h3><p>Right now one of the biggest challenges in the AI space is understanding where the edges are. What will these machines be good at, and what won&#8217;t they be?<br><br>We have some fundamentally new capabilities available to us; how can we use them to create new products and services that will make our lives better?</p><p>I don&#8217;t think we&#8217;ve got that anywhere close to mapped out, but these dimensions provide a lens for thinking about possible new domains for AI. As we consider them, we can use the dimensions of latency, quality, correctability, and possibly verifiability to determine what constraints we have on our building, and whether the tools we have available will be a good fit at all.</p>]]></content:encoded></item><item><title><![CDATA[Learning in public about LLMs and AI]]></title><description><![CDATA[I&#8217;m deep diving down the LLM and AI rabbit hole, and taking inspiration from the one and only swyx, I&#8217;m going to learn in public. This substack is intended to document my progress, share my learnings, and give plenty of opportunities for folks who know more to correct me and push me in better directions.]]></description><link>https://www.fewshotlearning.co/p/coming-soon</link><guid isPermaLink="false">https://www.fewshotlearning.co/p/coming-soon</guid><dc:creator><![CDATA[KBall]]></dc:creator><pubDate>Fri, 05 May 2023 15:26:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!C45b!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc756204-5170-438c-9656-91fb1496891c_500x500.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m deep diving down the LLM and AI rabbit hole, and taking inspiration from the one and only <a href="https://twitter.com/swyx">swyx</a>, I&#8217;m going to <a href="https://www.swyx.io/learn-in-public">learn in public</a>. This substack is intended to document my progress, share my learnings, and give plenty of opportunities for folks who know more to correct me and push me in better directions.<br><br>Subscribe to follow along and learn with me, but regardless don&#8217;t hesitate to reply or send me notes telling me I&#8217;m wrong, pointing me in new directions, or otherwise helping in this learning journey.</p>]]></content:encoded></item></channel></rss>