
Looking for resources to better understand LLMs

I do not believe that LLMs are intelligent. That being said, I have no fundamental understanding of how they work. I hear and often regurgitate things like "language prediction," but I want a more specific grasp of what's going on.

I've read great articles/posts about the environmental impact of LLMs, their dire economic situation, and their dumbing-down effects on people/companies/products. But the articles I've read that ask questions like "can AI think?" basically just go "well, it's just language, and language isn't the same as thinking, so no." I haven't been satisfied with this argument.

I guess I'm looking for something that dives deeper into that type of assertion that "LLMs are just language" with a critical lens. (I am not looking for a comprehensive lesson on the technical side of LLMs because I am not knowledgeable enough for that; some Goldilocks zone would be great.) If you have any resources you would recommend, please let me know. Thanks!

18 comments
  • I highly recommend this:

    https://rti.github.io/gptvis

    It explains the fundamentals of a transformer network (which all current LLMs are based on) using a tiny, down-to-basics example network, letting you follow what is happening within the network step by step rather than being confronted with theoretical concepts or tonnes of linear algebra.

    It's really nice, and about as hands-on as these things get.

  • "LLMs are just language"

    LLMs are not even language. They're just functions created from data and statistics. In this case, the data is "writing", but that doesn't really matter since it's all stored and computed as numbers. It's a similar process for functions that generate images, etc. There's no evidence or reason why "intelligence" would manifest from completely straightforward computations. So the grift that a function is somehow "intelligent" is completely detached from humdrum computational reality.

  • There is no magic conclusion to be drawn from "they are just token prediction machines". There are reasoning and search agents and other tools that can improve the odds that the predicted tokens land on the right final answer. There are other ML tools and hardware innovations that make them faster, and so let them "think longer" before giving an answer.

    These tools are likely to keep improving their "correct answer rate" without ever achieving a zero "dumb error" rate.

  • The question of “do they think” is a little complicated because I don’t think there is a clear enough definition of what counts as “thinking” to say. This discussion should be independent of the quality of LLM results.

    As for what they are actually doing:

    Imagine a mathematical function that takes in a series of numbers and spits out the next number in that series.

    A “neural network” is just a fairly general mathematical model that can describe almost any function, and using curve-fitting techniques we can approximate the number-pattern function described above.

    Now assign each letter to a number and define the pattern as a large block of text consisting of almost the entire internet.

    Now that we’ve trained our mathematical model, we can give it some text and let it complete it, and it will produce a somewhat reasonable answer.

    There are more mathematical and computational tricks going on, and a couple more steps to get from a completion model to a conversational one, but this is the gist of how it works.
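
    To make that concrete, here's a minimal sketch of the "predict the next number in a series" idea using plain curve fitting instead of a neural network (the number series and the straight-line fit are made up purely for illustration):

        # Toy version of "fit a function to a pattern, then ask it for the next value".
        # numpy.polyfit stands in for the far more elaborate fitting a neural network does.
        import numpy as np

        series = [2.0, 4.1, 5.9, 8.2, 10.0]   # a made-up number pattern
        positions = np.arange(len(series))

        # "Training": fit a curve (here just a straight line) to the known pattern.
        coeffs = np.polyfit(positions, series, deg=1)

        # "Completion": evaluate the fitted curve at the next position.
        print(np.polyval(coeffs, len(series)))   # roughly 12, continuing the pattern

    Text completion is the same idea with letters mapped to numbers and a vastly more flexible function doing the fitting.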

  • it's a whole branch of mathematics. looking at it from a pure language perspective isn't really useful, because language models don't really work in language. they work in text. "llms are just language" is misleading because language implies a certain structure, while language models use a completely different structure.

    i don't have any proper sources but here's a quick overview off the top of my head:

    a large language model is a big pile of vectors (a vector here is basically a list of numbers). the "number of parameters" in a machine learning model is the total count of those numbers across all the vectors; the length of each individual vector (its number of dimensions) is much smaller. these vectors represent coordinates on an n-dimensional "map of words". words that are related are "closer together" on this map. once you have this map, you can use vector math to find word associations. this matters in practice because vector math is all hardware accelerated (thanks to 3D graphics).
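
    to make the "map" idea concrete, here's a tiny sketch with a made-up vocabulary and made-up 3-dimensional vectors (real models learn their vectors from data and use hundreds or thousands of dimensions per word):

        # toy "map of words": related words get similar vectors, and vector math
        # (cosine similarity, addition/subtraction) exposes the associations
        import math

        embeddings = {
            "king":   [0.9, 0.8, 0.1],
            "queen":  [0.9, 0.2, 0.1],
            "man":    [0.1, 0.8, 0.0],
            "woman":  [0.1, 0.2, 0.0],
            "banana": [0.0, 0.1, 0.9],
        }

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm

        print(cosine(embeddings["king"], embeddings["queen"]))   # close together on the map
        print(cosine(embeddings["king"], embeddings["banana"]))  # far apart

        # the classic association trick: king - man + woman lands nearest to queen
        target = [k - m + w for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
        print(max(embeddings, key=lambda word: cosine(embeddings[word], target)))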

    the training process builds the map by looking at how words and concepts appear in the input data and adjusting the numbers in the vectors until they fit. the more data, the more general the resulting map. the inference process then uses the input text as its starting point and "walks" the map.

    the emergent behaviour that some people call intelligence stems from the fact that the training process makes "novel" connections. words that are related are close together, but so are words that sound the same, for example. the more parameters a model has, the more connections it can make, and vice versa. this can lead to the "overfitting" problem, where the amount of input data is so small that the only associations are the ones from the actual input documents. using the map analogy, there may exist particular starting points where there is only one possible path: the data is not actually "in" the model, but it can be recreated exactly. the opposite can also happen, where there are so many connections for a given word that the actual topic can't be inferred from the input and the model just goes off on a tangent.

    why this is classed as intelligence i could not tell you.

    Edit: replaced some jargon that muddied the point.


    something related: you know how compressed jpegs always have visible little squares in them? jpeg compression works by slicing the image into little squares, describing each square as a mix of mathematical patterns (via the discrete cosine transform), and then throwing away the patterns that contribute the least detail. the more you compress, the more detail gets thrown away from each square, so the squares stop matching their neighbours and become visible.
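
    if you want to poke at that yourself, here's a rough sketch of the block-DCT idea (assumes numpy and scipy are installed; real jpeg quantizes the coefficients rather than zeroing them outright):

        # take an 8x8 block, express it as frequency patterns, keep only the
        # lowest-frequency ones, and transform back: a blocky approximation
        import numpy as np
        from scipy.fft import dctn, idctn

        rng = np.random.default_rng(0)
        block = rng.integers(0, 256, size=(8, 8)).astype(float)  # stand-in image block

        coeffs = dctn(block, norm="ortho")    # the block as a mix of frequency patterns
        kept = np.zeros_like(coeffs)
        kept[:3, :3] = coeffs[:3, :3]         # throw away the high-frequency detail
        approx = idctn(kept, norm="ortho")    # what the compressed block looks like

        print(np.abs(block - approx).mean())  # average error introduced by the compression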

    you can do this with text models as well. increasing jpeg compression is like lowering the number of parameters. the fewer parameters, the worse the model. if you compress too much, the model starts to blend concepts together or mistake words for one another.

    what the ai bros are saying now is that if you go the other way, the model may become self-aware. in my mind that's like saying that if you make a jpeg large enough, it will become real.

    • thank you for the lengthy response. I have some experience with vectors but not necessarily piles of them, though maybe I do have enough of a base to do a proper deep dive.

      I think I am grasping the idea of the n-dimensional "map" of words as you describe it, and I see how the input data can be taken as a starting point on this map. I am confused when it comes to walking the map. Is this walking, i.e. the "novel" connections of the LLMs, simply mathematics dictated by the respective values/locations of the input data, or is it more complicated? I have trouble conceptualizing how just manipulating the values of the input data could lead to conversational abilities. Is it that the mapping is just absurdly complex, like the vectors having an astronomical number of dimensions?

      • so for the case of inference, e.g. talking to chatgpt, the model is completely static. training can take weeks to months, so the model does not change when it's in use.

        the novel connections appear in training. it's just a matter of concepts being unexpectedly close together on the map.

        the mapping is absurdly complex. the individual vectors have thousands of dimensions, and the model as a whole has hundreds of billions of parameters. i don't know the exact size of chatgpt's models but i know they're at least an order of magnitude or two larger than what can currently be run on consumer hardware. my computer can handle models with around 20 billion parameters before they no longer fit in RAM, and it's pretty beefy.
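
        if you want a feel for the sizes involved, here's the back-of-the-envelope memory math (illustrative only; actual usage also depends on context length and runtime overhead):

            # rough memory footprint of a model's weights at different precisions
            def weights_gb(n_params, bytes_per_weight):
                return n_params * bytes_per_weight / 1e9

            for label, bytes_per_weight in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
                print(label, weights_gb(20e9, bytes_per_weight), "GB for a 20B-parameter model")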

        as for the conversational ability, the inference step basically works like this:

        1. convert the input into a vector
        2. run the input through the model's weights
        3. out come the top three or so most probable next words
        4. select one of them semi-randomly
        5. append the chosen word to the input
        6. goto 1.

        models are just giant piles of weights that get applied over and over to the input until it morphs into the output. we don't know exactly how the vectors correspond to the output, mostly because there are just too many parameters to analyse. but what comes out looks like intelligent conversation because that's what went in during training. the model predicts the next word, or the next location on the map as it were, and most of the text it was trained on is grammatically correct and intelligent-sounding, so statistically speaking its output will sound intelligent too. assuming that it's somehow self-aware is a lot harder when you actually see it do the loop-de-loop thing of farting out a few words with varying confidence levels and then selecting one at random.
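
        here's a toy version of that loop, with a hard-coded lookup table of made-up probabilities standing in for the billions of weights a real model would apply:

            import random

            def fake_model(words):
                # stand-in for step 2: a real model runs the input through its weights;
                # this just looks up a few made-up next-word probabilities
                table = {
                    "the": [("cat", 0.5), ("dog", 0.3), ("weather", 0.2)],
                    "cat": [("sat", 0.6), ("slept", 0.4)],
                    "dog": [("barked", 0.7), ("sat", 0.3)],
                    "sat": [("quietly", 1.0)],
                }
                return table.get(words[-1], [("the", 1.0)])

            words = ["the"]
            for _ in range(6):
                candidates = fake_model(words)                 # steps 2-3: probable next words
                tokens, weights = zip(*candidates)
                words.append(random.choices(tokens, weights=weights)[0])  # step 4: semi-random pick
                # steps 5-6: the chosen word is appended and the loop repeats
            print(" ".join(words))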

        my experience with this is more focused on images, which i think makes it easier to understand because images are more directly multidimensional than text.
        when training an image generation model, you take an input image and an accompanying text description. you then repeatedly add noise to the image until it's nothing but noise (you "diffuse" it).
        at every step, the weights are adjusted so the model learns to undo what that noising step did, with the text description as a guide.
        the result is two of those maps: one for words, one for images, both with matching "topography".
        when you generate an image, you give some text as coordinates on the "word map" and an image consisting of only noise as coordinates on the "image map", then ask the model to walk towards the word coordinates. you then update your image to match the new coordinates and go again. basically, you're asking "in the direction of this text, what came before this image in the training data?" over and over again.
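
        to illustrate just the "walk from noise towards the text coordinates" picture, here's the analogy with plain 2D points standing in for real image and text embeddings (this is the analogy only, not an actual diffusion model):

            import random

            text_coords = (0.8, 0.3)                    # where the prompt sits on the "word map"
            image = (random.random(), random.random())  # pure noise on the "image map"

            for step in range(20):
                # nudge the current point a little way towards the target, standing in for
                # "in the direction of this text, what came before this image?"
                image = tuple(x + 0.2 * (t - x) for x, t in zip(image, text_coords))

            print(image)  # ends up close to the text coordinates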

    • "in my mind that's like saying that if you make a jpeg large enough, it will become real."

      This is such an excellent analogy. Thank you for this!

  • Here are a few I’ve liked.

    Transformers from Scratch: https://brandonrohrer.com/transformers.html

    Goes through the mathematics behind transformer models, starting from the basics. I think it’s a good learning resource to get up to speed with the inner workings.

    The Welch Labs YouTube channel is a gold mine: https://youtube.com/@welchlabsvideo

    He’s very good at visualizing deep learning models, and goes far beyond other resources.

    Mapping the Mind of a Large Language Model: https://www.anthropic.com/research/mapping-mind-language-model

    I think this paper is quite interesting. It gives a clue about how "language" is represented inside these models. What they found is a "Golden Gate Bridge" feature in the Claude model; amplifying that feature makes the model want to bring the Golden Gate Bridge into any conversation.

  • They're just statistical prediction models being leveraged for a specific purpose. Maybe start by reading up on Markov chains (there's a minimal sketch at the end of this comment). There's no awareness, actual thought, or creativity involved. And that's the root of many of the issues with them. You've no doubt heard of their tendency to 'hallucinate' facts?

    There was this story a while ago - can't find it now, sorry - where somebody had asked an LLM to generate a bicycle schematic with all major parts labelled. The result wasn't merely lackluster -- it was hilariously wrong: extra stacked handlebars, small third wheels, unconnected gears, etc. The issue is that when you query a trained LLM for an output, the only guarantee is that said output is statistically close to an amalgamation of the inputs on which it was trained; that's all. In other words, said LLM had been trained on a plethora of Internet discussions, wikis, and image sets of mountain bikes, city bicycles, tandem bikes, unicycles from that one circus performer forum, and on and on. All these things and more informed the model's conception of what a 'bicycle' is. But because there's no actual reasoning going on, when asked to generate a bicycle, it emitted something with the same average statistical properties as all of those things munged up together. It should not be surprising that the result didn't match what any sane human would consider a bicycle. And that's all "hallucinations" are.

    I once read a book by a lecturer from the Danish Academy of Fine Arts. In it, the author raised a number of very good points, one of which I think is pertinent here. He pointed out that when he asked novices to draw A) a still life of a head of cauliflower and B) a portrait of a person, and then asked them to self-evaluate how well they did, all of them consistently thought they did better with the cauliflower. That, of course, wasn't the case. The truth was that they sucked at both equally; it's just that we've evolved to be very good at recognizing faces, but not heads of cauliflower (hence the Uncanny Valley effect). The reason I bring that up here is because we're seeing something horrifyingly similar with the output of LLMs: when you're an expert in a subject matter, it becomes immediately clear that LLMs don't do better with, say, code than they do with a labelled bicycle - it's all trash. It's just that most people don't know what good code is or what it's supposed to look like. All LLM output is at best somewhat tangential to ground truth, and even then mostly by happy accident.

    It's useless. Do not trust their output; it's largely 'hallucinated' nonsense.
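
    To make the "statistical prediction" point concrete, here's a minimal word-level Markov chain trained on a made-up toy corpus (purely illustrative; real LLMs are vastly more elaborate, but the "predict the next word from statistics" spirit is the same):

        from collections import Counter, defaultdict
        import random

        corpus = "the cat sat on the mat and the dog sat on the rug".split()

        # "Training": count which word follows which in the corpus.
        transitions = defaultdict(Counter)
        for current, nxt in zip(corpus, corpus[1:]):
            transitions[current][nxt] += 1

        # "Generation": repeatedly sample the next word from the observed counts.
        word, output = "the", ["the"]
        for _ in range(8):
            if word not in transitions:
                break
            options, counts = zip(*transitions[word].items())
            word = random.choices(options, weights=counts)[0]
            output.append(word)
        print(" ".join(output))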

    • thank you for your response, the cauliflower anecdote was enlightening. Your description of it being a statistical prediction model is essentially my existing conception of LLMs, but that conception really only came from gleaning others' takes online, and I've recently been concerned it was an incomplete simplification of the process. I will definitely read up on Markov chains to try and solidify my understanding of LLM 'prediction'.

      I have kind of a follow-up, if you have the time. I hear a lot that LLMs are "running out of data" to train on. When it comes to creating a bicycle schematic, it doesn't seem like additional data would make an LLM more effective at a task like this, since it's already producing a broken amalgamation. It seems like these shortcomings of LLMs' generalizations generally would not be alleviated by increased training data. So what exactly is being optimized by massive increases (at this point) in training data -- or, conversely, what is threatened by a limited pot?

      I ask this because lots of people who preach that LLMs are doomed/useless seem to focus on this idea that their training data is limited. To me, their generalization seems like evidence enough that we are nowhere near the tech-bro dreams of AGI.

      • No, you're quite correct: Additional training data might increase the potential for novel responses and thus enhance the perception of apparent creativity, but that's just another way to say "decrease correctness". To stick with the example, if you wanted an LLM to yield a better bicycle, you should, if anything, be partitioning and curating the training data. Garbage in, garbage out. Mess in, mess out.

        To put it another way: novelty implies surprise, and surprise implies randomness. Correctness implies consistently yielding the single correct answer. The two are inherently opposed.

        If you're interested in how all this nonsense got started, I highly recommend going back and reading Weizenbaum's original 1966 paper on ELIZA. Even back then, he knew better:

        If, for example, one were to tell a psychiatrist "I went for a long boat ride" and he responded "Tell me about boats", one would not assume that he knew nothing about boats, but that he had some purpose in so directing the subsequent conversation. It is important to note that this assumption is one made by the speaker. Whether it is realistic or not is an altogether separate question. In any case, it has a crucial psychological utility in that it serves the speaker to maintain his sense of being heard and understood. The speaker further defends his impression (which even in real life may be illusory) by attributing to his conversational partner all sorts of background knowledge, insights and reasoning ability. But again, these are the speaker's contribution to the conversation.

        Weizenbaum quickly discovered the harmful effects of human interactions with these kinds of models:

        "I had not realized ... that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." (1976)

  • Look into /r/localllama.

    Download llama.cpp and mikupad, and experiment with some local models using raw prompt formatting.

    And look at the history of language modeling, not LLMs. And stay a million miles away from tech bros like Altman.

  • CGP Grey has a video about machine learning algorithms that is very accessible and gives a good visual sense of how they work. It's kind of stuck in my brain and I think back to it every now and then. I think the same explanation applies to LLMs.

    To sum up - they are good at making associations in data - but nobody knows which linkages are made or how/why the LLM's final results are produced. That's why you get outcomes like ChatGPT making statements with racist undertones - if there is racism in the training data (e.g., racist internet comments), those associations will carry through to the final result.

    https://youtu.be/R9OHn5ZF4Uo

  • I want to say there is a video where someone goes over building something close to a GPT from scratch.
