Insights about LLMs
PS: highly unstructured and unedited draft. But good enough to be valuable to you.
Insights from Dwarkesh Patel’s podcast
With longer context, Gemini was able to learn an obscure human language entirely in context.
During inference, the attention operation is linear with respect to context length; the quadratic cost is a problem only during training.
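A minimal numpy sketch of this asymmetry (illustrative single-head attention, not any particular model’s code): with a KV cache, decoding the next token attends over the n cached keys, an O(n) step, while training materialises the full n x n attention matrix at once.

```python
import numpy as np

d = 64  # head dimension (illustrative)

def decode_step(q, k_cache, v_cache):
    """One inference step: the new query attends over n cached keys -> O(n)."""
    scores = k_cache @ q / np.sqrt(d)              # (n,) -- linear in context length
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache                       # (d,)

def training_attention(Q, K, V):
    """Training: every position attends to all earlier ones -> O(n^2)."""
    scores = Q @ K.T / np.sqrt(d)                  # (n, n) -- quadratic in context length
    scores[np.triu(np.ones(scores.shape, bool), k=1)] = -np.inf  # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
K = rng.normal(size=(1024, d)); V = rng.normal(size=(1024, d))
out_all = training_attention(K, K, V)              # whole sequence at once, O(n^2)
out_one = decode_step(rng.normal(size=d), K, V)    # one new token, O(n)
```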
Longer context yields better problem-solving ability: predicting the next token becomes easier with more context. In that respect, LLMs are already superhuman. Superhuman context gives them a working memory that humans don’t have, which may have interesting implications. This is unique to LLMs.
More learning is now happening in the forward pass, instead of during fixed training.
GPT-2 -> GPT-3: the key shift was that GPT-3 could do meta-learning. It was a completely unexpected, emergent behaviour of scale.
By having more context, the model gets more forward passes during inference, and therefore can learn more complex things.
The human brain is recurrent in nature. It can spend more compute on harder problems and keep going deeper, but there’s always a finite number of forward passes it can do. Technically, human language supports infinite recursion, but practically you only see 5-7 levels because of the limits of human working memory. So adding more layers can get us to human intelligence, to the extent it gets us close to this level of recursion.
Anthropic has a framing of what LLMs are doing: they are like a neural computer, doing read/write operations.
70% of human neurons are in the cerebellum, which controls fine motor skills and attention. If you mathematically model the mechanism for retrieving information from a lossy source, it resembles the cerebellum architecture across a variety of species, and it looks similar to the attention mechanism of the neural networks we have today. So there’s a three-way convergence here.
Most intelligence is pattern matching and association. If you have a large enough hierarchy of associations, you can do pattern matching effectively.
If you query your brain for someone’s face in a rainy, dark setting, it returns something; you update the query until you get something that matches reality, and that in turn queries associated memories of that person. Similarly, when you query the letter A, it gives you B, and so on.
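A hedged sketch of that query-and-associate loop, written as the soft key-value lookup used in attention (the toy data and names are illustrative assumptions): a noisy cue retrieves a weighted blend of stored values, so recall is approximate and associative rather than exact.

```python
import numpy as np

def associative_recall(query, keys, values, sharpness=8.0):
    """Soft key-value lookup: returns a blend of values whose keys match the query.
    Retrieval from a lossy store is reconstructive -- neighbours bleed in."""
    scores = keys @ query * sharpness
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 16))                 # 5 stored cues
values = rng.normal(size=(5, 16))               # 5 associated memories
cue = keys[2] + 0.3 * rng.normal(size=16)       # a noisy version of cue #2
memory = associative_recall(cue, keys, values)  # mostly values[2], plus neighbours
```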
Human memory is reconstructive and linked with imagination. When you recall a memory, you get back a dense representation of it and then reconstruct it, so there are imaginary bits in there.
Sherlock Holmes’s ability to see clues, form a higher-level abstract representation of their associations, and pattern-match against them.
If intelligence is just associations all the way down, is the worry about an intelligence explosion justified? It would just be building higher-level associations, bounded by what humans can do, only faster.
Insights from Ilya about progress in LLM capabilities
Reliability, where we can fully trust the LLM not to miss critical details, e.g. a summary will not omit an obviously important point.
Progress is going to astound us.
Ability to clearly follow the instructions and intent of its operator.
Multimodality. Ingest and produce all modalities. Results in a better world model.
GPT-4 had fundamental architecture improvements that make it a better next-token predictor. Focusing on predicting the next token will lead to universal reasoning.
Reliability jumps are to be expected
Notes from the a16z podcast between Marc and Ben
Build data moats and let better AI help you, e.g. a research app for investors and founders with data from curated sources like the a16z podcast.
Alignment makes models dumber.
Models will know more and hallucinate less, so think of all the axes and see where growth is inevitable: context length, speed, problem solving.
Tests for superhuman reasoning skills.
An LLM simulates average intelligence from its base prompt, but data from smart humans is in there. It needs better prompting to access that part of the latent space; you can unlock the latent supergenius. Maybe you can finetune on data from smart people only.
Can AI invent new physics? That kind of intelligence is one in three billion among humans.
I think yes. The building blocks are imagination and rationality, and you can do both with AI. The question is how much compute you are willing to spend on it. A smarter AI will get there on a smaller compute budget.
Given a baseline intelligence, you can just let it spend more compute to simulate higher intelligence.
AI is currently better at validation and scoring than at generation. People miscalculate this because they anthropomorphise it: how can something be better at evaluation without being equally good at generation? But they are two separate functions inside the LLM, and one can be better than the other.
One alpha is to find such capability pairs, where one side is superhuman in the LLM and can aid the human with its counterpart; a sketch follows.
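A minimal sketch of exploiting one such pair: spend more compute on generation and let the stronger evaluator pick. `generate` and `score` are hypothetical stand-ins for an LLM sampling call and a verifier, not a real API; it also makes the earlier point concrete, trading compute for simulated intelligence.

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM completion."""
    return f"candidate-{random.randint(0, 9999)} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for the validator -- the often-stronger half."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Trade compute for quality: sample n candidates, keep the one the
    evaluator rates highest. Raising n simulates higher intelligence."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("draft a summary that misses no key detail", n=8))
```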
Future improvements are going to come from:
More compute, and training for longer on the same data. Currently there’s a constraint on that.
More data. There’s lots of it in the world and startups are getting it.
Higher-quality data. Microsoft trained a good model on a smaller, higher-quality dataset.
Better talent being funneled in.
GPT Wrapper Argument
Traditionally the platform layer is just a building block. The product then uses it to solve a problem for the customer through its unique understanding of their pain points and workflows.
If you are building something analogous to this, it is likely to work: you use the platform to extract a functionality that the platform is not designed for, and getting that functionality out of the platform is really hard. That’s the way to go.
Work backward from pricing and the value you can offer, e.g. debt collection using AI agents. OpenAI’s GPT is not going to collect debt on your behalf; you need specific integrations and domain knowledge on top of base intelligence. You can use this as a test for your idea by asking how much you can charge. Value capture tells you a lot about the defensibility of your idea: can you charge for the value you are creating, or are you just charging for the extra work you did by creating a wrapper, hoping your customer won’t do it themselves?
It’s clear that the model layer is going to be commoditised, and people will just plug and play different models. So the value is going to accrue to the tools and orchestration layer.
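If that holds, the plug-and-play point reduces to a thin interface like this sketch (the provider classes and their wiring are hypothetical, not real SDK calls); the prompting, domain data, and workflow around it are where the value sits.

```python
from typing import Protocol

class Model(Protocol):
    def complete(self, prompt: str) -> str: ...

class HostedModel:
    def complete(self, prompt: str) -> str:
        return "..."  # hypothetical: call a hosted API here

class LocalModel:
    def complete(self, prompt: str) -> str:
        return "..."  # hypothetical: call a local model here

def answer(model: Model, question: str) -> str:
    """The app's value lives here -- prompting, domain data, workflow --
    while the model underneath stays swappable."""
    return model.complete(f"Context: our domain data...\nQ: {question}")

# Swapping providers is one line; the tooling layer keeps the value.
print(answer(LocalModel(), "Summarise this week's accounts"))
```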
See also Introspection for Self Alignment
Economics of Models
Better models increase software quality and decrease the number of developers needed to create it. Paradoxically, this can create more demand for software and raise quality expectations, because there’s more competition. Similarly, the use of CGI in Hollywood raised audience expectations, so moviemaking costs even more now despite the efficiency improvements.
Humans have an ability to come up with new things they need. What are we going to need next?
More things up the ladder
Demand for software is perfectly elastic. As price goes down, demand goes up. As soon as constraints go down, people always find a way to automate more things.
Things that are high-dimensional are a good fit for AI, like a daily medical diagnosis based on your biomarkers and bloodwork.
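A toy sketch of why high-dimensional inputs suit models: a few hundred correlated biomarkers are impossible to weigh by eye but routine for a fitted model. The data is synthetic, and scikit-learn is assumed available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 300                      # 300 biomarkers per person: beyond eyeballing
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)           # some unknown combination drives risk
y = (X @ true_w + rng.normal(scale=2.0, size=n)) > 0

clf = LogisticRegression(max_iter=1000).fit(X, y)
today = rng.normal(size=(1, d))       # today's bloodwork
print(f"estimated risk: {clf.predict_proba(today)[0, 1]:.2f}")
```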
Moats
“Data is a moat” is the new cliché, but it’s not a moat. The huge amount of data on the internet trumps your specific data, which may be useful only in corner cases. There’s no real market for data: if data had value, you’d see large marketplaces for it. There’s only a small marketplace.
a16z improved their investor-relations product using their data on company performance: their LPs can ask an AI about the current track record.
A very specific kind of data has value. In most cases you can just increase your own competitiveness by using it.
Nobody has data they can directly sell to others. But most have data they can feed to an intelligence and improve their business.
Large companies are probably using personal data, given that they are not releasing any information around it.
Internet vs AI
The internet was a network; AI is more like a new kind of computer.
That decides the kind of competitive dynamics you’ll have and the opportunities.
Internet enabled applications that run on top of networks and enable network effects at scale and positive feedback loops.
There are some network effects in AI, but it is more like a microprocessor: information goes in, and an output comes out.
LLMs are probabilistic computers.
There’s composability with AI.
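A small illustration of the “probabilistic computer” framing, with toy logits: the same input yields a distribution over outputs, and temperature sets how deterministic the machine behaves.

```python
import numpy as np

rng = np.random.default_rng()

def sample_token(logits, temperature=1.0):
    """Unlike a microprocessor, the output is drawn from a distribution."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(logits), p=p)

logits = np.array([2.0, 1.5, 0.2])    # toy scores over a 3-token vocabulary
print([sample_token(logits) for _ in range(10)])        # varies run to run
print([sample_token(logits, 0.1) for _ in range(10)])   # near-deterministic
```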
The lessons from early computer market are more applicable.
Original computers were few and large in size, dominated by large corps.
People thought that very few people would need compute, like mega corps only.
Nowadays there’s the idea that there will be a few large god models. But if we follow the computer trajectory, we now have chips everywhere, in all shapes and sizes.
Modern cars have 200 computers. Everything is running on electricity, has a chip and is connected to internet.
It forms a kind of tree, with large supercomputers sitting in datacenters, IoT devices at the leaves, and PCs and smartphones in the middle. The kind you use depends on what you need.
This means we’ll have different sizes of models. It’ll be an ecosystem of models.
Earlier the complexity of using computers was high.
AI is the easiest computer to use because it uses English. What is the lock-in here: size, price, speed, choice? Do you have free choice across these dimensions, or are you locked in to the God model?
Every AI company is going to get funded, a lot of them will go bust, and there’ll be an overbuild of chips. Some chip companies will go bankrupt. Investors will lose a lot of money. We don’t need that many AI companies.
This is just the nature of all technology. Hype cycles help us build the infra.
The internet went through an open phase. Networks used to be proprietary; then the internet showed up and everything opened up, until some big companies ended up owning the discovery part, like Google.

