AI Computing Hardware - Past, Present, and Future
en
January 29, 2025
TLDR: This podcast episode dives into the history, current state, and future of the computer hardware that drives AI development. The technical topics discussed are a historical recap of AI and hardware, the rise of GPUs, scaling laws, memory and logic in AI hardware, challenges in AI hardware, and the future of AI compute.

In this detailed podcast episode, hosts Andrey Kurenkov and guest Jeremy Harris dive deep into the intricate world of AI computing hardware, exploring its evolution from the early days to its current state and future capabilities. With a strong focus on trends in data center investments and the rapid advancements in AI hardware technologies, this discussion sheds light on several pivotal topics.
Historical Context of AI Hardware
The journey of AI and hardware began much earlier than most realize. Key milestones include:
- Early Concepts: Alan Turing's theories in the 1950s laid the groundwork for AI, with early systems like Marvin Minsky's Stochastic Neural Analog Reinforcement Calculator (SNARC) simulating learning behaviors.
- The Birth of Neural Networks: In the late 1950s, Frank Rosenblatt created the perceptron, marking the first demonstration of neural networks.
- Custom Hardware Development: The 70s and 80s saw the rise of custom hardware designed specifically for AI applications, such as Lisp machines.
- The GPU Revolution: GPUs emerged in the late 1990s for 3D graphics; their massively parallel design later revolutionized AI training and laid the foundation for the deep learning breakthroughs of the 2010s.
Key Developments in AI Hardware
The Rise of GPUs and Deep Learning
- Parallelism in AI Models: GPUs made it feasible to train complex models, significantly speeding up computations and enhancing efficiency.
- Major Papers: The publication of AlexNet in 2012 showcased how GPUs could effectively train deep neural networks, pushing the boundaries of AI capabilities.
Scaling Laws in AI Models
- The emergence of scaling laws in AI models has led to a better understanding of how increasing model size correlates with improved performance.
- OpenAI's Innovations: OpenAI's work in large-scale models like GPT-3 illustrated that larger datasets and models yield better AI performance.
Current Trends and Future Predictions
AI Hardware Landscape
- Customized AI Chips: Many companies are now developing custom chips for specific tasks, demonstrating a shift towards tailored hardware solutions.
- Emerging Challenges: The podcast discusses the challenges posed by the "memory wall"—the disparity between memory access speed and logic processing power.
Major Technologies Impacting AI Hardware
- Moore's Law vs. Huang's Law: While the transistor-density gains described by Moore's Law have slowed, Huang's Law reflects continued exponential growth in GPU performance driven by advances in architecture and parallelism, growth that is essential for AI.
Future Directions in AI Computing
- The ongoing need for larger data centers and more powerful computing systems is prompting massive investments in cutting-edge AI hardware.
- Understanding semiconductor fabrication processes is crucial, with companies like TSMC leading the charge in producing state-of-the-art chips. The complexities of creating memory components and processing units will shape the future of AI capabilities.
Conclusion: The Intersection of AI and Hardware Development
This episode compellingly argues that the future of AI is inextricably linked to advancements in computing hardware. As AI models become increasingly complex, the demand for tailored hardware solutions grows, pushing the boundaries of what is possible. The intricate relationship between AI advancements and the hardware designed to support them is critical to understanding the evolving landscape of artificial intelligence.
Key Takeaways
- Understanding History: Recognizing the historical context of AI and hardware can inform current trends and future predictions.
- Emphasis on Customization: Companies are leaning towards custom hardware to meet specific AI needs, reflecting a shift in how AI is being developed and deployed.
- Continued Investment Needed: Sustained growth in AI applications will require significant investment in data centers and advanced hardware solutions.
This podcast episode not only delves into the technicalities and nuances of AI hardware but also touches upon broader implications for the future of AI technologies, making it a must-listen for enthusiasts and industry professionals alike.
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. Unlike usual, in this episode we will not summarize or discuss some of last week's most interesting AI news. Instead, this is our long-promised episode on hardware. We'll get into a lot of detail, basically do a deep dive unrelated to any AI news, but I guess related to the general trends we've seen this past year with a lot of developments in
hardware and crazy investments in data centers. So to recap, I am one of your hosts, Andrey Kurenkov. I studied AI, and I now work at a startup. Yeah, I'm Jeremy Harris. I'm the co-founder of Gladstone AI, an AI national security company. And I guess, just by way of context on my end too, on the hardware piece: the work that we do is focused on the kind of WMD-level risks that come from advanced AI, current and increasingly future systems.
So my footprint on this is I look at AI a lot through the lens of hardware because we're so focused on things like export controls. How do we prevent China, for example, from getting their hands on the stuff? What kinds of attacks? One of the things we've been looking into recently, what kinds of attacks can people execute against highly secure data centers in the West? Whether that's to exfiltrate models, whether that's to change the behavior strategically of models that are being trained, whether that's just to blow up facilities,
So a lot of our work is done these days with special forces and folks in the intelligence community as well as increasingly some data center companies to figure out how to secure these sites.
and obviously all the kind of US government work that we've been doing historically. So that's kind of my lens on it and obviously the alignment stuff and all that jazz. So I guess I know enough to be dangerous on the AI and compute side, but I'm not a PhD in AI and compute, right? My specialization is I know what I need to know for the security piece.
And so to the extent possible, we'll try to flag some resources that people can feel free to check out if you're interested in doing those deeper dives on some of the other facets of this, especially compute that doesn't have to do with AI, compute that's not national security related. So hopefully that's useful.
Yeah, and I guess worth flagging on my end: I studied the software side of AI, I trained algorithms, so I have relatively little understanding of how the hardware all works. Actually, I just use GPUs and kind of broadly know what they do.
But I'll be here listening and learning from Jeremy as well, I'm sure. I'm sure it'll go both ways. I'm excited for this. Anyway, I think there's a lot of opportunity here for us to cross-pollinate. Let's just get into it. So I thought to begin, before we dive into the details of what's going on today, we could do a quick historical recap of fun details in the past of AI and hardware.
There's some interesting details there. AI and hardware go back to basically the beginning, right? Turing was a super influential person within the world of computing, and then the Turing game, right? His invention to try and...
I guess measure when we'll get AI or AGI, as you might say, and that's still widely discussed today. So even before we had actual computers that were general purpose, people were thinking about it. By the way, that imitation game piece, in a way, it's freakish how far back it goes.
I've never read Dune, but I know there's a reference in there to the Butlerian Jihad. So Butler, back in the 1860s, I'm showing off how little I know my dates here, but he was the first to observe that you could get like, hey, these machines seem to be popping up all around us. We're industrializing. We're building these things.
What if one day we start building machines that can help us build other machines? Eventually, will they need us? It wasn't with respect to compute or anything like that, but it's an interesting thing when you look back at how incredibly prescient some people were about this sort of thing. Anyway, sorry, I didn't mean to derail, but you're getting at a great point here, that it goes way, way before the days of the early 2000s, when people started to worry about loss of control.
Yeah, wow. You also reminded me that it's called the imitation game. The Turing game is not a thing; there's the Turing test, which people call the imitation game, as it was originally published. Anyways, so yeah, it was conceptually, of course, on people's minds for a very long time, the concept of AI, robotics, et cetera. But even as we go into the 50s and get into actual computing, still with vacuum tubes,
not even getting to semiconductors yet, there's the beginnings of AI as a field in that time. So one of the very early initiatives that could be considered AI was this little program that played checkers. And you can go as early as 1951 there where someone wrote a program to do it.
And then, yeah, there's a couple examples of things like that in that decade that showcased the very first AI programs. So there was a program from Marvin Minsky, actually called the Stochastic Neural Analog Reinforcement Calculator. I actually just learned about this in doing prep for the show, and I found it quite interesting. This was actually a machine that Marvin Minsky built in hardware, and it
simulated rats learning in a little maze, trying to simulate reinforcement learning, as there were also theories coming out about human learning, brain learning, et cetera. And to give you some context, it had maybe 40 neurons, I forget, some small number. Each neuron had six vacuum tubes and a motor, and the entire machine was the size of a grand piano with around 300 vacuum tubes. So
they had that early example of a custom-built computer for this application. That's actually one thing, too, right? In the history of computing, everything was so custom for so long. That's something that's easy to lose sight of. The idea of building these very scalable
modules of computing, you know, having ways to integrate all these things together, that wasn't really until Intel came into the game. That was their big thing at first, as I recall. The thing that broke Intel in was like, hey, we'll just come up with something. It's not bespoke, so it won't be as good at a specific application, but boy can it scale. All the time before that, you have all these, like you said, ridiculously bespoke kinds of things. So it's more almost physics, in a sense, than computer science, if that makes sense. Yeah.
Exactly, yeah, it was a lot of people pulling together and building little machines to demonstrate, really, theories about AI. There's another fun example I found with the famous IBM 701 and 702. IBM was just starting to build these massive
mainframes that were kind of the main paradigm for computing for a little while, especially in business. So the IBM 701 was the first commercial scientific computer. And there was Arthur Samuel, who wrote a checkers program for it. It was maybe one of the first, definitely one of the first learning programs that was demonstrated. So it had very primitive machine learning built into it. It had
memorization as one idea, but then also some learning from experience. And that's one of the very first demonstrations of something like machine learning. Then famously, there's also the perceptron, which goes back to 1958, 1959. And that is sort of the first real demonstration, I would say, of the idea of neural nets, famously by Frank Rosenblatt.
Again, a custom-built machine at that point. If you look, there are photos of it online, and it looks like this crazy tangle of wires that built a tiny neural net that could learn to differentiate shapes. And at the time, Rosenblatt and others were very excited about it. And then, of course, a decade later, the excitement died out for a little while.
And then there's some interesting history, which we won't be getting into much, later in the 80s with custom-built hardware. There was custom hardware for expert systems that was being sold and bought for a little while. There was this thing called Lisp machines, where Lisp was a pretty major
language in AI for quite a while. It was developed kind of to write AI programs. And then there were custom machines called Lisp machines that were utilized by, I guess, scientists and researchers that were doing this research going into the 70s and 80s, when there was a lot of research in the realm of, I guess, logical AI and search and so on, symbolic AI.
Then, again, to continue the quick recap of the history of AI and computing, we get into the 80s and 90s. So the Lisp machines and the expert-system hardware died out. This is where, as you said, I guess the beginning of general-purpose computing proper happened, with Intel and Apple and all these other players
making hardware that doesn't have to be these massive mainframes, that you could actually buy more easily and distribute more easily. And so there are kind of fewer examples of hardware details, aside from what would become Deep Blue in the late 90s. IBM was working on this massive computer specifically for playing chess. And I think a lot of people might not know this, that Deep Blue
wasn't just a program. It was like a massive investment in hardware so that it could do these ridiculously long searches. It was really not a learning algorithm to my knowledge. Basically, it was doing kind of a
well-known search-with-heuristics approach to chess, with some hard-coded evaluation schemes. But to actually win at chess, they had to build some crazy hardware specialized for playing chess. And that was how we got that demonstration, without any machine learning of the sort we have today.
And let's finish off the historical recap. So of course, we had Moore's Law all throughout this; computing was getting more and more powerful. So we saw research into neural nets making a comeback in the 80s and 90s. But I believe at that point, people were still using CPUs and trying to train these neural nets without any sort of parallel computing, as is the common paradigm today.
Parallel computing came into the picture with GPUs, graphics processing units that were needed to do 3D graphics, right? And so there was a lot of work starting in around the late 90s and going into the 2000s. That's how NVIDIA came to be, by building these graphics processing units that were in large part for the gaming market. And then kind of throughout the 2000s,
before the 2010s, a few groups were finding that you could then use these GPUs for scientific applications. You could solve, for instance, general linear algebra problems. And so this was before the idea of using them for neural nets, but it kind of bubbled up
to the point that by 2009, there was some work by Andrew Ng applying it. There was the rise of CUDA, where you could actually program these NVIDIA GPUs for whatever application you want. And then, of course, famously in 2012, there was the AlexNet paper, where
we had the AlexNet neural net, one of the first deep neural nets that was published, and it destroyed the other algorithms being used at the time on the ImageNet benchmark. And to do that, one of the major novelties of the paper, and why it succeeded, was that they were among the first to use GPUs to train this big a network. They probably couldn't have done it otherwise.
They used two NVIDIA GPUs to do this, and they had to do a whole bunch of custom programming to even be able to do that. That was one of the major contributions of the students. And that was kind of when, I think,
NVIDIA started to go more in the GPU-for-AI direction. They were already going deeper into it. They wrote cuDNN, right? Yeah, and they were starting to kind of specialize their hardware for various reasons. They started creating architectures that were better for AI, the Kepler architecture, Pascal,
et cetera. Again, for some historical background, maybe people don't realize that way before GPT, way before ChatGPT, the demonstrations of deep learning in the early 2010s were already accelerating the trend towards investment in GPUs, towards building data centers.
Definitely by the mid-2010s, it was very clear that you would need deep learning for a lot of stuff, for things like translation. Google was already making big, big, big investments in it, buying DeepMind, expanding Google Brain. Of course, investing in TPUs in the mid-2010s, when they developed their first custom AI chip.
And so throughout the 2010s, AI was already on the rise. Everyone was already of the mindset that bigger is better, that you want bigger neural nets, bigger data sets, all of that. But then, of course, OpenAI realized that that should be cranked up to 11. You shouldn't just have 10 million or 100 million parameter models, you've got to have billion-parameter models, and that was
their first, well, they had many innovations, but their breakthrough was in really embracing scaling in a way that no one had before. So I think one of the things that's worth noting there is this rough intuition, and you can hear pioneers like Geoff Hinton talk about the general sense that more data is better, larger models are better, all that stuff. But what really comes with the Kaplan paper, right, that famous Scaling Laws for Neural Language Models paper, and the proof point that was GPT-3,
and GPT-2, in fairness, as well, and GPT-1, but what really comes from the GPT-3 inflection point is the actual scaling laws. For the first time, we can start to project with confidence how good a model will be. And that makes it a lot easier to spend more CapEx. Now all of a sudden,
it's a million times easier to reach out to your CTO, your CEO, and say, hey, we need $100 million to build this massive compute cluster, because look at these straight lines on these log plots. So it changed the economics, because it decreased the risk associated with scaling.
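To make that concrete, here is a minimal sketch of the Kaplan-style power law that lets you extrapolate loss from model size. The functional form follows the scaling-laws paper; the exponent and constant below are rough values quoted from memory and should be treated as illustrative rather than the paper's exact fitted numbers.

```python
# Kaplan-style scaling law: loss falls as a power law in parameter count,
#   L(N) ~ (N_c / N) ** alpha_N
# alpha_N and N_c below are rough, illustrative values, not authoritative fits.

ALPHA_N = 0.076    # approximate parameter-scaling exponent
N_C = 8.8e13       # approximate normalization constant

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:10.0e} params -> predicted loss {predicted_loss(n):.3f}")

# The point is the straight line on a log-log plot: every 10x in parameters
# buys a predictable multiplicative reduction in loss, which is what makes the
# "give us $100M for a cluster" conversation tractable.
```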
That's right. And I think the story of OpenAI, in hindsight, can almost be seen as the search for the thing that scales, right? Because for the first couple of years, they were focusing on reinforcement learning. Some of their major kind of PR stories, you could say, but also papers, were about reinforcement learning for Dota,
for the video game Dota. And even at the time, they were using a lot of compute, really spending a lot of money training programs, but in a way that didn't scale, because reinforcement learning is very hard and you can't simulate the world very well. They also were investing in robotics a lot, and they had this whole arm, and they did a lot of robotic simulations, but again, it's hard to simulate things, so that wouldn't scale. Evolutionary algorithms were another thread, right?
Yeah, they did a whole bunch of things right from 2015 up through 2018. And then 2017 was the Transformers paper, of course, and then around 2018, the whole kind of idea of pre-training for natural language processing arose.
So from the very beginning, well, not the very beginning, but pretty soon after AlexNet, around 2014, people realized that if you train a deep convolutional neural net on classification, you could then use those embeddings in a general way. So the kind of intelligence there was reusable for all sorts of vision applications.
And you can basically bootstrap training from a bunch of weights that you already trained, you don't need to start from scratch, and you don't really need as much data for your task. So it didn't happen in natural language processing until around 2017-2018. That was
when language modeling was kind of seen, or found out by a few initiatives, to be a very promising way to pre-train weights for natural language processing. BERT is one of the famous examples from around that time. And so the first GPT was developed in that context. It was one of the first big investments in pre-training a transformer on the task of language modeling.
And then, OpenAI, I guess, we don't know the exact details, but it seems like they probably were talking internally and...
got the idea that, well, for this task, you can just scrape the internet to get all the data you want. So the only question is, how big can you make the transformer? The transformer is a great architecture for scaling up, because you can parallelize on GPUs, unlike with RNNs. So that was kind of necessary, in a way.
And yeah, when we got GPT-2 in 2019, that was almost a 2 billion, like a 1.5 billion parameter model, by far the biggest that anyone had ever trained. And even at the time, it was interesting, because you had these early demos,
like on the blog where it wrote a couple paragraphs about that unicorn island or whatever. Already at that time, there was discussion of the safety implications of GPT-2 and misinformation and so on, which was an anomaly for them, right, because they'd open sourced GPT, well, GPT-1,
and they had set this precedent of always open sourcing their models, hence the name, actually, OpenAI. GPT-2 was the first time they experimented with what they at the time called this staged release strategy, where they would release incrementally larger versions of GPT-2 over time and monitor how, supposedly, they were seeing them get maliciously used, which,
it was always implausible to me that you'd be able to tell whether it was being used maliciously on the internet when it's an open source model, but OK. And then ultimately, GPT-3 was closed. So yeah, they followed, as you say, that kind of smooth progression. Yeah, speaking of
the lead-up to GPT-2, there's also what we know now from looking at the emails in the OpenAI versus Elon Musk case. It was never the plan. Yeah, some of the details there are that
The conversations in 2018 and why they started to go for profit is that they did have the general kind of belief that hardware was crucial, that Google had all the hardware. And so Google would be the one to get to AGI. And so they needed the money to get more hardware to invest more in training. And that's what kicked off all this for profit discussions in 2018.
and led eventually to Sam Altman somehow securing $10 billion from Microsoft. I forget when this was announced, maybe 2019? I think there was an initial $1 billion investment, which I think was 2019, and then there was maybe a 2021-ish $10 billion. Okay, yeah, that sounds right. Sounds like $1 billion is more reasonable.
So yeah, I think OpenAI was one of the first to really embrace the idea that you need what we now know as massive data centers, and crazily parallelized training for crazily large neural nets. And they were already going down that route with the Dota agent, for instance, where they were training on very large clusters, and even at that time it was very challenging.
Anyways, when we get to GPT-3, we get to 175 billion parameter models, we get to scaling laws, and we get to in-context learning.
And then by that point, it had become clear that you could scale and you could get to very powerful language models. And the whole idea of in-context learning was kind of mind-blowing. Somehow, everyone was still not kind of convinced enough to invest. Like looking back, it's kind of interesting that Meta and Google and so on, weren't training massive neural nets.
language models. I think internally, Google was, to some extent, but they were not trying to commercialize it. They were not trying to push it forward. And then, of course, you had ChatGPT in 2022, with GPT-3.5, I think, at the time,
that blew up. And now everyone cares about massive neural nets, massive language models, and everyone wants massive data centers and is fighting over the electricity needed to fuel them. Elon Musk is buying a hundred thousand GPUs, and hardware is like a huge, huge part of the story, clearly.
By the way, think about what the story of hardware is, in a sense. We are talking about the story of the physical infrastructure that very plausibly will lead to superintelligence in our lifetime. I think there almost isn't anything more important that's physical to study and understand in the world.
It's also, we're lucky, because it's a fascinating story. We're not just talking about egos and billions of dollars chasing after this stuff. At a scientific level, it's fascinating. At a business level, it's fascinating. Every layer of the stack is fascinating. That's one of the reasons I'm so excited about this episode, but you framed it up really nicely. What is this current moment? We have the sense that scaling, in the form of scaling compute, scaling data, and scaling model size, which is
relatively easier to do, is king. The bitter lesson, the Rich Sutton argument that came out right before Scaling Laws for Neural Language Models, in the 2019 era, says basically, hey, all these fancy AI researchers are running around, coming up with new fancy architectures, and thinking that's how we're going to make AGI. And I know you want that to be the fancy way we get AGI.
Unfortunately, human cleverness just isn't the factor we would have hoped it was. It's so sad. It's so sad. That's why it's the bitter lesson. Instead, what you ought to do, really, this is the core of the bitter lesson, is get out of the way of your models, just let them be, let them scale. Just take a dumb-ass model and scale it with tons of compute, and you're going to get something really impressive. And he was alluding in part to the successes of
early language modeling and also reinforcement learning. So it wasn't clear yet what the architecture was that would do this; very soon it would turn out to clearly be the transformer. But, you know, you can improve on that. Really, the way to think about models, or architectures, is that they're just a particular kind of funnel that
takes the compute you pour in at the top and shapes it in the direction of intelligence. They're just your funnel. They're not the most important part of it. There are many different shapes of funnel that will do, many different aperture widths and all that stuff. And you know, if your funnel is kind of stupid, well,
Just wait until compute gets slashed in cost by 50% next year, or the year after, and your same stupid architecture is going to work just fine. So there's this notion that even if we are very stupid at the model architecture level, as long as we have an architecture that can take advantage of what our hardware offers, we're going to get there. That's the fundamental idea here. And what this means at a very deep level is that the future of AI
is deeply and inextricably linked to the future of compute. And the future of compute, that starts having us ask questions about Moore's Law, right? Like this fundamental idea, which, by the way, going historically just for a brief second here to frame this up: this was back in 1965. Moore basically comes up with this observation. You know, it's not,
he's not saying it's a physical law, it's just an observation about how the business world functions, or at least the interaction between business and science. We seem to see, he says at the time, that the number of components, the number of transistors that you can put on an integrated circuit, on a chip,
seems to double every year. That was his claim at the time. Now, we now know that that number actually isn't quite doubling every year. Moore, in fact, in 1975, came back and updated his timeframes: it's not every year, it doubles every two years. And then there was a bunch of argument back and forth about whether it should be 18 months.
The details don't really matter. The bottom line is you have this stable, reliable increase, exponential increase, right, doubling every 18 months or so in terms of the number of components, the number of computing components, transistors that you put on your chip. And that means you can get more for less, right? Your same chip can do more intelligent work.
Okay, that's basically the trend, the fundamental trend that we're gonna ride all through the years. And it's gonna take different forms, and you'll hear people talk about how Moore's Law is dead and all that stuff. None of that is correct, but it's incorrect for interesting reasons, and that's gonna be part of what we'll have to talk about in this episode. And that's really the kind of landscape that we're in today. What is the jiggery-pokery? What are the games that we're playing today to try to keep Moore's Law going?
And how has Moore's law changed in the world where we're specifically interested in AI chips? Because now we're seeing a specific Moore's law for AI trend that's different from the historical Moore's law that we've seen for integrated circuits over the decades and decades that made Moore famous for making this prediction.
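As a quick back-of-envelope aside on what a fixed doubling period implies (the starting count and period below are illustrative placeholders, not data for any real chip line):

```python
# Illustrative only: how a Moore's-Law-style doubling compounds over time.

def transistor_count(years_elapsed: float,
                     start_count: float = 2_300,          # an early-1970s-class chip, roughly
                     doubling_period_years: float = 2.0) -> float:
    """Project transistor count under a simple fixed-period doubling."""
    return start_count * 2 ** (years_elapsed / doubling_period_years)

for years in (10, 20, 30, 40, 50):
    print(f"after {years:2d} years: ~{transistor_count(years):,.0f} transistors")

# Fifty years at a two-year doubling is 2**25, roughly a 33-million-fold increase,
# which is why "more for less" compounds so dramatically.
```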
And on that point, actually, this I think is not a term that's been generally utilized, but it has been written about, and NVIDIA actually called it out: there's now the idea of Huang's Law, where the trend in GPUs has been very much in line with Moore's Law, or even faster. You start seeing, again in the early 2010s,
the start of the idea of using them for AI, and then sort of the growth of AI almost goes hand in hand with the improvements in the power of GPUs. And in particular, over the last few years, you just see an explosion in the power, the cost, the size of the GPUs being developed, once you get to the H100. It's,
I don't know, like 1,000x, some big, big number compared to what you had just a decade prior, probably more than 1,000x. So yeah, there's kind of the idea of Huang's Law, where the architecture and the, I guess, development of parallel computing in particular
has this exponential trend. So even if the particulars of Moore's Law, which is about the density you can achieve at the nanoscale with semiconductors, even if that might be saturating due to inherent physics,
the architecture and the way you utilize the chips in that parallelized computing hasn't slowed down, at least so far. And that is a big part of why we are where we are. Absolutely. And in fact, that is a great segue into peeling the onion back one more layer, right? So we have this general notion of Moore's Law, and now Andrey is like, but there's also Huang's Law. So how do you get from 2x every 18 months or so
to, all of a sudden, you know, something closer to like 4x every two years, depending on the metric you're tracking? And this is where we have to talk about what a chip is actually doing. What are the core functions of a chip that performs any kind of task? And there are two core pieces that I think we're focusing on today, because they're especially relevant for AI.
Number one, you have memory. You got to be able to store the data that you're working on. And then number two, you have logic. You got to have the ability to do shit to those bits and bytes that you're storing. It kind of makes sense. Put those two things together. You have a full problem solving machine. You have the ability to store information. You have the ability to do stuff to that information. Carry out mathematical operations.
Memory, storage, and logic, the, yeah, the logic, the reasoning, well, not reasoning, the math, the number crunching, right.
And so when we actually kind of tease these apart, it turns out, especially today, it's very, very different. It's a very, very different process, very, very different skill set that's required to make logic versus to make memory. And there are a whole bunch of reasons for that that have to do with a kind of architecture that goes into making like logic cells versus memory cells and all that stuff. But we'll get into that later if it makes sense. For now, though, I think the important thing to flag is
logic and memory are challenging to make for different reasons, and they improve at different rates. So if you look at logic improvements over the years, the ability to just pump out FLOPS, floating point operations per second, how quickly can this chip crunch numbers, there you see very rapid improvements. And part of the reason for that, a big part of the reason, is that if you're a fab that's building logic,
then you get to focus on basically just one top line metric that matters to you, and that's generally transistor density. In other words, how many of these compute components, how many transistors can you stuff onto a chip? That's your main metric. You care about other things like power consumption and heat dissipation, but those are pretty secondary constraints. You've got this one clean focus area.
In the meantime, if you care about memory, now you have to worry about not one key kind of KPI, you're worried about basically three main things. First off, how much can my memory hold? What is the capacity of my memory?
Second, how quickly can I pull stuff from memory, which is called latency? So basically, you can imagine, you have a bucket of memory, and you're like, I want to retrieve some bits from that memory. How long am I going to have to wait until they're available to me to do math on them? That's the latency. So we have capacity. How much can the bucket hold? Latency, how long does it take to get shit from the bucket? And then there's bandwidth. How much stuff can I pull from that memory at any one time?
And so if you're optimizing for memory, you have to optimize these three things at the same time. You're not focused exclusively on one metric and that dilutes your focus. And historically, something's got to give and that thing is usually latency. So usually when you see memory improvements, latency hasn't really gotten much better over the years.
capacity and bandwidth have; they've gotten a lot better. So you can sort of start to imagine, depending on the problem you're trying to solve, you may want to optimize for really high capacity, really high bandwidth, really low latency, which is often more the case in AI, or some other combination of those things. So already, we've got the elements of chip design starting to form a little bit, where we're thinking about: what's the balance of these things that we want to strike?
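A toy way to see how these three metrics interact: the time to pull a chunk of data is roughly a fixed latency plus the size divided by bandwidth, so big transfers amortize the latency cost. A minimal sketch with made-up numbers (not the specs of any real memory part):

```python
# Toy model: time for one transfer = fixed latency + bytes / bandwidth.
# Latency and bandwidth figures are illustrative placeholders.

LATENCY_S = 100e-9          # 100 ns fixed cost per request
BANDWIDTH_B_PER_S = 1e12    # 1 TB/s sustained bandwidth

def fetch_time(transfer_bytes: float) -> float:
    return LATENCY_S + transfer_bytes / BANDWIDTH_B_PER_S

for size in (1e3, 1e6, 1e9):   # 1 KB, 1 MB, 1 GB transfers
    t = fetch_time(size)
    print(f"{size:10.0e} bytes -> {t * 1e6:10.2f} us, "
          f"effective bandwidth {size / t / 1e9:7.1f} GB/s")

# Small transfers are dominated by the fixed latency and see terrible effective
# bandwidth; large transfers approach the peak. This is why batching data to
# amortize latency keeps coming up later in the discussion.
```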
And historically, one of the challenges that's come up from this is that, as I said, latency is the thing that's tended to be kind of crappy, because when it comes to memory, people are focused on capacity and bandwidth: how much can I pull at once, and how big is my bucket of memory?
Because latency kind of sucks, because it's been improving really slowly, while our logic has been improving really fast, right? We're able to stuff a whole bunch of transistors on a chip. What tends to happen is there's this growing disparity between your logic capability, like how fast you can number crunch on your chip, and how quickly you can pull in fresh data to do new computations on.
And so you can kind of imagine the logic part of your chip, like, it's just crunched all the numbers, crunched all the numbers, and then it's just sitting there, twiddling its thumbs, while it waits for more memory to be fetched so it can solve the next problem. And that disparity, that gap, is basically downtime. And it's become an increasing problem because, again, transistor density, logic, has been improving crazy fast in AI, but latency has been improving much more slowly. And so you've got this, like,
crazy high capacity to crunch numbers, but this relatively long delay between subsequent rounds of memory inputs. And this is what's known as the memory wall, or at least it's a big part of what's known as the memory wall in AI. So a big problem structurally in AI hardware is how do we overcome this?
And there are a whole bunch of techniques people work on to do this, trying to do things like, anyway, staggering your memory input so that your memory is getting fetched while you're still number crunching on that previous batch of numbers, so that they overlap to the maximum extent possible.
All kinds of techniques. But this is kind of the fundamental landscape is you have logic and you have memory. And logic is improving really fast. Memory is not improving quite as fast because of that dilution of focus. But both logic and memory have to come together on a high performance AI chip. And basically, the rest of the story is going to unfold with those key ingredients in mind.
So I don't know, maybe that's a good tee-up for the next step here. Yeah, and I can add a little bit, I think, on that point. It's very true: if you just go look at RAM capacity over the years, it has grown very fast, but not quite as fast as Moore's Law. And one of the, I guess, fine points of memory is that it's also more complex.
Well, I guess CPUs are also complex now, you parallelize, but memory is similarly complex, where for various reasons you don't just make the memory faster, you can have smarter memory. So you introduce caching, where, if some data is something you use a lot, you have a faster memory, that's smaller, that you utilize to
cache important information so you can get it faster. So you have these layers of memory that have different speeds, different sizes, right? And now you get to GPUs, which need absurd amounts of memory. So on CPUs, right, we have RAM, which is random access memory, which is kind of like the fast memory that you can use, and that's usually eight gigabytes, 16 gigabytes.
A lot of what your OS is in charge of is getting stuff from storage, from your hard drive, to RAM, to then compute on, and then it gets into the cache when you do computations. Well, for neural nets, you really don't want to store anything that is not in RAM, and you want as much as possible to be in cache.
So I don't know the exact details, but I do know that a lot of the engineering that goes into GPUs is those kinds of caching strategies, a lot of optimizations; in transformers, it's about key-value caching. And you have just ridiculous numbers on the RAM side of GPUs that you would never see on your CPU, your laptop, where it's usually just 8, 16, 32 gigabytes or something like that.
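For a sense of why transformer serving is so memory-hungry, here is a rough back-of-envelope for key-value cache size. The model shape below is a hypothetical, roughly GPT-3-scale configuration chosen only for illustration, not the published architecture of any particular model.

```python
# Rough KV-cache size: two tensors (keys and values) per layer, each of shape
# [batch, seq_len, num_heads * head_dim], stored at some precision.
# All numbers below are illustrative, not a real deployment config.

num_layers   = 96
num_heads    = 96
head_dim     = 128
seq_len      = 4096
batch_size   = 8
bytes_per_el = 2          # fp16 / bf16

kv_bytes = 2 * num_layers * batch_size * seq_len * num_heads * head_dim * bytes_per_el
print(f"KV cache: ~{kv_bytes / 1e9:.0f} GB")   # on top of the model weights themselves
```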
Yeah, absolutely. And actually, I think you introduced an element there that really helps us move towards the next step of the conversation, which is: what happens on the floor of a data center? What does the data center floor look like? The reason is that when you think about computing, the image to have in your mind is hierarchy. It
is a cascading series of operations, increasingly complex and increasingly close to the bare silicon. So think about it this way, heading into a data center, right? You have just, like, a gigantic amount of really, really high voltage, right? Power lines that are coming in.
Now, on the chip itself, you're dealing at roughly the electron level; you're dealing with extraordinarily tiny voltages, extraordinarily tiny currents and all that stuff. To get that energy, those electrons, those photons in the middle, to do all that good work for you, you have to do a lot of gradual step-downs, gradually bringing the memory, bringing the power, bringing the logic all closer and closer
to the place where, at the atomic level almost, the actual drama can unfold that we're all after, right? The number crunching the arithmetic that actually trains the models and does inference on them. So when we think about that hierarchy, I'll identify just a couple of levels of memory for us to keep in mind.
So this just starts to kind of fold in some of these layers that we can think about as we go. But one of the higher levels of memory is sort of like flash memory, right? So this could be like your solid state drives or whatever. This is very, very slow memory, but
it will continue to work even if your power goes out. So it's this persistent memory. It's slow moving, but it's the kind of thing where if you wanted to store a data set or some interesting model checkpoints that come about fairly infrequently, you might think about putting them in flash memory. This is a very slow long-term thing.
You might imagine, okay, well, now I also need memory that's going to get updated. For example, like, I don't know, every time there's a batch of data that comes in, you know, and batches of data are coming in constantly, constantly, constantly. So, okay, well, then maybe that's your high bandwidth memory, right? So this is,
again, closer to the chip, because we're always getting closer to the chip physically as we're getting closer to the interesting operations, the interesting math. So now you've got your HBM. We'll talk about where exactly your HBM sits, but it's really close to where the computations happen. It uses a technology called DRAM, which we can talk about, and actually should.
And anyway, it requires periodic refreshing to maintain data. So if you don't keep kind of updating each bit, because it stores each bit as a charge and a tiny capacitor, and because of a bunch of physical effects like leakage of current, that charge gradually drains away. So if you don't intervene, the stored data can be lost within milliseconds. So you have to keep refreshing, keep refreshing. It's much lower latency than your flash memory.
So in other words, way, way faster to pull data from it. That's critical, because again, you're pulling those batches. They're coming in pretty hot, right? And so usually that's on the order of tens of nanoseconds.
And so every, you know, tens of nanoseconds, you pull some data off the HBM. Now, even closer to where the computations happen, you're going to have SRAM. So SRAM is your fastest, with ridiculous sub-nanosecond access times, and it's very, very expensive as well. So you can think of this as an expense hierarchy too: as we get closer to where those computations happen, oh, we've got to get really, really small components, very custom designed, very purpose built,
and very expensive, right? So there's this kind of consistent hierarchy, typically, of size, of expense, of latency, all these things, as we get closer and closer to the kind of leaves on our tree, to those end nodes where we're going to do the interesting operations in data centers and chips.
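Pulling the levels just described into one place, roughly (the figures are order-of-magnitude ballparks, not datasheet values for any particular part):

```python
# Order-of-magnitude memory hierarchy, as discussed above.
# All figures are rough, illustrative ballparks rather than datasheet values.
memory_hierarchy = [
    # (level,      typical access time,   typical capacity,     role)
    ("SRAM",       "sub-nanosecond",      "tens of MB on-die",  "closest to the compute units, most expensive per byte"),
    ("HBM (DRAM)", "tens of nanoseconds", "tens of GB",         "weights and activations stream from here; needs refresh"),
    ("Flash/SSD",  "microseconds and up", "terabytes",          "persistent storage for datasets and checkpoints"),
]

for level, access, capacity, role in memory_hierarchy:
    print(f"{level:<11} ~{access:<20} {capacity:<18} {role}")
```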
These are all fractal structures in that sense. Really, to think about computing, you've got to think about fractals. It's fractals all the way down. You go from one trunk to branches to smaller branches, smaller branches, just like our circulatory system, just like basically all complex structures. And if that picture clicks for you,
you'll be nodding along, right? This is what it's about. The world works in fractals in this way, higher and higher resolution at the nodes, but you do want to benefit from big tree trunks, big arteries that can just have high capacity in your system.
A little fun fact: I know probably a lot of people still do this, and certainly as a grad student in the late 2010s, a big part of what you were doing was literally just fitting a neural net on the GPU. You're like, oh, I have this GPU with eight gigabytes of memory, or 16 gigabytes, so I'm going to run nvidia-smi and figure out how much
memory is available on it, and I'm going to run my code, and it's going to load up the model into the GPU, and that's how I'm going to do my training. And so for a long while, that was kind of the paradigm: you had one GPU, one model, and you tried to fit the model into the GPU memory. That was it. Of course, now
that doesn't work. The models are far too big for a single GPU, especially during training, when you have to do backprop, back-propagation, deal with gradients, et cetera. During inference, people do try to scale them down with quantization and often fit them in a single GPU. But why do you need these massive data centers? Because you want to pack a whole bunch of GPUs or TPUs all together.
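As a rough illustration of why one GPU stopped being enough for training, here's a common back-of-envelope for model-state memory. The 16-bytes-per-parameter figure is a rough mixed-precision Adam heuristic, and the 80 GB device size is just a typical accelerator capacity used for illustration.

```python
import math

# Rough training memory per parameter under mixed precision with Adam:
#   ~2 bytes fp16 weights + 2 bytes fp16 grads + 4 bytes fp32 master weights
#   + 8 bytes Adam moments  ~= 16 bytes/param (heuristic, ignores activations).

def training_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / 1e9

GPU_MEMORY_GB = 80   # illustrative single-accelerator capacity

for n in (1e9, 10e9, 70e9, 175e9):
    need = training_memory_gb(n)
    gpus = max(1, math.ceil(need / GPU_MEMORY_GB))
    print(f"{n / 1e9:6.0f}B params -> ~{need:,.0f} GB of model state, "
          f"roughly {gpus}+ GPUs before activations")
```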
We have TPU pods from Google going back quite a while to 2018, I think, when we had 256 TPUs. And so you can now distribute your neural net across a lot of chips. And now it gets even crazier because the memory isn't just about loading in the weights of
the model into a single GPU. You need to, like, transfer information about the gradients on some weights, and do some crazy complicated orchestration just to update your weights throughout the neural net. And I really have no idea how that works. Well, and part of that we can get into, for sure. I think, to touch on that, and just to make this connect, by the way, to some of the stuff we've been seeing happen recently with reasoning models, and the implications
for the design of data centers and compute. This stuff really does tie in too, right? So I'll circle back to this observation, right? That memory, like HBM, high bandwidth memory in particular, has been improving more slowly than logic, right? Than the ability to just number crunch, right? So our ability to fetch data from memory, and the bandwidth and all that, has been improving more slowly than our ability to crunch the numbers.
One interesting consequence of this is that you might expect these reasoning models that make use of more inference time compute to actually end up disproportionately running better on older chips. And so I just want to explain and unpack that a little bit. So if you have just during inference, you have to load a language model into active memory from HBM,
and your batch sizes, your data that you're feeding in, those batch sizes will tend to be pretty small. And the reason they tend to be pretty small at inference time is that you can imagine like you're getting these bursts of user data that are unpredictable. And all you know is you better send a response really quickly or it'll start to affect the user experience. So you can't afford to sit there and wait for a whole bunch of user queries to come in and then batch them, which is what's typically done, right? The idea with high bandwidth memory
is you want to be able to batch a whole bunch of data together and amortize the delay, the latency that comes from loading that memory from the high bandwidth memory, amortize it across a whole bunch of batches. So sure logic is sitting there waiting for the data to come in for a little while. But when it comes in, it's this huge batch of data. So it's like, OK, that was worth the wait. The problem is that when you have inference happening,
You can't, again, you gotta send responses quickly. So you can't wait too long to create really big batches. You've gotta kind of, well, get away with smaller batches. And as a result, your memory bandwidth isn't going to be consumed by
the kind of user-data-induced traffic, right? Like, you're getting relatively small amounts of your user data in. Your memory bandwidth is disproportionately consumed by just, like, the model itself. And so you have this high base cost associated with loading your model in.
And because the batch size is smaller, you don't need as much logic to run all those computations. You have maybe eight user queries instead of 64. So that's relatively easy on the flops. So you don't need as much hard compute. You don't need as much logic. What you really need, though, is that baseline high memory requirement, because your model's so big anyway. So even though your user queries are not very numerous, your model's big. So you have a high baseline need for HBM, but a relatively low need for flops.
Because memory improves more slowly than flops, this means you can step back a generation of compute: you're going to lose a lot of flops, but your memory is going to be about the same. And since this workload is disproportionately memory intensive rather than compute intensive, inference tends to favor older machines.
It's a bit of a layered thing, and it's okay if you didn't follow that whole thing. But if you're interested in this, you may want to listen back to that, or ask us questions about it. I think this is actually one of the really important trends that we're going to start to see: older hardware being useful for inference-time compute. Big, big advantage to China, by the way, because they only have older hardware. So this whole pivot to reasoning and inference-time compute is actually a really interesting advantage for the Chinese ecosystem.
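To make the "small-batch inference is memory-bound" argument concrete, here's a simplified roofline-style estimate for one decoding step. The hardware and model numbers are rough, illustrative figures, and the model ignores KV-cache traffic and any overlap of compute with memory transfers.

```python
# Simplified decode-step estimate for a dense transformer:
#   FLOPs  ~ 2 * n_params * batch_size     (each weight enters one multiply-add per sequence)
#   bytes  ~ n_params * bytes_per_weight   (weights streamed from HBM once per step)
# All hardware figures are illustrative ballparks, not official specs.

N_PARAMS      = 70e9        # hypothetical 70B-parameter model
BYTES_PER_W   = 2           # fp16 weights
PEAK_FLOPS    = 1e15        # ~1 PFLOP/s of matmul throughput
HBM_BANDWIDTH = 3e12        # ~3 TB/s of memory bandwidth

def step_times(batch_size: int):
    t_compute = 2 * N_PARAMS * batch_size / PEAK_FLOPS
    t_memory = N_PARAMS * BYTES_PER_W / HBM_BANDWIDTH
    return t_compute, t_memory

for b in (1, 8, 64, 512):
    tc, tm = step_times(b)
    bound = "memory-bound" if tm > tc else "compute-bound"
    print(f"batch {b:3d}: compute {tc * 1e3:6.2f} ms, weight streaming {tm * 1e3:6.2f} ms -> {bound}")

# At small batch sizes the fixed cost of streaming the weights dominates, so the
# extra FLOPs of a newer chip buy you little; that's the memory wall at inference time.
```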
And yeah, I think that brings up another interesting tangent, a pretty quick tangent, we'll try to get into it. So you brought up batches of data, and that's another relevant detail: you're not just loading the model into GPUs, you are also loading in
what are called batches of data. And what that means is, right, you have data sets, and data sets are pairs of inputs and outputs. And when you train a neural net, and when you do inference on it as well, instead of just doing one input, one output, you do a whole bunch together. So you have n inputs and outputs.
And that is essential, because when training a neural net, you could try to do just one example at a time, but an individual example isn't very useful, right? Because you can update your weights for it, but then the very next example might be the opposite class, so you would just not be finding the right path. And then it's also not very feasible
to train on the entire data set, right? You can't feed in the entire data set and compute the average across all the inputs and outputs, because that's going to be, (a) probably not possible, (b) probably not very good for learning. So one of the sort of key miracles, almost mathematically surprising things, is that stochastic gradient descent, where you take batches of data, you take, you know, 25,
50, 256, whatever inputs and outputs, turns out to just work really well. And theoretically, you should be taking the entire data set, right? That's what gradient descent should be doing. Stochastic gradient descent, where you take batches, turns out to probably be a good regularizer that actually improves generalization instead of overfitting.
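As a minimal sketch of the idea just described, here's mini-batch stochastic gradient descent on a toy linear-regression problem with made-up synthetic data:

```python
import numpy as np

# Toy mini-batch SGD on linear regression: at each step we sample a small
# batch, compute the gradient on just that batch, and update the weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))                 # synthetic inputs
true_w = rng.normal(size=32)
y = X @ true_w + 0.1 * rng.normal(size=10_000)    # synthetic targets

w = np.zeros(32)
lr, batch_size = 0.1, 256

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size     # gradient on the batch only
    w -= lr * grad                                   # SGD update

print("error vs true weights:", np.linalg.norm(w - true_w))
```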
But anyway, one of the other things with OpenAI that was a little bit novel is massive batch size. So as you increase the batch, that increases the amount of memory you need on your GPU.
So the batch sizes used during training were typically relatively small, like 128, 256. Now, the bigger the batch, the faster you could train and the better the performance could be, but typically you just couldn't get away with very big batches. And OpenAI, I still remember, was one of the early organizations getting into, like, 2,000-
example batches or something like that. And then I think one of the realizations that happened with very large models is that, especially during training, massive batches are very helpful. And so that was another reason that memory is important.
One of the crazy advantages that OpenAI enjoys, and that anyone with really good distribution in this space enjoys, is the distribution of their products. I mean, if you've got a whole bunch of users, you've got all these queries coming in at very, very high rates, which then allows you to do bigger batches at inference time,
right? Because you may tell yourself, well, look, I've got to send a response to my users within, I don't know, like 500 milliseconds or something like that, right? And so basically what that says is, okay, you have 500 milliseconds that you can wait to collect inputs, to collect prompts from your users, and then you've got to process them all at once. Well, the
number of users that you have at any given time is going to allow you to fill up those batches really nicely, if that number is large. And that allows you to amortize the cost; you're getting more use out of your GPUs by doing that. This is one of the reasons why some of the smaller companies serving these models are at a real disadvantage. They're often serving them, by the way, at a loss, because they just can't hit the large batch sizes that they need to amortize the cost of their hardware and energy
to be able to turn a profit. And so a lot of the VC dollars you're seeing burned right now in the space are being burned specifically because of this low-batch-size phenomenon, at least at inference time. On that point, in case it's not clear, or maybe some people don't know, a batch, the way it works,
is, yes, you're doing n inputs and n outputs, but you're doing all of these in parallel, right? You're giving all the inputs together and you're getting all the outputs together. So that's why it's kind of filling up your GPU. And that is one of the essential metrics, GPU utilization rate. If you do one example at a time, that takes up less memory, but then you're wasting time, right? Because you need to do them one at a time, versus if you give it as many examples as your GPU can handle,
you get those outputs all together, and you're utilizing your GPU 100% and getting the most use out of it. Yeah, and this ties into this dance between model architecture and hardware architecture, right? Like CPUs, CPUs tend to have a handful of cores, right? The cores are the things that actually do the computations. They're super, super fast cores and they're super flexible,
but they're not very numerous. Whereas GPUs can have thousands of cores, but each individual core is very slow. And so what that sets up is a situation where if you have a very parallelizable task where you can split it up into a thousand or 4,000 or 16,000 little tasks that each core can handle in parallel,
It's fine if each core is relatively slow compared to CPU. If they're all chugging away at those numbers at once, then they can pump out thousands and thousands of these operations in the time that a CPU core might do 20 or whatever. It is slower on a per-core basis, but you have so many cores, you can amortize that and just go way, way faster.
and that is at the core of what makes AI today work. It's the fact that it's so crazy parallelizable. You can take a neural network and you can chunk it up in any number of ways. Like you could, for example, feed it a whole bunch of prompts at the same time. That's called data parallelism. Actually, that's more like you send some chunks of data over to one,
one set of GPUs and another chunk to another set. So essentially you're parallelizing the processing of that data.
You can also take your neural networks, you can slice them up layer-wise. So you can say, layers 0 to 4, they're going to sit on these GPUs, layers 5 to 8 will sit on these GPUs, and so on. That's called pipeline parallelism. So each stage of your model pipeline, you're kind of imagining chopping your model up length-wise and farming out the different chunks of your model to different GPUs.
And then there's even tensor parallelism. And this is within a particular layer. You imagine chopping that layer in half and having a GPU chew on, or process, data that's only going through just that part of the model. And so these three kinds of parallelism, data parallelism, pipeline parallelism, and tensor parallelism, are all used together in overlapping ways
in modern high performance AI data centers in these big training runs. And they play out at the hardware level. So you can actually see, like, you'll have, you know,
Data centers with chunks of GPUs that are all seeing one chunk of the data set. And then within those GPUs, one subset of them will be specialized in a couple of layers of the model through pipeline parallelism. And then a specific GPU within that set of GPUs will be doing a specific part.
of a layer or a couple of layers through tensor parallelism. And that's how you really split this model up across as many different machines as you can, to benefit from the massive parallelism that comes from this stuff. Right. And by the way, I guess just another fun detail: why did graphics processing units turn out to be really good for AI? Well, it all came down to matrix multiplications. It's all just a bunch of numbers. You have one
vector, one set of numbers, and you need to multiply it by another vector and get the output. That's your typical unit, right? You have n connections and inputs going into one activation unit, and when you stack those units across two layers, you wind up with a matrix multiplication, and so on. So anyway, it turns out that to do 3D computations, that's also a bunch of math, also a bunch of matrices that you multiply
to be able to get your rendering to happen. And it turns out that you can do matrix multiplications very well by parallelizing over a thousand cores, versus if you have some kind of long equation where you need to do every step one at a time, that's going to be better suited to a CPU. So yeah, basically 3D rendering is a bunch of linear algebra,
neural nets are a bunch of linear algebra. So it turns out that you can then do the linear algebra from the graphics also for neural nets. And that's why that turned out to be such a good fit. And now with tensor processing units, tensor is like a matrix, but with more dimensions, right? So you do even more linear algebra. That's what it all boils down to.
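To make that concrete, here's a toy, single-machine sketch of a layer as a matrix multiply and of what data, pipeline, and tensor parallelism each split. The "stages" standing in for devices are just local functions here, an assumption for illustration; real training runs use frameworks such as Megatron-LM or PyTorch FSDP across many GPUs.

```python
import torch

batch, d = 8, 16
x = torch.randn(batch, d)                      # a batch of inputs
w1, w2 = torch.randn(d, d), torch.randn(d, d)  # two "layers" as weight matrices

# A layer is a matrix multiply (plus a nonlinearity).
def stage1(t): return torch.relu(t @ w1)       # pipeline stage 1 (would live on GPU set A)
def stage2(t): return t @ w2                   # pipeline stage 2 (would live on GPU set B)

# Data parallelism: different replicas process different slices of the batch.
shard_a, shard_b = x[:4], x[4:]

# Tensor parallelism: split one layer's weights column-wise; each "device"
# computes half the output features, then the halves are concatenated.
w1_left, w1_right = w1[:, : d // 2], w1[:, d // 2 :]
tp_out = torch.cat([x @ w1_left, x @ w1_right], dim=1)
assert torch.allclose(tp_out, x @ w1, atol=1e-5)  # same result as the unsplit layer

full_output = stage2(stage1(x))                # end to end, the math is unchanged
```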
Excellent summary. This is a good time, now we've got some of the basics in place, to look at the data center floor and some current and emerging AI hardware systems that are going to be used for the next beat of scale. I'm thinking here in particular of the GB200. SemiAnalysis has a great breakdown of how the GB200 is set up; I'm pulling heavily from that in this section here, with just some added
stuff thrown in just for context and depth. But I do recommend SemiAnalysis, by the way. SemiAnalysis is great. One of the challenges with it is that it's highly technical. So I've found, I've recommended it to a lot of people, and sometimes they'll read it and they'll be like, I can tell this is what I need to know, but
it's really hard to get below and understand deeply what they're getting at. Hopefully, this episode will be helpful in doing that. Certainly, whenever we cover stories that SemiAnalysis has covered, I try to do a lot of translation, at least when we're at the sharing stage there. But just be warned, I guess, it's a pretty expensive newsletter and it does go into technical depth. They've got some free stuff as well that you should definitely check out if you're interested in that sort of thing.
I've got to preempt this in case anyone wants to correct me and say it's not just linear algebra, because you have non-linear activations, famously, and those are required. Yeah, that's also in there, and that's not exactly linear algebra. You have functions that aren't just matrix multiplications of other values, and with modern activations, you kind of try to get away from that as much as possible, but there's always some nonlinearity in there.
I don't want to be factually incorrect, so just FYI, that's not what I mean. Well, and actually, mathematically, the fun fact there is that if you didn't have that nonlinearity, right, then multiplying just a bunch of matrices together would be equivalent, from the linear algebra standpoint, to having just one matrix, so you could replace the whole network with a single matrix.
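A quick numerical check of that fun fact (the matrix sizes are arbitrary, purely for illustration):

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(4, 4), torch.randn(4, 4)
x = torch.randn(4)

# Two purely linear layers collapse into the single matrix W2 @ W1.
print(torch.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x, atol=1e-6))   # True

# With a ReLU in between, no single matrix reproduces the computation.
print(torch.allclose(W2 @ torch.relu(W1 @ x), (W2 @ W1) @ x))    # False (almost surely)
```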
OK, so let's step onto the data center floor. Let's talk about the GB200. Why the GB200? Well, number one, the H100 has been around for a while. We will talk about it a little bit later. But the GB200 is the next beat, and more and more of the future is oriented in that direction. So I think it is really worth looking at. And this is announced and not yet out from NVIDIA, is that right? Or is it already being sold? I believe it's already being sold, but it's only just started. So this is, yeah.
It's the latest and greatest in GPU technology, basically. That's it. It's got that new GPU smell. So, first thing we have to clarify, right? You'll see a lot of articles that'll say something about the B200, and then you'll see other articles that say stuff about the GB200 and the DGX.
The B200 DGX, all these things, what the fuck are these things? The first thing I want to call out is there is a thing called a B200 GPU. That is a GPU. The GPU is a very specific piece of hardware.
that is like the, let's say, a component that is going to do the interesting computations that we care about fundamentally at the silicon level. But a GPU on its own is, oh man, what's a good analogy? I mean, it's like a
It's like a really dumb jacked guy who can lift anything you want him to lift, but you have to tell him what to lift, because he's a dumb guy. He's just jacked. So the B200 on its own needs something to tell it what to do. It needs a conductor, right? It needs a CPU. At least that's usually how things work here.
And so there's the B200 GPU, yes, wonderful. But if you're actually going to put it in a server rack, in a data center, you best hope that you have it paired to a CPU that can help tell it what to work on and orchestrate its activity.
even better if you can have two GPUs next to each other and a CPU between the two of them helping them to coordinate a little bit, right? Helping them do a little dance. That's good. Now your CPU, by the way, is also going to need its own memory. And so you have to imagine there's memory for that. All that good stuff. But fundamentally, we have a CPU and two GPUs.
on this little kind of motherboard, right? Yeah, that's like you have two jacked guys and you're moving apartments, and then you have a supervisor. You know what? We're getting there. We're getting there, right? Increasingly, we're going to start to replicate just what the Roman army looked like. You have some colonel, and then you've got the strong soldiers or whatever, and the colonel's telling them, I don't know. And then there's somebody telling the colonel, I don't know.
Yeah, you got a CPU on this motherboard and you got these two B200 GPUs. In the case of the... So, okay, these are the kind of atomic ingredients for now that we'll talk about. Now, that is sitting on a motherboard, all right? A motherboard, you can imagine it as like one big rectangle. And we're going to put two rectangles together, two motherboards together. Each of them has one CPU and two B200 GPUs.
Together, that's four GPUs, that's two CPUs. Together, that's called a GB200 tray. Each one of those things is called a Bianca board. So a Bianca board is one CPU, two GPUs. You put two Bianca boards together, you get a tray that's going to slot into one slot in a rack, in a server, in a data center. So that's basically what it looks like.
Out the front, you can see a bunch of special connectors for each GPU that will actually allow those GPUs to connect to other GPUs in that same server rack, let's say, or very locally in their immediate environment, through these things called NVLink cables. Basically, these are special NVIDIA copper cables. There are alternatives too, but this is kind of like an industry standard one. And so this together is, you can think of it as like one really tightly interconnected
set of GPUs, right? So why copper? The copper interconnect, and this also goes through a special switch called an NV switch that helps to mediate the connections between these GPUs. But the bottom line is you just have these GPUs really tightly connected to each other through copper interconnects. And the reason you want copper interconnects is that they're crazy efficient at getting data around those GPUs. Very expensive, by the way.
but very efficient too. And so this kind of bundle of compute is going to handle, basically, your typically highest bandwidth requirement work,
like tensor parallelism. This is basically the thing that requires the most frequent communication between GPUs, so you're going to do it over your most expensive interconnect. And so the more expensive the interconnect, roughly speaking, the more tightly bound these GPUs are together in a little local pod,
the more you want to use them for applications that will require frequent communication. So tensor parallelism is that because you're basically taking like a layer, a couple layers of your neural network, you're chopping them up. But in order to get a coherent output, you need to kind of recombine that data because one chunk of one layer doesn't do much for you. So they need to constantly be talking to each other really, really fast because otherwise,
It would just be a bunch of garbage. They need to be very coherent. At higher levels of abstraction, for pipeline parallelism, where you're talking about whole layers of your neural network, and one pod might be working on one set of layers, and another pod might be working on another set of layers.
For pipeline parallelism, you're going to need to communicate, but it can be a little bit slower, right? Because you're not talking about chunks of a layer that, just to be remotely coherent, need to constantly come back together to form one layer. With pipeline parallelism, you're talking about coherent whole layers. So this can happen a little bit slower. You can use interconnects like PCIe as one possibility,
or even go between different nodes over network fabric, over InfiniBand, which is another, slower form of networking. The pod, though, is the basic unit of pipeline parallelism that's often used here. This is called the back end network.
So tensor parallelism, this idea, again, of slicing up just parts of a layer, happens within, say, one server rack that's all connected through NVLink connectors, super, super efficient. That's usually called accelerator interconnect, right? So that's the very local interconnect through NVLink. Pipeline parallelism
was this slightly slower thing, different layers communicating with each other. That is usually called the back end network in the data center. So you've got accelerator interconnect for the really, really fast stuff. You've got the back end network for the somewhat slower stuff. And then typically at the level of the whole data center, when you're doing data parallelism, you're sending a whole chunk of your data over to this bit, a whole chunk over that bit.
You're going to send your user queries in and they're going to get divided that way. That's the front end network. So you've got your front end for your slower and,
let's say, typically, actually, less expensive hardware too, because you're not going as fast. You've got your back end, which is faster; that's InfiniBand. And now you're moving things typically between layers. And this can vary, but I'm trying to be concrete here. And then you've got your fastest thing, which is accelerator interconnect, even faster than the back end network, with all the activity that's buzzing around there.
That's one way to set up a data center. You're always going to find some kind of hierarchy like this, and the particular instantiation can vary a lot, but this is often how it's done. You're in the business. If you're designing hardware, designing models, you're in the business of saying, okay, how can I architect my model such that I can chop it up?
to have a little bit of my model on one GPU here, on this GPU there, such that I can chop up my layers in this way that makes maximal use of my hardware. There is this kind of dance where you're doing very hardware-aware algorithm architecting, especially these days, because the main rate-limiting thing for you is, how do I get more out of my compute?
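To summarize the hierarchy just described, here's a compact sketch; the mapping is the typical one laid out above, while the wording of each entry is a paraphrase rather than vendor terminology:

```python
# Typical mapping of parallelism scheme to network tier, as described above.
hierarchy = [
    ("tensor parallelism",   "accelerator interconnect (NVLink / NVSwitch, copper)",
     "chunks of individual layers, chattiest traffic"),
    ("pipeline parallelism", "back end network (e.g. InfiniBand fabric)",
     "activations handed between whole groups of layers"),
    ("data parallelism",     "front end network (cheaper, slower gear)",
     "different data shards in, periodic synchronization"),
]
for scheme, tier, traffic in hierarchy:
    print(f"{scheme:22s} -> {tier}")
    print(f"{'':25s}{traffic}")
```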
Right. And I think that's another big aspect of TPUs and Google, right? Google was the thing that OpenAI worried about, partially because of TPUs, but also in big part because they had expertise in data centers. That was part of the reason Google was out ahead.
They were really good at data center creation, and they were early to the game. So they not only made TPUs, tensor processing units, they pretty quickly afterwards also worked on TPU pods, where you combine 256, even a couple thousand, TPUs together, presumably with that sort of memory optimization you're talking about, to have much larger neural nets, much faster processing, et cetera.
Actually, that's a great point. There's this interesting notion of what counts as a coherent blob of compute. The real way to think about this is in terms of the latency or the timeline on which activities are unfolding at the level of that blob. What is a coherent blob of compute for tensor parallelism? Well, it's got to be really, really fast, because these computations are really quick, really efficient, but then you've got to move on really quick.
And so one of the things that Google has done really well is that these pods can actually coherently link together very large numbers of chips. And you're talking, in some cases, about like hundreds of these. I think 256 is for TPUV4, like one of the standard configurations. But one of the key things to highlight here, by the way, is there is now a difference between
the GPU, which is the B200, and the system, the GB200, the system in which it's embedded. So the GB200, by definition, is this thing that has a CPU and two GPUs on a tray, along with a bunch of other ancillary stuff.
And that's your Bianca board. And there's another Bianca board right next to it. And together, that's one GB200 tray. So we are talking about GPUs. The basic idea behind the GB200 is to make those GPUs do useful work, but that requires a whole bunch of ancillary infrastructure that isn't just that B200 GPU.
And so the packaging together of those components of the B200 GPU and the CPU and all those ancillary things, that's done by companies, for example, like Foxconn that put together the servers. Once NVIDIA finishes shipping out the GPUs, somebody's got to assemble these and NVIDIA can do some of this themselves. But companies like Foxconn can step in and we covered a story, I think, with Foxconn looking at a factory in Mexico to do this sort of thing.
So they're actually building the supercomputer in a sense, like putting all these things together into servers and farming them out. There are different layers of that stack that are done by Foxconn and different ones by NVIDIA. But fundamentally, I just want to kind of differentiate between the GB200 system and the B200 GPU. The GB200 system also can exist in different configurations. So you can imagine a setup where you have one rack and it's got, say, 32 B200 GPUs,
and they're all tightly connected, or you could have a version where you've got 72 of them. Often, what will determine that is how much power density you can actually supply to your server racks. And if you just don't have the power infrastructure or the cooling infrastructure to keep those racks humming, then you're forced to take a hit and literally put less compute capacity in a given rack. That's one of the classic trade-offs that you face when you're designing a data center.
Yeah, and I think I have to give a shout out, in case people don't have a background: another major aspect of data center design and construction is the cooling, because when you have a billion chips, or whatever, computing, the way semiconductors work is that you're pushing electricity through them and using some energy, which produces heat.
And when you're doing a ton of computation, like with GPUs, you get a lot of heat. You can actually warm up a bit if you really use your GPU well. So when you get to these racks, where you really try to concentrate a ton of compute all together, you get into advanced cooling, like liquid cooling. And that's why data centers consume water, for instance; if you look at the climate impacts of
AI, they often do cite water usage as one of the metrics. That's why you care about where you put your data center in terms of climate. And presumably, that's a big part of the engineering of these systems as well. Absolutely. And in fact, that's what the H100 series of chips is, well, one of the things it's somewhat famous for: being the first chip that has a liquid-cooled configuration. The Blackwells all need liquid cooling, right? So, this next generation of infrastructure, for the B200 and so on.
You're going to have to have liquid cooling integrated into your data center. It's just a fact of life now because these things put off so much heat because they consume so much power. There's sort of an irreducible relationship between computation and power dissipation. Absolutely. So these two things are profoundly linked.
I think now it might make sense to double click on the B200, just the GPU. So we're not talking about the Grace CPU that sits on the Bianca motherboard and helps orchestrate things, all that jazz. Specifically the B200 GPU, or just, let's say, the GPU in general.
I think it's worth double clicking on that. What are the components of it? That'll start moving us into the fab, the packaging story. Where does TSMC come in and introducing some of the main players? Does that make sense? Yeah, I think so. We're looking at the GPU.
And right off the bat, two components that are gonna matter. This is gonna come up again, right? So we have our logic and we have our memory. The two basic things that you need to do useful shit in AI, right? So, okay, what is, let's start with the memory, right? Because we've already talked about memory, right? You care about what is the latency? What is the capacity? What is the bandwidth of this memory? Well, we're going to use this thing called high bandwidth memory, right?
And that's going to sit on our GPU. We're going to have stacks of high bandwidth memory, stacks of HBM. And you can think of these as, basically, roughly speaking: one layer of the stack is like a grid that contains a whole bunch of capacitors that each store some information. And you want to be able to pull numbers off that grid really efficiently.
Now, historically, those layers, by the way, are DRAM. DRAM is a form of memory that goes way, way back, but the innovation with HBM is stacking those layers of DRAM together and then connecting them by putting these things called through-silicon vias, or TSVs, all the way through those stacks.
And TSVs are important because they basically allow you to simultaneously pull data from all these layers at the same time, hence the massive bandwidth. You can get a lot of throughput of data through your system because you're basically drawing down from all of those layers in your stack at once. So, many layers of DRAM. And you'll see eight-layer versions, 12-layer versions; the latest versions have like 12 layers. The companies, by the way, that manufacture HBM
are different from the companies that manufacture the logic that sits on the chip. So the memory companies, the HBM companies, you're thinking here basically the only two that matter are SK Hynix in South Korea and Samsung also in South Korea. There is Micron but they're in the US and they kind of suck. They have like none of the market right now.
But yeah, so fundamentally, when you're looking at, you know, NVIDIA GPUs, you're going to have, you know, HBM stacks from, say, SK Hynix. And they're just really good at pulling out massive amounts of data. The latency is, you know, not great, but you'll pull down massive amounts of data at the same time and feed them into your logic die.
Right? Your main GPU die, or your compute die, people use all these terms kind of interchangeably, but that refers to the logic part of your GPU that's actually going to do the computation. Now, for the H100, this is sometimes known as the GH100, but this is fundamentally the place where the magic happens. So you're pulling into the logic die this data from the HBM, in massive quantities, all at once.
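To see why the "wide interface" point matters, here's a rough bandwidth calculation; the figures are ballpark HBM3-class numbers used as assumptions, not a specific product's datasheet:

```python
# Bandwidth of stacked DRAM comes from a very wide bus, not from low latency.
bus_width_bits = 1024       # per HBM stack (assumed, HBM3-class)
data_rate_gbps = 6.4        # per pin (assumed)
stacks_on_package = 8       # assumed count for a large training GPU

per_stack_gbytes = bus_width_bits * data_rate_gbps / 8
print(f"per stack: ~{per_stack_gbytes:.0f} GB/s")
print(f"package total: ~{per_stack_gbytes * stacks_on_package / 1000:.1f} TB/s")
```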
One thing to recognize about the difference between HBM and the kind of main GPU die, the process to fabricate these things is very different. So you need a very different set of expertise to make HBM high bandwidth memory versus to make a really good logic die.
And this means that the fabs, the manufacturing facilities that actually build these things, are different. So SK Hynix might do your HBM, but TSMC is almost certainly going to do your logic die. And there are process reasons; part of it is also the effective resolution.
So logic dies are these very irregular structures, right? We talked about how high bandwidth memory is this, you know, these like stacked grids basically. They're very regular and as a result,
a couple of things, like, you don't need as high a resolution in your fabrication process. So you'll typically see people use, like, 10 to 14 nanometer processes to do, like, HBM3, for example. But if you're looking at logic, for the logic die, you're building transistors that are these kind of weird, irregular structures that are extremely bespoke and all that. And as a result, you need a much, much higher grade process, typically four to five nanometer processes.
That doesn't mean that TSMC could just turn around and do this themselves. TSMC is usually the one doing all the kind of truly leading edge processes, but they can't really turn around and just make HBM very easily. Again, different set of core competencies. And so what has to happen is you're going to source your HBM from one company, you're going to source your logic from another, and now you need to make them dance together. Somehow you need to include both the logic and the memory on the same chip. And for that, nowadays, the solution people have turned to is to use an interposer.
So an interposer is a structure that the logic and the memory and a couple other components too are going to sit on. And the interposer essentially allows you to connect, like say from the bottom of the HBM to the bottom of the logic to create these kind of like chip level connections that link your different, well, your different chips, sorry, not chips, but your different components together.
And this process of doing this is called packaging. Now, TSMC famously has this CoWoS packaging process. There are two kinds of CoWoS, CoWoS-S and CoWoS-L; the details we don't have time to get into, but they are kind of fascinating. The bottom line is that this is just a way of, number one, linking together your memory die and your main GPU die, your logic die.
But also, an interesting thing that happens is, as you move down the package, the resolution of the interconnects gets lower and lower. Things get coarser and coarser, bigger and bigger. And what you're trying to do is, at the chip level, you've got crazy high resolution
connections happening. Your pitch size, as that sort of resolution of the structure is sometimes called, is really, really fine, really, really small, and you want to actually deliberately decrease that as quickly as you can, because it allows you to have thicker wires, which are, you know, better, more efficient from a power delivery standpoint, and
make it possible for you to use kind of like more antiquated fabrication processes and all that stuff. As quickly as possible, you want to get away from things that require you to use really, really advanced processes and things like that. So this is basically the landscape. You've got a bunch of stacked dram. In other words, high bandwidth memory.
Those stacks of memory sitting next to a GPU die, a logic die, that's actually going to do the computations. And those are all sitting on top of an interposer, which links them together and has a bunch of, anyway, really nice thermal and other properties.
And then at that point, you know, we mentioned TSMC and fabs and their part in the story, which I think deserves a little bit more background, right? So fab means fabrication. That's where you take the basic building block, the raw material, and convert it into computing. So let's dive a little bit into what it involves, for any less technical people. First, what is a semiconductor? It's literally a semi-conductor. It's a material that,
due to the magic of quantum mechanics and other stuff, you can use to let current through or not. Fundamentally, that's the smallest building block of computing. And so what is a fab? It's something that takes raw material and creates nanometer-scale sculptures, or structures, of material that you can then give power to, that you can kind of power on or off,
and that you can combine in various patterns to do computations. So why is fabrication so complicated? Why is TSMC the one player that really matters? It sounds like there are a couple of organizations that can do fabrication, but TSMC is by far the best. Because
it's, like we mentioned before, the most advanced technology that humanity has ever made. You're trying to take this raw material and literally make these nanometer-sized patterns in it
for semiconductors, right? You need to do a little sculpture of raw material in a certain way and do that a billion times, in a way that allows for very few imperfections. And as you might imagine, when you're dealing with nanometer-sized patterns, it's pretty easy to mess up. Like, you let one little particle of dust in, and that's bigger than, I don't know how many transistors, but it's pretty big.
And there are like a million things that could go wrong that could mess up the chip. So it's like the most delicate, intricate thing you can attempt to do. And the technologies that enable this, that actually do the fabrication at nanometer-scale levels, and now we are getting to that sort of place where the quantum effects are crazy and so on. But anyway,
the technology there is incredibly complicated, incredibly advanced, and incredibly delicate. So, as we've kind of previewed,
you're now seeing TSMC trying to get to the US, and it's taking them years, it's going to take them years, to set up a fab. And that's because you have a lot of advanced equipment that you need to set up in a very, very delicate way. And you're literally kind of taking large blocks of raw material, literally these slabs of
silicon, I believe, and you're cutting it into little circles. You need to transfer that all around to various machines that do various operations. And somehow you need to end up with something that has the right set of patterns. So it's fascinating how all this works. And the advanced aspects of it, I don't really know, it's insane. And it costs hundreds of millions of dollars, as we've covered, to get the most advanced technology.
You have, like, one corporation that can do the technology required to make these patterns at, like, two nanometers, or whatever resolution we have nowadays. And so that's why fabrication is such a big part of the story. That's why NVIDIA farms out fabrication to TSMC. They have just perfected the art of it,
and they have the expertise and the capability to do this thing that very, very few organizations are capable of even trying. And that, by the way, is also why China can't just easily catch up and do these most advanced chips. It's just incredibly advanced technology.
Yeah, absolutely. And I think, so as we discussed this, by the way, we're going to talk about things called process nodes, or processes, or nodes. So these are fabrication processes that fabs like TSMC use. TSMC likes to identify their processes with a number in nanometers, historically, at least up until now. So they talk about, for example, the seven nanometer process node,
or the five nanometer process node. And famously, people refer to this as, well, there are three layers of understanding when it comes to that terminology. The first layer is to say something like, when we say seven nanometer process node, we mean that they're fabricating
their semiconductors down to seven-nanometer resolution, right? Which sounds really impressive. Then people point out, at the next layer, oh, that's actually a lie. They'll sometimes call it marketing terminology, which I don't think is accurate; that speaks to the third layer. The phrase seven nanometers is sometimes referred to as a piece of marketing terminology because, it's true, there's no actual component in there that is at, like, seven-nanometer resolution. Like, it's not like there's any piece of that that is truly physically down to seven nanometers.
But what the seven nanometer thing really refers to is it's the performance you would get if historical trends in Moore's Law continued, you know, there was a time back when we're talking about the, you know, the two micron resolution that it actually did specify that. And if you kept that trend going, the transistor density you would end up with would be that associated with hitting the seven nanometer threshold. We're just doing it in different ways.
So, my kind of lukewarm take on this is, I don't know that it's actually marketing terminology so much as it is the outcome-based terminology that you actually care about as a buyer, right? You care about, will this perform as if you were fabbing down to seven nanometers?
or will it perform as if you're fabbing down to three? And that's the way that you're able to get to numbers of nanometers that are, like, you know, we're getting to the point where it's like a couple of angstroms, right? Like 10 hydrogen atoms strung together. Obviously, we're not able to actually fab down to that level. And if we could, there'd be all kinds of quantum tunneling effects that would make it impossible. So that's the basic idea here. Today's leading,
Leading node is switching over to the two nanometer node right now. What you'll tend to see is the leading node is subsidized, basically entirely, by Apple. So phone companies, they want it small, they want it fast, Apple is willing to spend, and so they will work with TSMC to develop the leading node each year, each cycle, right?
And that's a massive partnership boost for TSMC. Other companies, former competitors of TSMC like GlobalFoundries, suffer a lot because they need a partner to help them subsidize that next node development. So this is a big, big strategic kind of moat for TSMC, that they have a partner like Apple that's willing to do that.
This means Apple monopolizes the most kind of advanced node for their phones every year, then that leaves the next node up free for AI applications. The interesting thing, by the way, is that might change. You could see that start to change as AI becomes more and more in demand, as NVIDIA is able to kind of compete with Apple potentially down the line for the very same deal with TSMC, right? If AI is just, if
fueling way more revenue than iPhone sales or whatever else. Well, now all of a sudden Nvidia might be able to muscle in and you might see a change in that dynamic. But at least for right now, that's how it's playing out. And so Nvidia now gets to work with the 5nm process for the H100. That's the process they used for it. They actually started to use the 4nm process, which really is a variant of the 5nm, but the details don't super matter there. Fundamentally, the story then is about how
TSMC is going to achieve these sorts of effects. And one part of that story is, how do you architect the shape of your transistors? The breakthrough before the most recent breakthrough is called the FinFET. Basically, this is a fin-like structure that they bake into their transistors, and it works really well for reasons.
There's the gate-all-around transistor that's coming in the next cycle that's going to be way more efficient and blah, blah, blah. But the bottom line is they're looking at, how do we tweak the shape of the structure that the transistor is made up of to make it more effective, to make it work with smaller currents, to make it better from a power density standpoint, with better thermal properties, and so on and so forth.
But the separate piece is, what is the actual process itself of creating that structure? That process is basically a recipe. So this is the secret sauce, the magic that really makes TSMC work.
If you are going to replicate what TSMC does, you need to follow basically the same iterative process that they do to get to their current recipe. This is like a chef that's iterated over and over and over with their ingredients to get a really good outcome.
You can think of a TSMC fab as a thing, a box with like 500 knobs on it, and you've got PhDs tweaking every single knob, and they're paid an ungodly amount of money, and it takes them a huge amount of time. They'll start at, say, the seven nanometer process node, and then, based on what they've learned to get there, they iterate to get to the five, the three, the two, and so on.
And you really just have to do it hands-on. You have to climb your way down that hierarchy. Because the things you learn at seven nanometer help to shape what you do at five and three and two and so on. And this is one of the challenges with, for example, TSMC, just trying to spin up a new FAB starting at the leading node in North America, wherever. You can't really do that. It's best to start a couple of generations before and then kind of work your way down locally because
Even if you try to replicate what you're doing generally in another location, dude, air pressure, humidity, everything's a little bit different, things break. This is why, by the way, Intel famously had a design philosophy for their fabs called Copy Exactly. And this was famously a thing where, you know, everything down to the color of the paint in the bathrooms would have to be copied exactly to spec, because nobody fucking knew why the
frickin' yields from one fab were great and the other one were shit, and it was just like, I don't know, maybe let's just not mess with anything, right? That was the game plan. And so TSMC has their own version of that. That tells you how hard this is to do, right? This is really, really tough, tough stuff. The actual process starts with a pure silicon wafer. So you get your wafer source; this is basically sand that has been purified, roughly speaking, sand, glass.
And you put a film of oxide on top of it. This is like oxygen or water vapor that's just meant to protect the surface and block current leakage. And then what you're going to do is deposit on top of that a layer of a material that's meant to respond to light. This is called photoresist. And the idea behind photoresist is if you expose it to light,
Some parts of the photoresist will become soluble. You'll be able to remove them using some kind of process.
or, you know, others might harden. And depending, you might have positive photoresist or negative photoresist, depending on whether the part that's exposed either stays or is removed. But essentially, the photoresist is a thing that's able to retain the imprint of light that hits your wafer in a specific way, right? So by the way, the pure silicon wafer, that is a wafer. You're going to
We're ultimately going to make a whole bunch of dies on that wafer. We're going to make a whole bunch of, say, B200 dies on that one wafer. So the next step is, once you've laid down your photo resist, you're going to shoot a light source at a pattern, sometimes called a reticle or a photo mask, a pattern of your chip. And the light that goes through is going to encode that pattern, and it's going to image it onto the photo resist.
And there's going to be an exposed region. And you're going to replicate that pattern all through your wafer in a sort of raster scan type of way, right? And anyway, so you are going to then etch away. You're going to get rid of your sort of like photoresist. You'll then do steps like ion implantation where you use a little particle accelerator to fire like ions into your silicon to dope it because semiconductors need dopants like basically
Yeah, you make some imperfections and that turns out to mess with how the electrons go through the material and it's all magic, honestly.
To that point of Copy Exactly, this is another fun detail in case you don't know. One of the fundamental reasons TSMC is so dominant, and why they rose to dominance, is yield. So actually, you can't be perfect. It's a fundamental property of fabrication that some stuff won't work out. Some percent of your chips will be broken and not usable, and that's yield.
And if you get a yield of like 90%, that's really good, if only 10% of what you fabricate is broken. When you go smaller, and especially as you set up a new fab, your yield starts out bad. It's like inevitable. And TSMC is very good at getting the yield to improve rapidly.
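A rough back-of-envelope for why yield is so economically decisive; every number below is an assumption for illustration, not a figure from the episode:

```python
wafer_cost = 17_000       # dollars per leading-edge wafer (assumed)
dies_per_wafer = 60       # large AI die (assumed)

for y in (0.9, 0.5, 0.3):
    good_dies = dies_per_wafer * y
    print(f"yield {y:.0%}: ~${wafer_cost / good_dies:,.0f} per good die")
```

Same wafer, same fab run, roughly three times the silicon cost per sellable chip at 30% yield versus 90%.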
And so that's a fundamental aspect of competition. If your yield is bad, you can't be economical and you lose. 100%. In fact, this is relevant when it comes to SMIC, which is TSMC's competitor in China, and which, by the way, stole a bunch of TSMC's industrial secrets. In a very fun way, but yeah, there's some fun details there, for sure. Yeah, yeah, like lawsuits and all kinds of stuff.
But fundamentally, SMIC stole a lot of that information and replicated it quite successfully. They're now at the seven nanometer level, and they're working on five, but their yields are suspected to be pretty bad.
And one of the things is, with China, the yields matter a bit less because you have massive government subsidies of the fabrication industry. And so they can maybe get away with that and still be competitive in the market, because the government of China, or the CCP, has identified this as a key strategic thing, so they're willing to just shovel money into the space.
But yeah, so this fabrication process has a lot of steps. By the way, a lot of them are cleaning, like a lot of them, just kind of polishing off surfaces, cleaning them to make sure everything is level. So there's a lot of boring stuff that goes on here.
Anyway, I work with a lot of guys who are very deep in this space, so I do like to nerd out on it, but I'll contain myself. The part of this process, though, that I think is sort of most useful to draw your attention to, is this idea of just shining a light source onto a reticle, onto this photo mask that contains the imprint of the circuit you want to print, essentially, onto your wafer.
So that light source, and the whole kind of optics around it, that is a huge, huge part of the tradecraft here. So when you think about the things that make this hard, number one, there's the recipe: how do you do these many, many, many layers of, you know, photomask and etching and ion implantation and deposition, all that jazz.
That know-how, that's what TSMC knows really, really well, right? That's the thing that's really, really hard to copy. But even if you could copy that, you would still need the light source that allows you to do this
photolithography, as it's called, the kind of exposure of specific patterns onto your wafer. And so those photolithography machines become absolutely critical in the AI supply chain, in the hardware supply chain. And there is really just one company that can do it well. And in a way, it's a complex of companies. This is ASML, a Dutch company in the Netherlands.
They have this really interesting overlapping history with the Carl Zeiss company, and they are essentially kind of a complex of companies, just because of ownership structure and overlapping talent and stuff like that. But it all comes through this ASML-Carl Zeiss complex.
So when we talk about photolithography, this very, very challenging stage of how do we put light onto our chip, or onto our wafer, such that it gives us, with high fidelity, the pattern we're after, that is going to be done by photolithography machines produced by ASML.
And that brings us to the final stage of the game to talk about how the photolithography machines themselves work and why they're so important. Does that make sense or is there stuff that you wanted to add on the TSMC bit?
I think one thing we can mention real quick, since we were touching on process nodes, is, you know, where does Moore's Law fit into this? Well, if you look back a decade or so ago, to 2011, we were at the 28 nanometer stage. Now we're getting into, like, we were using five nanometer, roughly, for AI, trying to get to two nanometer. And that
is not according to Moore's Law, right? Moore's Law has slowed down kind of empirically; it's much slower now, relative to the 80s or very early on, to get to a smaller process size. And that's partially why you have seen
The idea of CPUs having multiple cores, parallelization, and that's why GPUs are such a huge deal, even though we can't scale down and get to smaller process nodes as easily. It's like incredibly hard.
If you just engineer your GPU better, even without a higher density of transistors, by getting those cores to work better together, by combining them in different ways, by designing your chip in a certain way, that gets you the sort of jump in compute speed and capacity that you used to get just through getting smaller transistors.
Yeah, and it is the case also that thanks to things like FinFET and Gate All Around, we have seen a surprising robustness of even the fabrication process itself. So the five nanometer process first came out in 2020.
and then we were hitting three nanometers in early 2023. So, you know... There's still some juice to be squeezed, but it's slowing down, I think. Yeah, it's fair to say. Yeah, no, I think that's true. And you can actually look at the projections, by the way, because of the insane capital expenditure required to set up a new fab. Like, TSMC can tell you what their schedule is for, like, the next three nodes, going into 2028, 2029, that sort of thing.
And that's worth flagging, right? We're talking tens of billions of dollars to set up a new fab, like aircraft carriers' worth of at-risk capital. And it really is at-risk capital, right? Because, like Andrey said, you build the fab and then you just kind of hope that your yields are good, and they probably won't be at first, and that's a scary time. And so, you know, this is a very, very high-risk industry. TSMC is very close to base reality in terms of, like, unforgiving market exposure.
Right, so, okay, I guess, uh, photolithography, and this sort of, like, last and final glorious step in the process, we're really, we're gonna squeeze a lot of the high resolution into our, into our fabrication process. This is where a lot of that resolution comes from. So, let's start with
the DUV, the deep ultraviolet lithography machines that allowed us to get roughly to where we are today, roughly to the, let's say, seven nanometer node, arguably the five nanometer, there's some debate there.
When we talk about DUV, the first thing I want to draw your attention to is that there is a law in physics that says, roughly speaking, that the wavelength of your light is going to determine the level of precision with which you can make images, with which you can, in this case, imprint a pattern. So if you have a 193 nanometer light source, you're typically going to think, oh, well, I'll be in the hundreds of nanometers in terms of the resolution with which I can
I can sort of image stuff, right? Now, there's a whole bunch of stuff you can do to change that. You can use larger lenses. Essentially, what this does, it collects a lot more rays of that light. And by collecting more of those rays, you can focus more tightly or in more controlled ways and image better. But generally speaking,
the wavelength of your light is going to be a big, big factor, and the size of your lens is going to be another. That's the numerical aperture, as it's sometimes described.
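The "law in physics" being gestured at is essentially the Rayleigh criterion for lithography resolution. Here's the usual rule of thumb with illustrative numbers; the k1 and numerical aperture values are assumptions for this sketch, not specific tool specs:

```python
def min_feature_nm(wavelength_nm: float, numerical_aperture: float, k1: float = 0.3) -> float:
    """Rayleigh rule of thumb: smallest printable feature ~ k1 * wavelength / NA."""
    return k1 * wavelength_nm / numerical_aperture

print(min_feature_nm(193.0, 1.35))   # DUV immersion: ~43 nm features per single exposure
print(min_feature_nm(13.5, 0.33))    # EUV: ~12 nm features per single exposure
```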
So those are, anyway, those are the two kind of key components. 193 nanometers is the wavelength that's used for deep ultraviolet light. This is a big machine, costs millions and millions of dollars. It's got a bunch of lenses and mirrors in it. And ultimately, it ends up shining light onto this photo mask. And there's a bunch of interesting stuff about technologies like off-axis illumination.
and eventually immersion lithography and so on, that get used here. But fundamentally, you're shining this laser and you're trying to be really clever about the lens work that you're using to get to these feature sizes that might allow us to get to seven nanometers. You can go further than seven nanometers with DUV if you do this thing called multi-patterning.
So you take essentially your wafer and you go over it once, and you go over it again with the same laser. And that allows you to, let's say, do a first pass and then, not necessarily a corrective, but an improving, pass on your die during the fabrication process. The challenge is that this reduces your throughput. It means that, instead of passing over
your wafer once, you've got to pass over it twice, or three times or four times. And that means that your output is going to be slower. And because your capital expenditure is so high, basically you're amortizing the cost of these insanely expensive photolithography machines over the number of wafers you can pump out. So slowing down your output really means reducing your profit margin very significantly.
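A small illustration of that throughput argument; the tool cost and wafer rates are made-up assumptions:

```python
scanner_cost_per_hour = 1_500      # amortized cost of the lithography tool, dollars/hour (assumed)
single_pass_wafers_per_hour = 150  # single-exposure throughput (assumed)

for passes in (1, 2, 4):
    wph = single_pass_wafers_per_hour / passes
    print(f"{passes} pass(es): {wph:5.1f} wafers/hr, "
          f"~${scanner_cost_per_hour / wph:.0f} of scanner time per wafer")
```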
And so SMIC is looking presumably at using multi-patterning like that to get to the five nanometer node. But again, that's going to effectively cost, in the same way as like yield is really bad, it's going to cost you throughput. And those things are really tricky. So that is the DUV machine. It allowed us to get to about seven nanometers.
But then, at the five nanometer level, pretty quickly you just need a new light source. And that's where EUV, extreme ultraviolet lithography, comes in. It is a technology that has been promised forever, for, I don't know, 10 generations or something of TSMC processes, where they're like, this is going to be the one where they use EUV. And there's always some stupid shit that comes up, and then they can't ship it. So finally, we're at the EUV generation now. The EUV light source is 13.5 nanometers.
It is really, really fucking cool. I'm just going to tell you how crazy this is. Somehow, you need to create 13.5 nanometer light. By the way, what I'm sharing here, there's a really great explainer of this that goes into much of this detail and has great illustrations on the Asianometry YouTube channel. Check that out. That's another great resource.
But so it turns out like, so back in the day, people realized that you could fire a laser at a tin plate, like a flat sheet of tin and get it to emit 13.5 nanometer light. 13.5 nanometers is like super, super, like extremely ultraviolet, very, very short wavelength high energy light.
The problem with that, though, is that what you tend to find is that the light is kind of going to fly out in all different directions. And you need to find a way to collect it somehow. So people went, OK, you know what? Like, let's experiment with concave tin plates. So we're going to shape a tin plate, kind of in the shape of a concave mirror, so that when we shine light at it, the light that we get back will hopefully be more focused, more, yeah, more not collimated, but more controlled, let's say.
So they tried that. The problem with that is, when you shine light on that concave tin plate, you get a bunch of sputtering, you get a bunch of vaporization of the tin. And so, yeah, you produce your 13-nanometer light, but that light gets absorbed by all these annoying tin particles that then get in the way.
So you're like, ah, shit, well, OK, now we're screwed, tin doesn't work. But then somebody came up with this idea of using tin droplets. So here's what's actually going to happen. It's pretty fucked up inside an EUV machine. So you've got a tin droplet generator. This thing fires these tiny little, like, 100-micron tin droplets at about 80 meters a second. So they are flying through this thing. So tin droplets go flying, and as they're flying,
a pre-pulse laser is going to get shot at them and hit them to flatten them out, turning them into basically the plates, the reflective plates, that we want, getting them in the right shape. So you're a tin droplet, you're flying through at top speed, you get hit by laser pulse number one to get flattened.
And then in comes the main laser pulse from a CO2 laser that's going to vaporize you and have you emit your plasma. Now, because you're just a tiny tin droplet, there's not enough of you to vaporize that it'll get in the way of that 13.5 nanometer light so we can actually collect it. So that's like...
I mean, you are taking this, it's like hitting a bullet with another bullet twice in a row, right? You've got this tin droplet fly through crazy fast, prepulse laser flattens it out, then the next laser, boom, vaporize it, out comes the EUV light. And by the way, that has an overall conversion efficiency of about 6%, so like you're losing the vast majority of your power there, out comes the EUV light. And then it's going to start to hit a bunch of mirrors.
No lenses, just mirrors. Why? Because at 13.5 nanometers, basically everything is absorbent, including air itself. So now you've got to fucking have a vacuum chamber. This is all, by the way, happening in a fucking vacuum, because your life now sucks because you're making EUV light. So you've got a vacuum chamber, because
air will absorb shit and you're not allowed to use lenses. Instead, you've got to find a way to use mirrors, because your life sucks. Everything in here is just mirrors. There's about a dozen, just under a dozen, mirrors in an EUV system.
All they're trying to do is basically replicate what lenses do. You're trying to focus light with mirrors, which, based on my optics background, I mean, that is a hard thing to do. There's a lot of interesting jiggery-pokery that goes on here, including poking holes in mirrors so you can let light go through, mostly, and hopefully not get too lost. Anyway, it's a mess, but it's really cool. But it's a mess. And so you've got these, like, 12 mirrors or 11 mirrors or 10 mirrors, depending on the configuration,
desperately trying to collect and funnel this light. It's all happening in vacuum. Finally, it hits your photomask, and even your photomask has to be reflective, because light would just be absorbed in any kind of transmissive material. And so, anyway, this creates so many painful, painful problems. You're literally not able to have any of what are called refractive elements, in other words, lens-like elements where the light just goes through, gets focused, and blah.
No, everything has to be reflective all the time. And that is a giant pain in the butt. It's a big part of the reason why these machines are a lot harder to build and a lot more expensive. But that is EUV versus DUV. It seems like all you're doing is changing the wavelength of the light. But when you do that, all of a sudden, you'll find... so even these mirrors, by the way,
are about 70% reflective, which means about 30% of the light gets absorbed. And if you've got 10 or 11 multilayer mirrors, then all the way through, you're going to end up with just 2% transmission. Like if 30% of light gets lost at mirror 1, 30% mirror 2, if you work that through with 10 mirrors, you get to about 2% transmission. So you're getting really, really crap efficiency on all the power you're putting into your system.
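Working that compounding loss through explicitly (70% reflectivity is the approximate figure mentioned above):

```python
reflectivity = 0.70
for n_mirrors in (10, 11):
    surviving = reflectivity ** n_mirrors
    print(f"{n_mirrors} mirrors: {surviving:.1%} of the EUV light survives the optical train")
```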
By the way, the CO2 laser is so big, it's got to be under the floor of the room where you're doing all this stuff. This whole thing is a giant, giant pain in the butt, and that's part of the challenge. That is EUV.
There's also high-numerical-aperture EUV, which is the next beat, and that basically just involves using effectively bigger lenses, or rather tweaking your mirror configuration, because you're in EUV, to effectively collect more rays of light,
so you can focus down more tightly. The problem with that is that all the setup, all the semiconductor fabrication setup, assumes a certain size of optics. And so when you go about changing that, you've got to refactor a whole bunch of stuff. You can't image the whole photomask at once; the size of the photomask that you can actually image, in other words, the size of the circuit you can imprint on your
chip drops by about 50%. So now, if you want to make the same chip, you've got to stitch together two kind of photo masks, if you will, rather than just having one clean circuit that you're printing, you're going to stitch together two of them. And how do you actually get these insanely high resolution circuits to line up in just the right way? That's its own giant pain in the butt with a whole bunch of interesting implications for the whole supply chain there. I'm going to stop talking, but the bottom line is,
EUV is a big, big leap forward from DUV, and it's what China right now is completely missing. So export controls have fully prevented China from accessing EUV machines, let alone high-NA EUV. So they're all on DUV. They're trying to do multi-patterning to match what we can do at TSMC and other places with EUV.
Yeah, I think you did a great job conveying just how insane these technologies are. Like, you know, once you realize how absurd what's going on is in terms of precision, it's pretty mind-blowing.
And I think it also brings us to maybe the last point we'll get to, and a large part of why we're doing this episode, which is export controls. Maybe we can dive into what they are, what is being controlled, and how it relates to fabrication, to chips and so on. Yeah, actually, great question, right? It's almost like people treat it as a given, like, we're going to export control, but what are you export controlling?
So there's a whole bunch of different things. If you go through the supply chain, basically, you can make sense of it a bit more. The first is, hey, let's prevent China from getting their hands on these EUV lithography machines, right? They can't build them domestically. They don't have a Carl Zeiss. They don't have an ASML. So we can presumably cut them off from that. And hopefully that just makes it really hard for them to domesticate their own photolithography industry.
Secondly, as a sort of defense-in-depth strategy, maybe we can also try to block them off from accessing TSMC's outputs. So in other words, prevent them from designing a chip and then sending it off to TSMC for fabrication. Because right now, that's what happens in the West. NVIDIA, say, designs a new chip, they send the design to TSMC, TSMC fabs the chip,
and then maybe packages it or whatever it gets packaged, and then they send it off. But what you could try to do is prevent China from accessing essentially TSMC's outputs. Historically, China's been able to enjoy access to both whatever machines ASML has pumped out and to whatever TSMC could do with those machines. So they could just send a design to TSMC, have it fabbed, and there you go.
But in the last couple of years, as export controls have come in gradually, the door's been closed on accessing frontier chips, and then increasingly on photolithography, such that, again, there's not a single EUV machine in China right now.
By the way, these EUV machines also need to be constantly maintained. So even if there were an EUV machine in China, one strategy you could use is just to make it illegal to send the repair crews, the 20 or so people who are needed to keep it up and running, to China. And presumably that would at least make that one machine less valuable. And they could still reverse engineer and all that, but the fabrication is part of the magic. So those kind of two layers are pretty standard. And then you can also prevent
companies in China from just buying the finished product, the NVIDIA GPU, for example, or the server, right? And so these three layers are being targeted by export control measures. Those are maybe the three main ones that people think about: photolithography machines, TSMC chip fab output, and then even the final product from companies like, say, NVIDIA.
The interesting thing, by the way, that you're starting to see, and this bears mentioning in the space, too, is like, NVIDIA used to be the only designer, really, I mean, for frontier, for cutting-edge GPUs. What you're starting to see increasingly is, as different AI companies, like Anthropic, like OpenAI, are starting to bet big on different architectures and training strategies,
their need for specialized AI hardware is starting to evolve, such that when you look at the kinds of servers that Anthropic is going to be using, you're seeing a much more GPU-heavy set of servers than the ones that OpenAI is looking at, which are starting to veer more towards a kind of 2-to-1 GPU-to-CPU ratio. And that's for interesting reasons that have to do with OpenAI thinking, well, maybe we can use
more verifiers, we want to lean into using verifiers to validate certain outputs of chains of thought and things like that. And so if we do that, we're going to be more CPU-heavy. And anyway, blah, blah. So you're starting to see custom ASICs,
the need for custom chips, develop with these frontier labs. And increasingly, like, OpenAI is developing their own chip, and obviously Microsoft has its own chip line, and Amazon has its own chip line that they're developing with Anthropic, and so on. And so we're going to see increasingly bespoke hardware,
and that's going to result in firms like Broadcom being brought in. Broadcom specializes in basically saying, hey, you have a need for a specific new kind of chip architecture? We'll help you design it. We'll be your NVIDIA for the purposes of this chip. That's how Google got their TPU off the ground back in the day. And it's now how OpenAI, apparently, reportedly, we talked about this last week,
is building their own kind of new generation of custom chips. So Broadcom likes to partner with folks like that, and then they'll of course ship that design out to TSMC for fabrication on whatever node they choose for that design. So anyway, that's kind of the big design ecosystem in a nutshell.
Yeah, and yet another fun historical, well, I guess, interesting historical detail. I don't know if it's fun. TSMC is unique or was unique when it was starting out as a company that just provided fabrication. So a company that could design a chip and then just
ask TSMC to fabricate it. TSMC promised not to then use your design to make a competing product. So prior to TSMC, you had companies like Intel that had fabrication technology; Intel was making money from selling chips, from CPUs and so on, right? TSMC, their core business was taking designs from other people, fabricating the chip, getting it to you, and nothing else. We're not going to,
you know, make GPUs or whatever. And that is why NVIDIA, well, why NVIDIA could even go to them, right? NVIDIA could not sort of ask a potential competitor, let's say AMD, I don't know if AMD does fabrication. But anyway, it could be the case that they do some design in-house and then contract TSMC to make the chips. And as you often find out, TSMC has a limited capacity for who it can make chips for.
So, you know, you might want to start a competitor, but you can't just like call TSMC and be like, hey, can you make some chips for me?
It's not that simple. And one of the advantages of NVIDIA is this very, very well established relationship going back to even the beginnings of NVIDIA, right? They very fortuitously struck a deal very early on. That's how they got off the ground, by getting TSMC to be their fabrication partner.
So they have a very deep, close relationship, and have a pretty significant advantage because of that. Yeah, absolutely. That's a great point to call out, right? TSMC is famous for being the first, as it's known, pure-play foundry, right? That's kind of the term. You'll also hear about fabless,
so, fabless chip designers, right? That's the other side of the coin, like NVIDIA. NVIDIA doesn't fab, they design. They're a fabless designer. Whereas, yeah, TSMC is a pure-play foundry, so they just fab. It kind of makes sense when you look at the insane capital expenditures and the risks involved in this stuff. Like, you just can't focus on both things. And the classic example, to your point, is that NVIDIA can't go to AMD. So, AMD is fabless.
But Intel isn't. And Intel tries to fab for other companies. And that always creates this tension where, yeah, of course Nvidia is going to look at Intel and be like, fuck you guys, you're coming out with whatever it is, Arrow Lake or a bunch of AI optimized designs. Those ultimately are meant to compete with us on design. So of course, we're not going to give you our fab business. We're going to go to our partners at TSMC. So it's almost like the economy wants these things to be separate.