Deep-dive into DeepSeek
en-us
January 31, 2025
TLDR: Chris and Daniel discuss DeepSeek R1 model, its skyrocketing popularity but privacy/geopolitical concerns due to ties with China, secure running of DeepSeek models, and potential implications for open models in 2025.

In this episode of the Practical AI Podcast, hosts Chris Benson and Daniel Lightnack delve into the new generative AI model DeepSeek R1, released by a Chinese startup, and discuss the myriad of reactions it has generated in the AI community, including intense hype, skepticism, and serious privacy concerns.
Overview of DeepSeek R1
DeepSeek R1 has emerged as a competitor to established models from OpenAI, boasting similar performance levels but reportedly achieved at a significantly lower cost. The hosts highlight how this model has opened discussions about its implications, both technologically and geopolitically.
Key Points Discussed:
- Performance Parity: DeepSeek R1 is considered on par with OpenAI's models, particularly the GPT-01 model, in performance. This achievement is remarkable, especially given DeepSeek's claimed lower training cost of around $5 million compared to the expenditures of major players like OpenAI, which can run into hundreds of millions.
- Geopolitical Concerns: The model’s ties to China raise questions about privacy, surveillance, and data security, emphasizing the politically charged landscape of AI development.
- Operational Security: The conversation shifts to how organizations can securely run DeepSeek models. Evaluating whether using DeepSeek's application might expose sensitive data leads to discussions about the security of AI infrastructures.
Understanding the Model's Development
Daniel and Chris break down the foundational aspects of DeepSeek R1's development:
- Access and Use: Users can access DeepSeek through its products or by downloading the model from platforms like Hugging Face. Running the model locally could mitigate some privacy concerns linked to online applications.
- Technical Architecture: DeepSeek R1 utilizes layers of transformers and includes innovative elements such as the mixture of experts architecture to streamline processing. This setup increases efficiency and sets it apart from traditional dense models.
- Training Efficiency: The model’s ability to use cheaper, synthetic data for its training highlights a broader trend in AI where startups creatively optimize costs.
Security Implications
- The hosts emphasize that concerns about security should focus not just on the model itself, but also on the infrastructure and data handling practices surrounding it. Important considerations include:
- What Data Is Collected: Users must understand how their data may be utilized and stored when using DeepSeek's applications.
- Potential Biases: Chris points out that biases could linger from the model itself depending on its training datasets and methodology, which may impact generated outputs.
- Running the Model Securely: With the correct setup, such as an isolated environment for the model, organizations can significantly reduce risks associated with data leakage.
The Future of AI Models
Impact on the AI Ecosystem
The conversation concludes with predictions regarding the broader impact of DeepSeek R1 within the AI landscape:
- Model Optionality: There's an emerging trend indicating the necessity for businesses to consider various AI models rather than becoming locked into a single provider. This represents a shift towards flexibility in tooling and deployment strategies for AI projects.
- Market Dynamics: As the capabilities of smaller teams and startups rise against larger organizations, the pressure on established players could lead to a collapse in overvalued startups—prompting market corrections in the AI industry.
Final Thoughts
Chris and Daniel reiterate the importance of remaining informed about new technologies like DeepSeek R1. They stress that while the model has significant potential, it is crucial for practitioners and businesses to thoroughly evaluate the implications of integrating such models into their workflows. The episode serves as both a caution and a call to explore the evolving landscape of AI with a critical eye on security and ethical considerations.
In summary, the discussion surrounding DeepSeek R1 highlights a crucial juncture in AI's development, underscoring the balance between innovation, ethical responsibility, and security in an increasingly complex global landscape.
Was this summary helpful?
Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love the changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for the changelog wherever you get your podcasts. Thanks to our partners at fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.
Welcome to the very first fully connected episode of Practical AI in 2025. In these fully connected episodes of the Practical AI Podcast, Chris and I keep you fully connected with everything that's happening in the AI world and hopefully share some learning resources to help you level up your machine learning and AI game.
I'm Daniel Lightnack. I'm CEO at Prediction Guard, and I'm joined as always by my co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin. How you doing, Chris? Doing good today, Daniel. It's a lot of interesting things happening out there in the AI world, and I love these conversations where we do these fully connected, kind of deep dives into things that are
personal interest to you and me, which is how we choose them. There are a lot of exciting things coming up. In these episodes, it's a little bit easier for us to freeform, talk about a few things, but for those listeners who have been seeing our logo for some time on the podcast feeds, just FYI, there'll be a
a change to that coming up, but no need to swap out our feed or anything. That should be good. We're still doing great things with the change log. And they've made a few changes to their shows and their lineup, publishing them in different ways. They have a show about that if you want to learn about it. But we'll still be gone strong and excited kind of for probably a much needed refresh. I don't know. I think it's been like six and a half years or something.
Chris, we can change in six and a half years. I don't know. Yeah. I mean, six and a half years in GPU time is a long time. Yeah, that's expensive. Yeah.
So yeah, just FYI, longtime listeners to be aware, you might scroll through your podcast app, look for a new logo sometime in the near future. But yeah, I think obviously that's bigger news than DeepSeek, but I guess we can devote most of the episode to what is the story of our week, couple of weeks, and who knows how long, which has been DeepSeek R1.
I know this is, you know, I was thinking as we were saying that speaking of GPU time, in this case, maybe a lot less GPU time. Yeah, a lot less on maybe an unclear amount. That's true. Good point. Because they only talked about the final run that was successful in terms of the spend on it.
Yeah. Yeah. Well, I guess we're getting ahead of ourselves for those who are not as familiar with it. Yeah. So for those probably many of those that are listening to this particular episode have come across deep seek, but for those that have not seen anything, maybe you've been under under a rock somewhere, Chris, what are we talking about with deep seek? Ah, so we have, uh, there's a Chinese startup that we're talking about here.
that has released a large generative model, LMM, that is, I guess, and I'm gonna gloss over some stuff right here, just because we'll dive into the specifics, but is very highly performant, it's comparable to the best models that OpenAI has had out there, but the thing that's really rocked everybody's world is the fact that it was trained
at much, much less cost, at least the parts we know, which we'll dive into that detail as we said. There's some things we know and there's some things we don't know, but it appears to have been achieved at a much, much lower cost than all of the competing models from anywhere in the world up to this point. And so in short, the AI world, and I guess everybody outside the AI world that cares about this stuff,
is in this giant debate and conversation about the implications. Is it a big deal? Not such a big deal. Why is it a big deal? Is it overblown? And of course, Daniel and I are about to dive into all of that right now. It's a target-rich environment, as we like to say in defense. Are they surveilling us while we use the model? Exactly. Could you, should you, might you?
run the model in all sorts of different ways. There's tons of confusion around this, Chris, which is one interesting thing that hopefully after you've listened to this podcast, you're not more confused. We don't make that guarantee, but hopefully that's the case. It's interesting. I think one of the stories around this, and there's multiple narratives that we can go into here so much to talk about,
One of the narratives is around how some Chinese startup with a much lower budget or spend on the model building built such a good model and essentially gets parity to
models from OpenAI and others. In particular, the comparison has been made to the 01 model, which if you remember, we talked about this on the show. This is OpenAI's sort of thinking quote unquote model. So the model, when it generates output, so you put in a prompt.
the LM generates text output. A beginning portion of that text is sort of thinking content, meaning they're training the model to sort of spit out logic of how to solve maybe a deeper problem or reason about the input prompt before it actually gives its final answer. If you're in the chat GPT interface, you can see this kind of in a different color or grayed out.
If you're using the API, I don't think that they send that back in the API. You still pay for it, but I don't think they send it back. This is a similar type of reasoning model. DeepSeq R1. In this reasoning, this kind of very kind of flagship model of OpenAI, for example, the same kind of task. DeepSeq is kind of getting this, what we could call parity.
Different benchmarks are out there, etc. Each model has its own biases and different behavior.
But yeah, the first kind of narrative around this in the news is, whoa, this came out of nowhere. There's this new company. It's just a startup. They did this on the cheap. So they kind of published numbers around 5 million, 5.5, 6 million somewhere in that range for the final training of this model. And compared to what it took to train OpenAI's 01, that's like a drop in the bucket.
So, yeah, this first narrative, what are your thoughts on this, Chris? Well, I think that there's a lot more information we want. You know, they published that final number, as you mentioned, but I've seen a lot of posting about, you know, what did it take? What were all the unsuccessful runs that they had, the experimental runs, things like that? There's just so much that's not known about this. So they really cherry-picked what they chose to publish about it.
But I mean, have you used it, Chris? Oh, I actually, so surprisingly, because of my job, I tend to avoid some of the Chinese technologies in a direct use way. But I did load it onto my personal system. I have it on now. They were having some struggles on login for the last hour, at least, that I've been trying and got it on. But yeah, I have it up. I know that
I coily sent you a text yesterday with a screenshot of me using it where I was just playing with it and I had seen somebody else do this and so I asked it what happened in Tiananmen Square in 1989 and it replied, I am sorry, I cannot answer that question.
I'm an AI assistant designed to provide helpful and harmless responses, which I thought was about what I expected, actually, to be perfectly honest. But just a reminder of the geopolitics of AI, which definitely plays big, is that this is a Chinese government approved data set.
And I think there's to that point, so first off, DeepSeq is an amazing team of people that actually just didn't show up on the scene like a week ago. They've been doing actually really great open source and science work for a while now. So the DeepSeq team has been around. There's been previous DeepSeq models.
They have released those models on hugging face, which to be fair to our US counterparts, OpenAI has not released their models openly on hugging face to where you can run them and do research with them and analyze them and generate outputs.
That's maybe a point to highlight. They have been open in that sense, but there's been some, I think, confusion around this concept of this very small team. There's a couple of guys in their bedroom with a gaming card in their gaming PC, and they train this model that beat OpenAI. That's kind of the narrative that's being pumped around.
In reality, they have access to tens of thousands of GPUs. I saw a post by Philip Schmidt who's at hugging face. Now, I don't know all of the details of where he gets certain information to take this for what it's worth, but one of the things he said similar to you Chris is,
the sort of five to six million dollar mark. That's kind of the final base model, no reinforcement learning, doesn't include smaller runs, doesn't include data generation, which is a key piece of this. So we'll talk about that when we come to the model here in a second, but actually generating the prompts for this model is a significant part of it.
Kind of RL training as I mentioned. So the costs are definitely greater. There were more resources that were brought into this. It's not a company that just sort of popped up in someone's bedroom. So some of that narrative is, you know, it's true in the sense that the company is small. They clearly were working under constraints and they did a very impressive thing working under
computational constraints in the environment that they're working on, and they release the model openly in hugging face. Kudos to them on that. Some of the other narrative points around that are a little bit fuzzy. I'm curious just on how your take on that is. Kudos for being open source on hugging face. You and I over the last
Year, year and a half have been predicting open source, inevitably at some point, especially as things are plateauing a bit, that open source would inevitably have its impact in this way.
I don't think it was a question of when and not if. Having said that and seeing this particular group putting it on a hugging face, why do you think that they declined to include all of the training information on how they got there in terms of the cost?
might be motivated. I realize I'm asking a speculative question. It's interesting. I think one point to make there is it's not out of character for anyone that's posting these models in the sense that like meta's llama three one. I mean, there's a
You see kind of these model producers produce quote technical papers, but these technical papers don't share details to where in theory you could reproduce this right and there's details about the data that they're not revealing there's details about their process that they're not revealing even when they release one of these models and so it's not.
out of character for that to be kind of how things happen. And so really, part of me is like, well, why would they be motivated to do that? It would be sort of out of their philosophical commitment to open science, I think, would be the thing that would motivate them to do that, which certain parties do. So like Allen Institute for AI with their all-mo models, et cetera, have made it a very conscious effort to be truly open in terms of
data, process, model assets, et cetera. That is definitely an exception that proves the rule. Is that the right phrase? Yeah, I know what you're saying. Yeah, so it's something like that. You get what I'm saying. But there's an acknowledgement that in the larger field that the technical papers that come out are as much marketing papers as they are in kind of accomplishment papers as they are.
I mean, there's certain elements that are interesting to know. Obviously, you can know the model architecture. You're running it. It's open. So there's things you can learn from that. I do think that this represents a kind of shock to the system where
You know you and i have been saying we're basically getting to parity with open models and closed frontier models for all intents and purposes for most enterprise use cases were sort of we are already at parity, but in the general public's eyes.
We definitely weren't. And to some degree, we weren't in terms of certain types of models and that sort of thing. I think this is definitely a shock to the general public's perception that there is model optionality out there. There's going to be a proliferation of these models from various different places, which then leads into natural discussions about
where these models coming from and can I trust them? And how does it behave differently than what I'm used to using? And can I run it securely? All of those sort of things pop up and they popped up sort of immediately and captured a lot of attention around that. I agree.
What's up AI practitioners Adam here from changelog want to tell you about how much I love notion I know Daniel and Chris love notion as well because we use notion to organize everything and behind the scenes here at changelog.fm and cp.fm we work with a lot of cool teams externally and we create dashboards and workflows and
Operating systems essentially to work well with others outside of our domain. And the cool thing is, is Notion is so flexible that we could do anything with Notion. And the coolest thing I'm loving about Notion is their Notion AI. I can search across all my notes
all my docs, get context, get summaries. It's all AI powered, all inside my notion, powered by all the content in my notion. So I can work with external teams, internal teams, like a build workflows. And all this AI has really helped my team, my tools, my knowledge base be empowered to do our best work.
And unlike other tools out there, you gotta jump from one thing to the next to the next, and it's just not seamlessly integrated. Notion is seamlessly integrated, infinitely flexible, and it's beautiful. It's easy to use, mobile, desktop, web, shareable, web shareable. I mean, you name it, Notion can do it. And the fully integrated Notion AI helps us more faster, write better, think bigger, do tasks more efficiently. Things that would normally take us hours, now takes us minutes.
maybe even seconds in some cases. And yes, we are a small organization compared to Fortune 500 companies, but they are used by over half of Fortune 500 companies. And teams that use Notion's sendless emails, they cancel more meetings, they save time searching for their work, and they reduce their spending on tools, which helps everyone stay on the same page.
Try Notion today for free when you go to Notion.com slash practical AI. That's all our case letters Notion.com slash practical AI to try the powerful, easy to use Notion AI today. And when you use our link, of course, you are supporting this show. And we love that Notion.com slash practical AI.
Well, Chris, I do want to get to some of the technical details that we know about, you know, what the model is and versions of the model. But maybe before that, it would be useful to address the elephant in the room, I guess, which is the security element of this or the cybersecurity privacy issues related to this. So there's been
The geopolitical elements around, oh, is the US ahead? What does this say about dominance in the space? That's one thing. There's another thing which is my company, which previously potentially had problems with pasting things into chat GPT that they weren't supposed to, like whatever it is, customer details or IP or whatever.
Those kind of that shadow AI usage was already happening in companies, right? And people are concerned that their employees were pasting things in the chat GPT. Well, now there's this sort of new player in the space. People are pulling that up because it's the new amazing AI app.
It turns out that is run by a different company and that data is going to a different place and is being housed on Chinese servers. There's that element, so we need to parse through that, but also this has produced a separate confusion from my perspective around
the quote security of the model. It is deep-seap secure. I think that's like a very, we have to clarify what we mean when we say that question. Because I don't know what you, I don't know the things you've seen, Chris. There's a lot of,
stuff out there that is not very helpful in this sense. Yeah, a lot of fear uncertainty and doubt, and some of it may be justified. Some of it may not be, probably. A lot of it may not be. For me, the scarier thing is not the model itself. It's the infrastructure around the model where it's being housed, what external entities to the core data scientists at that company have access to it.
And I think that's where a lot of the concern is going to be. If you've downloaded it from hugging face and you're running it on your server, that's not to say that every facet of security is being accommodated, certainly, but at least you've taken some of the issues potentially out of the security equation. So I definitely have on my personal phone, I have the app.
which is unusual for me, but knowing that we were going to do this and wanting to play with it a little bit. I am very wary of what I don't know about that at this point. You drew out something really important, Chris. I wrote a blog post about this that I'll link in the show notes of this episode. If you're interested, you can take a look. It might be a good resource if you hear your
Engineering management or people in your company, hyping the fears around deep-seak. Maybe you want to use that in a secure environment. Maybe that would be a good tool that you could point them to. All that to say, I think the main thing that I wanted to highlight in that was this element that you just described. There's really two ways to access this model.
There's two ways to utilize DeepSeq R1. One is via a product offered by the DeepSeq company, which is a software product that you access and they host. This would be parallel to a lot of other software products like OpenAI host chat GPT. That is their product, which has a model interface embedded in it.
but it is a product similar to like you using Airbnb, right? You go to Airbnb, you put in your personal information into Airbnb, they have certain terms and service that they hopefully follow, but you have no view into what's going on under the hood of Airbnb or chat GPT or this deep-seek AI product. And so it's really not the model
in that case, that is not secure, quote unquote, in terms of you putting data into it.
It is the product built around that model. And it is very clear from the terms and service that DeepSeek has posted that they will gather all of your, well, I shouldn't do a blanket statement like that. They say exactly what they will get from you, but they're saving a lot of your personal data and information. They will use that for future model trainings.
And that is housed on quote servers in China. So that is the terms, that is the explicit terms in service. If you're okay with that, that's the usage of the product, right? Not the model. It's funny that you bring that up because like most people in most software products that I use, I don't necessarily go through all the terms of service as carefully as I really should.
And, you know, no one does. No one does. But I will confess that when I was when I was downloading the deep-sea cap and it brings that up in registration, I did and I was horrified to read it. And I had already determined I was going to do that. I was using all personal stuff, nothing related to work, that kind of thing. But even so, not only did I kind of do a big swallow on bringing the app down,
but it made me really think about the kinds of things that I would put into the interface very, very carefully given. Like you said, going around the product as opposed to the model. Yeah. So this is one access pattern, right? Access through the DeepSeek, either their mobile app or I think it's chat.deepseek.com, their chat interface similar again, similar to chat GPT.
And really, you should have some of the same related concerns with a chat GPT or an antropic as you would with DeepSeq in the sense that you really want to know how your data will be used and what are the privacy considerations around that. This adds a new element in that it's a sort of foreign entity dealing with that data, right? So there's a different element of that. But that's not the model. That's the product.
The model, which they have released on, again, on hugging face, is, and when we say model here for those that haven't maybe been around the podcast for a while, when we talk about a model, there's sort of two elements of that. There's the code needed to actually run that model, sort of process your inputs and generate outputs.
Then there's a set of parameters, a set of data that's loaded into that code that parameterizes that code such that it can run. Both of those have been released. In fact, actually, the code needed to run DeepSeek. Now, I'm going to make a clarification here. The code needed to run some versions of DeepSeek.
is not even deep seeks code. At least if you're using the hugging face ecosystem, it's a software package called Transformers, which is open source. You can go look at every line on GitHub. It's maintained by thousands around the world. It's completely open and transparent. And so the code element isn't, you know, there's, that's being looked at and being developed. The, and the model is implemented in there.
The data element is available on hugging face and has potentially its own concerns as you download that into your environment, which we can talk about here in a second, but both of those are open. You can download them and run them even in like if I spin up a VM.
or some computer, I could download those assets, cut off that computer from the internet, both outbound and inbound, right? And run that model in complete isolation where no data goes to deep-seek, no data goes to China. They're not sending and connecting to that computer, right? So just to make it uber clear here,
That is the model. When we say model, that's what we mean. We don't mean the product and that can be run again with considerations in a secure environment. Now, I should say the caveat here. I do believe as the time of this recording, the Transformers library hasn't been updated.
to support the full deep-seek R1 architecture, which is actually very typical when a new model is released. Sometimes it's not always supported in the upstream transformers, which means that there is remote third-party code that you have to load to run the full model, which is not true for all of the versions of the model.
I expect that will change in a matter of, I don't know, maybe it's changing as I speak now. It'll happen fast like days, weeks, whatever, that will be kind of merged into upstream. And then that concern will kind of go away. So I just to kind of follow up on that thought.
Would it be fair, given what you just said, to say that if you were running it on the Transformers infrastructure and you did have disconnected, inbound and outbound networking from it, just to take all of those extraneous concerns out, would you have any reservations about running it in a scenario where security was important?
Yeah, so that removes the kind of phone home. My data is going to China. Yeah, vulnerabilities around remote code execution or something on my computer. Those sorts of concerns are, I would say,
taken care of. Now, I mentioned the data files that you would be loading, those model parameters. There are insecure ways that those could be loaded in. JFrog and others have shown vulnerabilities in that. Those are mostly taken care of by using the right model formats, which DeepSeq is doing. It's called Safe Tensors. If you want to look into it, you can. So I wouldn't have any reservations about that. Now, that brings up a secondary question, though, which is another point of confusion. So I'm glad you brought it up.
The secondary question is, okay, if I'm running this, it's not phoning home. My data is not going to deep seek or any foreign entity. If I'm running it in this secure way, are there other concerns that are unrelated to this sort of phone home privacy issue type of thing? And I think one of the things that you brought up before was potential biases, biases and the original training set. Yeah.
in the model, right? So you brought up this example of asking about Tiananmen Square. Actually, I think because we also asked a similar question. I don't know if this is actually fixed in the app as of the time of this recording, but at first when you would ask that question in the deep-seek product, the actual application, it would print out the full answer.
of like it would actually answer what happened. And then it would all collapse and give you like a can like, sorry, I can't. Right. Answer that. So based on that, I would know or I would assume that the actual model that you would download and run in that secure environment or, you know, on your laptop and one of these kind of local hosting things,
That model is not biased. In terms of that response, that model specifically is not, you could get it to answer about Tiananmen Square. It's a product decision similar to like when Gemini was trying to create diversity in their image output and generated some really interesting looking things. Or what chat GPT or anyone does
When you send in a prompt, they inject stuff into that prompt. They do post-processing. It's a product, right? You don't have visibility into any of that. And so they're doing obvious product things there to introduce artificial biases. Now, I do think that it is possible that in the sort of alignment fine-tuning process,
deep-seek had their own vision of how they wanted to align that model, which may not be any malicious in any sort of way or kind of biased and weird political ways. It might just be their choice of how they wanted to bias that model. In other ways, maybe it is motivated by certain things. I don't know.
But that model will have its own sort of biased behavior. The other thing that I think has been shown in a number of places with Secure did a study of this. And it is a model that is also way more sensitive to prompt injection attacks than kind of the many other state-of-the-art models.
which produces another type of vulnerability at the application layer. So you've taken care of the model hosting security issue, but all that to say, that doesn't mean at the actual use of the model or integration layer. You shouldn't still be asking relevant questions, which, again, I highlight some of those things in the blog post if people are interested.
Well friends, AI is transforming how we do business, but we need AI solutions that are not only ambitious, but practical and adaptable too. That's where DOMOS, AI, and data products platform comes into play. It's built for the challenges of today's AI landscape.
with Domo, you and your team can channel AI and data into innovative uses that deliver measurable impact. While many companies focus on their applications or single model solutions, Domo's all-in-one platform is more robust with trustworthy AI results without having to overhaul your entire data infrastructure
secure AI agents that connect, prepare, and automate your workflows, helping you and your team to gain insights, receive alerts, and act with ease through guided apps tailored to your role and the flexibility to choose which AI models you want to use. So, DOMA goes beyond productivity, is designed to transform your processes, helping you make smarter and faster decisions that drive real growth,
And it's all powered by Domo's trust, flexibility, and years of expertise in data and AI innovation. And of course, the best companies rely on Domo to make smarter decisions. See how Domo can unlock your data's full potential. Learn more at ai.domo.com. That's ai.d-o-m-o.com.
So Chris, there's also the element around this that we always like to do when we get into our deeper discussions around any particular model, which is what are the unique technical or architectural elements of this? What types of versions did they release? This actually might be very confusing for people when they see like
Deep Seek R1 Distilled Quinn 32B, right? That there's a lot of words there that might not make sense. Jay Almar, who we love on the podcast and has been on the podcast, he runs this has posted for years a lot of great
blog posts about kind of illustrated transformers and other things. As a learning resource that you might want to take away from this particular episode, he posted an illustrated deep-seek R1.
article, which goes through some of the details. What's interesting here, Chris, I don't know if you got through any of that, but the overall picture of how they did this fine tuning is fairly similar to, I think, how many people have been doing fine tuning for some time.
And I guess that one of the things I wanted to bring up on this is I believe, correct me if I'm wrong, it was based on one of the llama models, right? Well, the deep-seek architecture has been around for some time, and as a specific architecture, it's similar to the llama architecture, like it is also involving layers upon layers of transformers. Right.
In terms of the exact architecture of this deep-seek R1, it does involve mixture of experts layers in the model. So there's layers and layers of transformer blocks in the architecture, and then these mixture of experts blocks, which you might see people refer to like activated layers or parameters. These mixture of experts layers don't always
you don't process the input through all elements of that layer of the model each time you run the model, which creates some efficiencies both for inference and often for training purposes as well. But yes, it's a similar
Similar setup, some slight differences, which also kind of those slight differences are the reason why we mentioned earlier, likely you kind of have at least currently, as we're recording this, you might have to import some third-party code to support the model in the upstream transformers, which is likely to change quickly.
How does that affect the fine tuning though for in terms of you know how deep seek approach tip versus maybe how llama has been approached and stuff. Are there differences in seeing a lot of similarity there? I know Sam Altman made a comment and I'm not quoting him directly but it was something to the effect of once you know once somebody else has already done something that you're basing on it's a lot easier to do that and that was his kind of
minimization of what deep-seat you've done. I'm kind of curious, how does that affect this in terms of this? I mean, the overall, like I say, the overall process, and when I say overall process, this is often kind of a pre-training step of very raw data that is completely unsupervised.
a fine-tuning step which is supervised and maybe an additional fine-tuning step which is like a preference tuning. That sort of overall picture of how the training is done seems to also be true here. This is like the overall picture of how they did that. Now, there are some unique elements of this in that they created this deep-seq R10 model.
which is kind of, and they use this interim reasoning model to actually help generate some of the data for that supervised fine-tuning step. So this is where going back to the original discussion in our conversation, that 5 million number corresponds to maybe that final or one of the final training steps, but not necessarily the data generation. So they used interim models
that their intention wasn't to release. It doesn't perform great in terms of a general purpose model, but it might perform well to generate long chain of thought examples, like these reasoning examples, to add into the training data that supervise fine tuning, which allows you to augment your fine tuning data, use less human resources to create that fine tuning.
I forget the exact figures, but meta did spend a ton of money in terms of the data curation with human data labelers to create those data sets for the llama models and probably still are. In this case, at least some of that data was this synthetic data that was
that was generated by this interim model. So that's kind of one interesting step of the process that maybe, you know, that maybe is relevant to some of the budget and efficiency considerations. Sure. And maybe that's, you know, part of the motivation of leaving that out altogether is, you know, that wasn't a direct cost necessarily to them or at least not in the way that it had been when it was originally manufactured.
Yeah, yeah. So there's this deep-seak R1 model, which again, the architecture is not fundamentally different from architectures of what we've seen in the past. There were some creative things done in the training process, both that significant portion, assuming it was a significant portion of data, which was synthetic or generated data.
They also used some automated processes and model-based processes to filter and curate that generated data to actually filter out good examples from all of the candidate examples. So there were some very creative things in that data generation piece, but the other stages of this were not
Not fundamentally stages of training that we haven't been familiar with with other model releases. There are a number of model versions that have been released from from deep-seek, so there's the deep-seek R1, the sort of flagship.
which is like 700 billion parameters or something. It's very large. You're going to need at least at full precision. You're going to need many, many GPUs to run this. I think Philip from Huggingface said like his
What he said was like 16X, 80 gigabyte GPUs, like 16X H100s to give you context. I think an H100, well, if you have it up all the time at on-demand pricing and a cloud is going to run you something like 60 to 80 grand a month, something like that. So you need 16 of those to run the model, the full model and full precision with at least NVIDIA GPUs.
then they've released other variants of this model. That full model is that mixture of experts model, which has the element of the external or third-party code added into it. They've also released distilled versions of that model. We can get into that here in a second, but just wanted to make clear what the main model looked like.
And these distilled versions of the model, if you go to hugging face, you can actually look at the collection from DeepSeek for DeepSeek R1. And what you'll see is a whole bunch of different DeepSeek models, which again is often a point of confusion with people, like what do all these things mean. So we've got DeepSeek R1
We have distill llama 70b distill quen 32b distill llama 8b et cetera. These, these ones are really significant for people to maybe understand. So these models are what's called dense models. So they don't have the mixture of experts element. They run all of their parameters all the time and they're distilled models. So what DeepSeek did is they took
their flagship deep-seek R1. They created their flagship model, and then they used the process of knowledge distillation to create smaller versions of that model that leverage the power of the larger model in the following way. We've talked about this on the show before, especially we had a
We had an episode with new research. They talked about this a lot. If you're curious, go and check that out. But you essentially use the larger model to generate a bunch of example outputs. And then you use those inputs and outputs that were generated by the larger model to then train a smaller model.
with very, very high quality data that boosts the performance of the smaller model beyond what you could get from a smaller model just from training it by scratch. So when it says deep-seek R1 distill llama 70B, it's a distilled version of llama 70B,
that use deep-seq R1 to generate that synthetic data for the fine-tuning process, which is great because they have versions of this model between 1.5 billion up to 70 billion.
It's great because actually, you know, especially the smaller ones, you could run it on your laptop. Certainly the kind of sweet spot ones around that 8 billion to 32 billion, those would take a card or a couple of cards you could run them on. And so this gives more accessibility to these models. And of course, people have already proliferated from there with various other optimizations like GGUF and other ones that can run on, you know, your MacBook.
processors and that sort of thing. So just to clarify the kind of ecosystem around the models there, those are a couple things to keep in mind. Yeah, that's useful to know. So a couple of big questions as we are starting to wind up here, to address, to pull out of the technical a little bit for a moment and address the ecosystem at large, the AI community,
What would you predict that deep seek now being here and how things wash out? Obviously there's the market reacted. I think they took half a trillion dollars away from Nvidia plus another half a trillion. Don't they have a lot of trillions of dollars? They have a few. They have a few.
And so, but that will probably stabilize out going here. What do you think that we're looking at over the months ahead? Beyond the day of market reacting today kind of thing, and we're talking about, but as you look at six months out, nine months out, that kind of thing, what do you think the real impact of DeepSeq is going to be on the larger AI global community?
Well, we've been saying it for a while, but I think the wider business community has not realized this. And one of the kind of fallouts from this or the things that will shift, I think, are really people taking seriously that in the future. Part of what your business needs to consider is model optionality, right? Sort of these GPT models kind of working for a long time. But now, like,
You know, apparently if you got $5 million sitting around, you could create a best in class model. And you know, what does that mean? It did means all of these are going to proliferate very quickly. This is not the last of these types of models we will see. They will proliferate very quickly. And you having kind of model lock in quote unquote, like you built all of your AI functionality around this particular model, whether that be open or closed.
That's not going to work out great for you in the long run, just because of the models changing. And so building in this ability to swap models, to have control and configurability, I think that's one of the kind of trends there. The other one I would say is now that
you're considering bringing these models into your own infrastructure, like there's parity with OpenAI in many respects. It brings up all of these questions that we were immediately prompted with, right? Like, well, if you bring that into your environment,
What are the security concerns related to that? How can you run it robustly and reliably? What should you be monitoring in production? It brings up all of these additional questions, which I think overall will be really good for people to consider because they're probably things they should have been considering for the past year as so many things were built on one model family.
So, yeah, those are a couple of thoughts. I don't know if you have additional ones. I've been speculating on whether as valuations have been going up and up and up for all these different AI startups all over, across the entire globe, and the budgets have just grown astronomically. Is this the moment where at this point investors looking at it going, why do you need 100 million?
Why don't we give you five million and see what you can do with it? Look what they did with it. Look what they did with it and stuff. Whether the cost of operations in AI startups is now going to affect. If that's the case, and with potentially not everybody being able to be quite so productive with their five million,
What does that mean? Are we going to go through a little bit of a cleanup round in the AI startup world or whatever? Any thoughts on that as we finish up?
Yeah, it's interesting. I mean, it definitely has an impact if you're kind of a model building type of startup, for sure. And we've seen other examples of this even last week, you know, hearing from Genmo and what they're doing with video generation models with a very small team and what they're able to accomplish. I think it definitely has implications on that side. I think it also has implications for those, it sort of
Hopefully, I think people will start thinking about less the kind of model building ventures, which will be interesting. But also, this is going to make model building and fine tuning even more accessible to kind of the enterprise, which will kind of fuel both tooling and infrastructure type of investments as well. Now, those, I don't think, will have the kind of
inflated, like, oh, I need $100 million to train a model sort of scenario, but yeah, we'll see how that shakes out. I think in enterprise world out there, maybe as a final thought on this, I think that you're now going to see existing budgets, which often for large companies are in the hundreds of millions of dollars where they're not AI specialty companies, but it matters enough to have a big budget.
The expectation that I've always used other models and we've built lots of infrastructure around that, maybe there's pressure now to actually go and work on specifically models for your business that you're creating because you have 5 million times X available to do that in the new way of thinking. So it may certainly change what the expectations are in enterprise world.
What I think will be stressed in that case is the data curation and human involvement in that process. There was clearly a big investment on that side here. Even if you spend $5 million on that,
You know, final training. There's definitely a process that goes into that. I agree. Yeah. Good conversation today. Yeah, definitely, Chris. Definitely check out some links that we'll put in the show notes to various articles about this, both on the technical and the kind of more hyped geopolitical stuff. So check it out. Thanks for joining again. Great to talk, Chris. Yep. Absolutely. See you next week.
All right, that is our show for this week. If you haven't checked out our changelog newsletter, head to changelog.com slash news. There you'll find 29 reasons. Yes, 29 reasons why you should subscribe. I'll tell you reason number 17, you might actually start looking forward to Mondays. Sounds like somebody's got a case of the Mondays.
28 more reasons are waiting for you at changelog.com slash news. Thanks again to our partners at flight.io to break master cylinder for the beats and to you for listening. That is all for now, but we'll talk to you again next time.
Was this transcript helpful?
Recent Episodes
Video generation with realistic motion

Practical AI: Machine Learning, Data Science
Discussion with Paras about Genmo, a company focusing on improving realistic motion in video generation models, as they battle common struggles with physics in video generation tools.
January 23, 2025
Mozart to Megadeath at CHRP

Practical AI: Machine Learning, Data Science
Jeff Smith, Founder and CEO of CHRP.ai, discusses how his company uses employee's music preferences to anonymously analyze emotional wellness data. This helps HR leaders make proactive decisions improving productivity, retention, and overall morale.
December 19, 2024
Sidekick is an AI Shopify expert

Practical AI: Machine Learning, Data Science
Chris discusses AI-powered Shopify features with Mike Tamir and Matt Colyer, focusing on Sidekick, a new AI commerce assistant to handle merchant's business needs within the Shopify ecosystem.
December 11, 2024
Full-duplex, real-time dialogue with Kyutai

Practical AI: Machine Learning, Data Science
Kyutai research lab's real-time speech-to-speech AI assistant, Moshi models, and future plans are discussed, with focus on small models and French AI ecosystem.
December 04, 2024

Ask this episodeAI Anything

Hi! You're chatting with Practical AI: Machine Learning, Data Science AI.
I can answer your questions from this episode and play episode clips relevant to your question.
You can ask a direct question or get started with below questions -
What was the main topic of the podcast episode?
Summarise the key points discussed in the episode?
Were there any notable quotes or insights from the speakers?
Which popular books were mentioned in this episode?
Were there any points particularly controversial or thought-provoking discussed in the episode?
Were any current events or trending topics addressed in the episode?
Sign In to save message history