I have to tell you about the new hit podcast. It's about any topic you can possibly imagine. The hosts seem to know just about everything.
Their banter is warm and engaging. Frankly, they sound like people I'd love to hang out with. But I can't, because they're not real. Plot twist: the hosts aren't even human. Yeah, it's pretty wild what's happening here. The duo you just heard hosts Deep Dive, an experimental audio feature released by Google this September. And the hosts, their voices, and everything they're saying is AI generated.
To create a Deep Dive episode, the user drags a file, drops a link, or just dumps text into a free tool called Notebook LM. They can input up to 50 PDFs, web pages, Word docs, or really any piece of information. Notebook LM then takes that information and makes it into an entertaining, accessible conversation. Those conversations are fairly accurate, but they're not without the hallucinations and misinterpretations many of us have come to expect from generative AI tools.
Google calls the resulting clip of around 10 minutes an Audio Overview. You might just call it a podcast. And I can't stop listening.
I'm not the only one. From students in need of a study aid to AI titans of Silicon Valley, the platform is gaining popularity. Andrej Karpathy, a deep learning expert who co-founded OpenAI and led Tesla's team working on computer vision for Autopilot, wrote on X, "Deep Dive is now my favorite podcast. The more I listen, the more I feel like I'm becoming friends with the hosts. And I think this is the first time I've actually viscerally liked an AI. Two AIs."
So, could it soon be your favorite podcast? Aside from this one, of course. From the Wall Street Journal, this is The Science of Success, a look at how today's successes could lead to tomorrow's innovations. I'm Ben Cohen. I write a column for the Journal about how people, ideas, and teams work, and when they thrive. This week, we're diving into Google's Deep Dive: how it works, and when it doesn't.
The exchange you heard at the top of the show came from dropping my recent column on this new Google audio feature into Notebook LM. But you can make podcasts out of pretty much anything. Wikipedia pages, YouTube clips, random PDFs, your college thesis, your notes from that business meeting last month, your grandma's lasagna recipe, your resume, your credit card bill. I listened to an entire podcast about my 401k. As one of the Deep Dive hosts put it,
It's pretty amazing what people are doing with it.
But there are other reasons you might want to turn information into an audio conversation. Maybe you're an auditory learner and prefer listening to reading. Maybe you find podcasts more engaging than pitch decks. To dig into how it works, the potential pitfalls of those convincing AI voices, and why you still can't trust everything they say, I spoke with Deepa Seetharaman, who covers AI for the Journal. That's after the break.
How does Notebook LM take whatever you feed it and churn out an entertaining 10-minute chat show? To answer this question, and to look at the future of Deep Dive's AI voices, I spoke with Wall Street Journal AI reporter Deepa Seetharaman. Hi, Deepa. Hey, Ben, how are you? I'm OK. I feel like we are now the hosts of the Deep Dive podcast. We sound like it.
Okay, so let's get into how this works. I spoke with some people over at Google Labs, which is a division within the tech giant that incubates and builds these AI-powered products like Notebook. They explained to me that when you upload your source material, which is just like a link or a PDF or a document or anything you want it to be, Notebook instantly digests it and spits out a basic written summary. What does digests mean in this context? Like what is the tech actually doing?
Large language models are basically prediction machines, so some people call them fancy autocomplete, where they run tons and tons of calculations to predict what the next most likely word is. So if you type in Ben into Google, obviously the most likely next word would be
Cohen, right? Yeah, totally. I mean, that's obvious. That's how it produces these long sentences and these huge blocks of text that sound like humans wrote them, because these models have analyzed a whole lot of human text and figured out what is most likely to come in certain contexts. So then you could see how that same technical skill can be used for things like
Summarization. And so there are a lot of different types of summarization. There's extractive summarization, where it identifies specific sentences and phrases from the original text that seem really important. And then abstractive summarization, where it generates new phrases and sentences that capture the ideas presented there. But either way, all of this is these models underneath these products
taking the text, summarizing and distilling it, and making it shorter, snappier, and more digestible. For Deep Dive specifically, the Google Labs people told me that under the hood, there's a model that is writing and constantly editing a script for the conversation, based on the goal of being entertaining and bringing out insight. So it's not just summarizing, but specifically pinpointing the stuff that is most interesting and most surprising. How does that work? Is it bringing out new thoughts here?
It's kind of, I mean, that's sort of what this abstractive summarization idea is about. It's looking at the text and identifying what humans typically find very interesting about it, bringing that out, and then using math, again, to figure out the best way to actually take that text and summarize it and "think" about it. And I'm putting "think" in air quotes, which you can't see on a podcast, but I'm doing that.
It's just layers of math being used to figure out what the most likely, and the most liked, version of the sentence and these sorts of summaries would be. And sometimes it's using the language that you would see in the text itself, and sometimes it's using, you know, other language that never appeared in the original.
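To make Deepa's "fancy autocomplete" point concrete, here's a toy sketch in Python. It is not what Notebook LM actually runs, just the smallest possible version of the idea: count which word tends to follow which in some text, then predict the most frequent successor.

```python
from collections import Counter, defaultdict

# A miniature "fancy autocomplete": learn which word follows which
# in a tiny corpus, then predict the most common successor.
corpus = "ben cohen writes a column . ben cohen hosts a podcast .".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    """Return the most likely next word given the counts (a one-step bigram model)."""
    return successors[word].most_common(1)[0][0]

print(predict_next("ben"))  # -> "cohen", because that's all this corpus ever saw
```

And the extractive summarization Deepa describes can be sketched just as crudely: score each sentence by how many frequent words it contains and keep the top scorers. The function name and scoring rule here are invented for illustration; real systems, and the abstractive kind especially, are far more sophisticated, but the shape of the idea is the same.

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Keep the k sentences containing the most frequent words --
    a bare-bones extractive summarizer (note: it favors long sentences)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    return sorted(sentences, key=score, reverse=True)[:k]
```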
Yeah, but it can also make mistakes. We've all heard of, and maybe seen, the hallucinations that LLMs have managed to produce, and it is capable of bias and of drawing connections and inferences that don't work, or maybe don't exist, or don't make sense. This particular product seems to be better at avoiding that than other viral AI products, in part because it's based on the source material alone. But why is it still making mistakes?
It's going to be hard to completely avoid mistakes in any large language model based product. It's never going to be zero. And that's partly because, at their base, these large language models will produce what they think is the most probable answer, not the correct answer.
So it's an answer that sounds right, but maybe isn't right. And there are things you can do on top, like grounding it in a specific text, that can definitely cut hallucinations. But it never, and research has shown this time and time again, it never actually gets rid of them 100%. It could still happen, because it's still operating on that principle. It's not really trained to understand what the right thing is. It's trained to think about the most probable thing.
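A toy illustration of that "probable, not correct" point, with invented numbers: imagine the distribution a model assigns to completions of a prompt. Decoding returns the likeliest string, and if the training text repeats a misconception often enough, the likeliest string is the wrong one.

```python
# Hypothetical probabilities a model might assign to completions of
# "The capital of Australia is ...". The numbers are invented; the point
# is only that decoding picks the most probable answer, not the true one.
completions = {
    "Sydney": 0.46,    # a common misconception, heavily represented in text
    "Canberra": 0.41,  # the correct answer
    "Melbourne": 0.13,
}
prediction = max(completions, key=completions.get)
print(f"The capital of Australia is {prediction}")  # -> Sydney. Confident, and wrong.
```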
Okay, let's talk voices. They are not perfect, but they're really close. Like, really close. What can make an AI voice sound human? Human voices. I mean, really studying human voices again and again. It's a volume game. A lot of these models are trained on gigantic corpuses. It's
just listening to human voices over time, and it starts to sound a little bit more human. You mentioned it's not perfect, and it's definitely not. There are certain things that humans do, like pause
at awkward moments, or, you know, just, uh, say a lot of ums, and start and then restart sentences. Basically, the way I'm speaking this entire time is a very human way of speaking, because it's sort of imperfect, and you don't see that so much with these voices. They don't say "sort of" as much. They don't say "like" as much. They start a thought and end it all at once, which
I don't think a lot of humans do, right? Totally. So the Google Labs people told me that they actually programmed all of those ums and pauses and likes into the audio conversations from Deep Dive. Yeah, because they understand that the human ear has evolved over the years to expect those things, right? If the voices spoke like computers, in these perfect sentences, nobody would want to listen to that, right? And so in order to make it
more listenable, they also had to increase the noise, as it's called, right? And there's actually a word for this: they're called disfluencies. You need more of these disfluencies in these conversations to actually make them sound like conversations. In order to make it better, you have to make it slightly worse.
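You could picture that disfluency step as one pass over the clean script before it's voiced. This is a guess at the shape of the idea, not Google's actual pipeline; the function and the filler list are hypothetical.

```python
import random

def add_disfluencies(line, rate=0.15, seed=None):
    """Sprinkle filler words into a clean script line before it's voiced.
    A hypothetical sketch of the 'make it slightly worse' step."""
    fillers = ["um,", "you know,", "like,", "I mean,"]
    rng = random.Random(seed)
    out = []
    for word in line.split():
        if rng.random() < rate:  # occasionally stumble before a word
            out.append(rng.choice(fillers))
        out.append(word)
    return " ".join(out)

print(add_disfluencies("They really figured out how to make AI sound human.", seed=7))
```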
It's interesting you mention noise, too. There's something about the frequency at which an AI voice speaks that's different from the way humans speak. So there are certain techniques you can use where, even if you and I can't hear the difference, software can, and it can tell human from AI. But even for those kinds of detection mechanisms, it's getting harder and harder to tell the difference.
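One rough way to picture what "software can hear it" means: compare how much of a clip's energy sits in parts of the spectrum a listener barely notices. The cutoff, the signals, and the feature below are all invented for illustration; real detectors use far richer features than this.

```python
import numpy as np

def high_band_energy_ratio(signal, sample_rate, cutoff_hz=8000):
    """Fraction of spectral energy above cutoff_hz -- a toy feature."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return spectrum[freqs >= cutoff_hz].sum() / spectrum.sum()

# Stand-ins for a messy human recording and an unnaturally clean synthetic one.
rate = 44_100
t = np.linspace(0, 1, rate, endpoint=False)
human_like = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(rate)  # broadband noise
synthetic_like = np.sin(2 * np.pi * 220 * t)                            # a pure tone

print(high_band_energy_ratio(human_like, rate))      # noticeably above zero
print(high_band_energy_ratio(synthetic_like, rate))  # essentially zero
```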
I fed Notebook LM my own piece about Deep Dive. And there's this part where they are talking about it, and saying that Google has managed to make the AI voices sound human. And that part goes like this: They really figured out how to make AI sound human. And that makes the whole experience so much more enjoyable. I mean, you kind of get drawn into the conversation, like you're hanging out with friends or something. But let's be real, this technology can't be perfect.
Wow, weird. So there's something super uncanny here that makes me a little bit uncomfortable, but also really fascinated. What is your reaction? There's something to that. There are a lot of AI ethicists who talk about this. And one of the main
principles of all these AI products is supposed to be, you know, you're talking to an AI system. It's not supposed to cosplay. It's not supposed to be human or pretend to be human. And so that little moment is interesting, because it's both
weird, right? It kind of sounds not human, but also it's pretending to be human. And that, I think, straddles that line between disclosure and hiding behind
something. How would you want this system to handle this interaction in the future? All of this is pretty mind-blowing. You are in this every day. I come in every now and then and have my mind blown by it. But what does this mean? Is this a toy? Is it a peek into the future? Is it somewhere in between? How are you thinking about this? Sure. I think it can be both. I think it can be a toy and a peek into the future and a warning sign.
Before I covered AI, I covered social media for eight years. And I think that when you look at the lessons of social media, which are, it's incredible. It's a great way to connect with people that you don't normally talk to. There's a lot of good that can come from that. And then there's
also quite a bit of bad, right? We've also seen that globally, in terms of disinformation and all kinds of things that make you kind of wonder to what extent this broader project is worth it. And I think AI has a really similar potential, in the sense that it is really useful and, like, we're enjoying it.
But we're also going to have to navigate the future where we're developing relationships with basically math, like software systems. And we're going to have to figure out what the guidelines are for those kinds of relationships. It just creates a whole flurry of fascinating questions that I do think are very much a part of the future. But yes, it is also a toy that's fun to play with.
Deepa, it was delightful to have a conversation with you, an actual human being. Thank you for making us smarter about this. Thank you for having me.
And that's The Science of Success. This episode was produced by Charlotte Gartenberg. Michael LaValle and Jessica Fenton wrote our theme music. We had help from Danny Lewis. I'm actually Ben Cohen. Be sure to check out my column on WSJ.com. And if you like the show, tell your friends and leave us a five-star review on your favorite platform. Thanks for listening.