Hello, everybody, and welcome to the Stack Overflow podcast, a place to talk all things software and technology. I'm Ryan Donovan, your host for this episode. And today we are pondering the question: how are AI apps like Google search? It's a mystery, and our guest for the question today is Daniel Loreto of Jetify. Hi, Daniel. How are you today? Hi, Ryan. I'm doing great. Thank you for having me.
So at the beginning of every episode, we like to get to know our guests and see how they got into software and technology. Can you tell us a little bit about how you got to where you are today?
I don't know how far back you want me to go. I guess as a kid, I was always interested in technology. My dad brought a PC home, an IBM PC, an old one with floppy disks and no hard drive, all that kind of stuff. Sure. I just got really interested in it, and that's how I got into tech. Since then, I went to college to study computer science and ended up working at Google.
I worked at Airbnb, I worked at a few startups, I worked at Twitter, so I've been through several big technology companies and several small ones as well. And then you founded Jetify. When was that? Yeah, it was about three years ago that we founded Jetify.
At a high level, we want to simplify what it takes to develop applications on the cloud. We've been focused on developer environments so far, but we have some products coming soon that are all about AI agents that help with software development and, in many ways, drive those developer environments to do tasks on behalf of developers.
So I'm sure a lot of folks are wondering how AI apps are similar to Google search. Offhandedly, I would say they're both pretty opaque to the user, a lot of rainmaking and hand waving. But can you tell us where you're coming from on that?
Yeah, absolutely. I worked on Google Search back in 2005, 2006. At the time, LLMs didn't exist and deep learning wasn't a thing. Google Search was mostly, I'd say, a very sophisticated but rule-based system. But what I see as the common thread between the work we used to do there in search quality and AI applications today is that they are data-driven systems
where it's very hard to predict the behavior just by thinking through the logic that you wrote. And if you think about AI and LLMs, there's a lot of non-determinism. You might have a very open-ended set of inputs, right? You're allowing your users to say whatever, and then the LLM can respond in many different ways to those inputs.
It becomes very hard to predict the behavior. And so I think you need certain tools and processes to really ensure that those systems behave the way you want them to. I feel like they're also similar in that there's a whole industry of folks trying to explain and understand them: the SEO gurus with Google search,
and the prompt engineers with AI. Do you think there's a way we can make AI apps a little less opaque for developers, in terms of understanding, testing, and getting predictable results from your AI app? Yeah, I think there are a few techniques that people should be adopting when developing AI software. Maybe the first one I'll start with is that you have to look at the data. And what that means to me is that you first need to store somewhere
traces of how the system responds to different inputs. Your user might type something, and internally, let's say you're building an AI agent or a workflow, the system is taking several steps, and throughout those steps it's creating certain prompts that talk to an LLM. I think you need a system of record where you can see, for any session, exactly what the end user typed, exactly what prompt your system created internally, exactly how the LLM responded to that prompt, and so on for each step of the system or the workflow. That way you get in the habit of really looking at the data that is flowing and the steps that are being taken, as opposed to just seeing the beginning and the absolute end result and not understanding what's happening in between.
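As a minimal sketch of what that kind of trace capture might look like, here is a hypothetical Python example. The `call_llm` helper and the JSONL file used as the system of record are assumptions for illustration, not any specific product's API.

```python
import json
import time
import uuid

def log_step(session_id, step, prompt, response, path="traces.jsonl"):
    # Append one record per LLM call so any session can be replayed later.
    record = {
        "session_id": session_id,
        "step": step,
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def run_workflow(user_input, call_llm):
    # call_llm is an assumed helper: it takes a prompt string and returns text.
    session_id = str(uuid.uuid4())

    plan_prompt = f"Break this request into steps: {user_input}"
    plan = call_llm(plan_prompt)
    log_step(session_id, "plan", plan_prompt, plan)

    answer_prompt = f"Follow this plan and answer the user.\nPlan: {plan}\nUser: {user_input}"
    answer = call_llm(answer_prompt)
    log_step(session_id, "answer", answer_prompt, answer)
    return answer
```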
So when you were at Google Search, how did you tackle this problem of gathering the data and then understanding it? Like you said, it's very hard to predict the behavior just by looking at the code you yourself have written. So we actually had a few different tools that we used. The first one was: if you worked on Google Search, on any search page you could add a little CGI parameter to the URL,
and now below any result on that search page, you could see debugging information explaining exactly what steps the ranking algorithm had taken and what the scores were and why it had decided to place that result there. So this comes back to being able to understand what the system did.
Another tool we had was the ability to stand up two versions of the system, one with a new change and one without it, then automatically run a ton of search queries against both systems and automatically diff the search results that option A and option B would generate. Then you would be able to inspect those changes. So you could see,
okay, this is how my change actually changed the search results for this query. You can go through all those different queries and see for yourself. And because you had that debugging data I mentioned before, if there was something surprising, you could dive deeper and try to understand what happened. Right. And so I imagine with AI systems, one can do the same, where you replay these little diffs or experiments as you're iterating on the system, to try to understand: if I replay my end-user inputs against version one of the system and against version two, how do my outputs change? Can I inspect those, really get into the details, and understand whether it's better or worse?
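A small sketch of that replay-and-diff idea, assuming `run_v1` and `run_v2` are callables wrapping the two versions of your system:

```python
import difflib

def replay_and_diff(saved_inputs, run_v1, run_v2):
    # run_v1 / run_v2 are assumed wrappers around version A and version B of the
    # system; each takes an end-user input and returns the final output text.
    changed = []
    for user_input in saved_inputs:
        out_a, out_b = run_v1(user_input), run_v2(user_input)
        if out_a != out_b:
            delta = "\n".join(difflib.unified_diff(
                out_a.splitlines(), out_b.splitlines(),
                fromfile="v1", tofile="v2", lineterm=""))
            changed.append({"input": user_input, "diff": delta})
    return changed

# Inspect whatever changed between the two versions:
# for case in replay_and_diff(saved_inputs, run_v1, run_v2):
#     print(case["input"])
#     print(case["diff"])
```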
Did you ever run into anything really surprising in the search results, where you thought, that's not supposed to be there?
Yeah, totally. I mean, that would happen all the time. I don't think it was my change, but I remember somebody on the team working on image search. The idea came up, which I think was already used for web search results, that, hey, if somebody's clicking on an image,
that's probably a better image than the ones that are not getting clicked on. If you think about it: somebody searches for something, they see 20 images, they're probably clicking on the ones they want, right? It makes sense. So why don't we collect that click data, feed it back into the search system, and boost
the images that are getting more clicks? And when you try that, it mostly works. But later, as it rolled out and we learned more about the data, it turned out that sometimes, if you have a surprising image,
let's say, for example, and I'm making this up a little, a user searches for images of dogs and the algorithm does something really badly at the start and finds a hot dog, right? So now you're presenting 20 images, and 19 are actual dogs.
One is a hot dog. The very fact that the image is surprising might cause the user to click on it too. It's like, wait, why is this here? Out of curiosity, they might click on it. And so all of a sudden you have this kind of reinforcement loop of the signal boosting the wrong thing. Right, the unintended consequences. Yeah, exactly.
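A toy illustration of how a naive click-based boost can feed that loop; the numbers and the boost formula are invented purely for illustration:

```python
# Toy numbers: curiosity clicks make the surprising result look "better".
results = {
    "dog_photo": {"impressions": 1000, "clicks": 120},
    "hot_dog_photo": {"impressions": 1000, "clicks": 180},  # curiosity clicks
}

for name, stats in results.items():
    ctr = stats["clicks"] / stats["impressions"]
    boost = 1.0 + ctr  # naive rule: more clicks, higher ranking next time
    print(name, round(boost, 2))

# The hot dog gets the larger boost, ranks higher, attracts even more
# curiosity clicks, and the loop keeps reinforcing the wrong result.
```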
So I know a lot of people building AI apps are trying to dig into things like explainability and to trace the reasoning. With a rule-based algorithm like Google search, it's easier to say, here's why this was chosen. Are there ways to get that out of the sort of massive pile of parameters inside the LLMs? Yeah, I think what you're pointing out, just to say it more explicitly, is that in a rule-based system the rules are deterministic.
The non-determinism comes mainly from the quantity of data that goes through the system, but in the case of an LLM, the LLM itself is probabilistic, non-deterministic, it can give different responses. In fact, over time, it might drift. If you're relying on an external model,
which I'm sure many people are, that model is not controlled by you. It might be improving over time, but it might change. To me, that means, one, you need to be constantly monitoring some measure of quality, and two, you need to couple those measurements with, again, debuggability and maybe even human judgments that you do frequently, to understand whether the system continues to behave the way you intend.
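One hedged sketch of what that kind of ongoing quality monitoring could look like, assuming a saved golden set of past inputs, a `call_llm` helper, and an `is_acceptable` check (which might be a scripted comparison or a recorded human judgment):

```python
def monitor_quality(golden_set, call_llm, is_acceptable, threshold=0.9):
    # golden_set: list of (user_input, reference) pairs saved from real sessions.
    # call_llm and is_acceptable are assumed helpers.
    passed = 0
    for user_input, reference in golden_set:
        output = call_llm(user_input)
        if is_acceptable(output, reference):
            passed += 1
    score = passed / len(golden_set)
    if score < threshold:
        # e.g. the external model provider shipped a change and quality drifted
        print(f"quality dropped to {score:.0%}, inspect recent traces")
    return score
```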
The other thing I'll say is that non-determinism means that if you are creating a workflow with several non-deterministic steps, and each of those steps has some probability of producing an erroneous output, well, the errors compound, right? You're essentially multiplying the error rates across the different steps. So I tend to think you'll also want to control the quality of each step in that workflow, and maybe you need some sort of feedback loop to control the precision or the quality of the output that you're getting. It depends on the application, so I can't be too specific. Some applications might tolerate more errors being shown to the user, and the user is fine with it: oh, whatever, you're presenting five options and they get to choose. But other applications might need much higher precision. So depending on the application, you'll need to control those error rates a lot more.
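To make the compounding concrete, here is a back-of-the-envelope calculation: if each step succeeds independently with probability p, a chain of n steps succeeds with roughly p to the power n.

```python
# If each step succeeds with probability p, n independent steps succeed with
# probability about p ** n, so per-step errors compound quickly.
p_step = 0.95
for n_steps in (1, 3, 5, 10):
    print(n_steps, round(p_step ** n_steps, 3))
# 1 -> 0.95, 3 -> 0.857, 5 -> 0.774, 10 -> 0.599
```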
Yeah, in some ways, I think the non-determinism is a feature and not a bug of LLMs, right? The surprise they give you is almost part of what's valuable about them, right?
Yeah, no, I agree. I agree. In fact, as we've been developing some agents ourselves, I often enjoy seeing, like, I might run a workflow four times because I actually kind of want to see the variety of outputs. And then I end up improving my initial prompt based on what I liked about the different outputs.
So it's a little exploratory at first. My prompt might be a little vague, like, do this, and I'm just like, okay, let's see what you do, run it 10 different times. Then as I see the output, I'm like, oh, I really like how you structured the data here, or I really like how you mentioned this aspect here. I might then incorporate
those requests into my prompt and start to reduce the variability of the output. So that initial non-determinism actually helps me explore the space of possible outputs.
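A small sketch of that exploration loop, assuming a hypothetical `call_llm` helper that accepts a temperature parameter:

```python
def explore_outputs(prompt, call_llm, n=4, temperature=0.9):
    # call_llm is an assumed helper; a higher temperature gives more varied
    # samples, which is useful while you're still exploring what you want.
    samples = [call_llm(prompt, temperature=temperature) for _ in range(n)]
    for i, sample in enumerate(samples, 1):
        print(f"--- sample {i} ---\n{sample}\n")
    return samples

# After reviewing the samples, fold what you liked back into the prompt:
prompt_v1 = "Summarize this bug report."
prompt_v2 = ("Summarize this bug report. Structure the output as: "
             "Steps to reproduce, Expected, Actual, Suspected cause.")
```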
For some applications, for some uses, like you said, that non-determinism helps you as the user figure things out. But for a developer who doesn't want such a wide distribution of responses, what can they do to narrow down how an LLM responds? I think there are a few tools. Obviously, if there are cases where you don't need the LLM and you can use traditional logic in one of the steps of the workflow, you should do that. Where you do need an LLM or a machine learning model,
I think you have a few choices. One is you can be more precise in your prompting. That's what I was describing before: you can start vague and you're going to get a lot of variability in the output, but as you learn exactly what you want, you can make that prompt more precise and reduce the space of responses the LLM will produce. If you have a fairly specific use case, eventually I think you can train
your own model or fine-tune a model for that application. A lot of the variability, in my head, comes from having a general model, right? It's a double-edged sword: general models let you solve all sorts of problems, but they might also go in more different directions than a more specific model would.
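As a hedged illustration of tightening the prompt and the sampling settings, here is a sketch of a constrained classification call. The `call_llm` helper and the category names are assumptions for the example.

```python
import json

def classify_ticket(ticket, call_llm):
    # call_llm is an assumed helper; temperature=0 narrows the response space,
    # and the strict format instruction narrows it further.
    prompt = (
        "Classify the support ticket below. Respond with JSON only, for example "
        '{"category": "billing", "confidence": 0.9}. '
        "category must be one of: billing, bug, feature_request.\n"
        "Ticket: " + ticket
    )
    raw = call_llm(prompt, temperature=0)
    data = json.loads(raw)  # fails loudly if the model ignored the format
    if data["category"] not in {"billing", "bug", "feature_request"}:
        raise ValueError("unexpected category: " + str(data["category"]))
    return data
```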
Yeah, so if you have a specific use case, you've got to fine-tune it to that use case. Right. Another thing I've noticed, and we're still playing with this, is that sometimes you can have two systems that help correct each other. For example, let's take software development, which is a lot of what we're thinking about.
So you're generating some code via the LLM. In that vertical, you have the ability to try to compile the code and to try to run the code. So you might do that as an extra step in the workflow, and if you detect an error from the compiler, feed that back and try to correct, right? Now you essentially have the LLM and the compiler feeding off each other to generate something more accurate.
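A minimal sketch of that generate-compile-retry loop, assuming a hypothetical `call_llm` helper and using Python's built-in syntax check as the stand-in "compiler":

```python
def generate_with_feedback(task, call_llm, max_attempts=3):
    # call_llm is an assumed helper that returns source code as a string.
    prompt = f"Write a Python function that does the following: {task}"
    for _ in range(max_attempts):
        source = call_llm(prompt)
        try:
            compile(source, "<generated>", "exec")  # syntax check only
            return source
        except SyntaxError as err:
            # Feed the compiler error back so the model can correct itself.
            prompt = (
                f"The previous attempt failed to compile: {err}\n"
                f"Original task: {task}\n"
                f"Previous code:\n{source}\n"
                "Return a corrected version."
            )
    raise RuntimeError("no compilable code after several attempts")
```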
There's another sort of similarity between search and LLMs. I saw a sort of jokey meme Venn diagram recently that said LLMs are just technically a slow database. And search is essentially a very large database too. Is an LLM just a sort of unpredictable search?
I mean, there are some analogies, but I think fundamentally it is a different kind of system. Search, in my head, is: behind the scenes, you have an index of keywords that you're looping through to find the documents that match. Whereas with an LLM, there's that neural network and it's a lot more probabilistic. I suppose where the analogy makes sense is in terms of trying to
get the output out of the system, there's that aspect of, are you crafting the right query? There, I think, the analogy makes sense: am I typing the right keywords on Google so I can find the right page, and am I prompting the LLM the right way so that it produces the output I want? That's where I think the analogy holds.
Yeah. I mean, you could look at it as a sort of bafflingly indexed database. I guess that's fair. Yeah. So for folks who want to build LLM apps at scale, what are the lessons they can take from your search experience?
The practices you want to adopt to develop these types of systems: one, I already mentioned the explainability. Two, I do think you want to collect data sets that help you evaluate the performance of the system. And here when I say performance, I'm not talking about speed, I'm talking about the quality of the results: is it producing the output you expect or not? I think that often involves a human in the loop, not necessarily at the time the system is creating the response, but at least after the fact, to judge whether the cases the system is trying to handle were handled correctly or not. So I think you need to save those data sets, and later you can use them either for training a new model or simply to keep measuring the performance of your current system and ensuring that the precision and recall, the quality of the system, are up to the standard you want.
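A hedged sketch of that kind of offline evaluation over a saved, human-judged data set. `run_system` and the shape of the judged cases are assumptions for illustration:

```python
def precision_recall(judged_cases, run_system):
    # judged_cases: dicts with the stored input plus a human judgment of whether
    # the system should flag it ("expected": True/False). run_system is an
    # assumed callable that returns True when the current system flags the case.
    tp = fp = fn = 0
    for case in judged_cases:
        predicted = run_system(case["input"])
        expected = case["expected"]
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif expected and not predicted:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```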
So, I hear a fair amount about governance and safety, but often as kind of an add-on. Do you think things like de-biasing, preventing malicious responses, and privacy shields should be included as first-party concerns when you're building on an LLM, or are they
something you think about once you have a use-case fit? I mean, to me, it all goes back to the use case you're trying to solve. Depending on your use case, you should really care about the safety of the possible responses of your LLM. And I think it would be smart, as another best practice, to have a layer or layers in your own system where you can decide to filter out certain responses and not move forward with them.
You definitely want to avoid, depending on the use case, profanity, violence, etc. But then, say you're doing something financial with the LLM, and there are certain suggestions that would be financially dangerous, right? You might want to think about how you monitor for those and maybe even limit them when you don't have enough confidence in the output.
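A minimal sketch of such a filtering layer; the blocklist terms and the financial-risk heuristic are placeholders, and a real system might use a classifier or a second LLM call for those checks:

```python
BLOCKLIST = {"placeholder_profanity", "placeholder_slur"}  # illustrative only

def looks_financially_risky(text):
    # Placeholder heuristic; swap in a classifier or a second model in practice.
    return "guaranteed returns" in text.lower()

def filter_response(response):
    lowered = response.lower()
    if any(term in lowered for term in BLOCKLIST):
        return None  # drop the response: regenerate or fall back to a safe reply
    if looks_financially_risky(response):
        return None
    return response
```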
It's interesting you bring up layers. Do you think it's a requirement these days, if you're making an AI application, to have multiple LLMs checking responses and prompts, or is it okay to go freewheeling and have one prompt, one response?
I do think it's better to try to divide the system into smaller components that you can then measure, test, and refine individually. I mean, if you're doing something that just requires a very simple prompt,
maybe your advantage there is the UX and the stuff you're building around it, but it's not really the AI, because someone could just go to ChatGPT or Claude directly and get the same result. If you're doing a more complicated workflow, which I think is where you start creating, for a particular vertical, additional value over what you might get from using ChatGPT directly, then I think you'll want to layer the system. And by the way, layers might mean other LLMs, but it could also mean other deterministic pieces in the system that are interacting with and driving the behavior
of the LLM, right? Rule-based AI is not dead, right? Right. I think in the end, the interesting AI systems are going to be compound: you're going to have some deterministic things, some rules, and some LLMs, and you use the right tool at the right time to drive an overall workflow.
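A hedged sketch of what such a compound workflow might look like, mixing deterministic rules with a hypothetical `call_llm` helper:

```python
def handle_request(user_input, call_llm):
    # 1. Deterministic guard: reject empty or oversized input outright.
    if not user_input or len(user_input) > 4000:
        return "Sorry, I can't process that request."

    # 2. Cheap rule-based routing before touching the model at all.
    if user_input.strip().lower() in {"help", "menu"}:
        return "Available commands: status, report, ask <question>."

    # 3. LLM only for the genuinely open-ended part (call_llm is assumed).
    draft = call_llm("Answer concisely: " + user_input)

    # 4. Deterministic post-processing on the model's output.
    return draft.strip()[:1000]
```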
Right. So what are you most excited about either working on or tackling? What are the challenges and things for the future that you're excited about? I'm just very excited in general about AI agents personally. We've seen the promise of LLMs and how they can generate English and images and audio. I get extra excited when I start seeing those LLMs
coupled with tools that they can drive. The computer-use example from Anthropic, and I think all sorts of APIs and other actions, can be built into systems that essentially have an LLM driving them, with those tools allowing it to do input and output with the, quote unquote, external world.
I'm super excited about that. In fact, a lot of what we are playing with is exactly that. Right now we have, in a private beta with a few customers, a fully automated QA engineer, which is really an AI that goes through a web application, clicking around, forming testing plans,
deciding if it's seeing bugs, and then creating reports out of those findings. And so all of a sudden you have this kind of rote manual work that people hate doing, in my opinion. I've never met an engineer who loves having to click through their application 50 times, through all the different edge cases. That we can automate, and that allows developers to focus on the cool part of software development, the creative part, the part they enjoy the most.
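For flavor, here is a purely hypothetical sketch of an agent loop like that, not Jetify's actual implementation; the `browser` tool interface and the `call_llm` helper are assumptions.

```python
import json

def qa_agent(start_url, browser, call_llm, max_steps=20):
    # browser is an assumed tool with .open(url), .snapshot(), and .click(target);
    # call_llm is an assumed helper that returns a JSON decision as text.
    browser.open(start_url)
    findings = []
    for _ in range(max_steps):
        page = browser.snapshot()  # e.g. a simplified DOM or accessibility tree
        decision = json.loads(call_llm(
            "You are testing a web app. Given this page state, return JSON like "
            '{"action": "click" or "stop", "target": "...", "bug": null or "..."}.\n'
            + page))
        if decision.get("bug"):
            findings.append(decision["bug"])  # candidate bug report
        if decision["action"] == "stop":
            break
        browser.click(decision["target"])
    return findings
```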
Yeah, I mean, I see a lot of people talking about click ops and trying to automate that away. I've also heard that there's something lost by not interacting with those edge cases yourself. Do you agree? I agree, but in a certain sense, I don't think the value comes from the developer doing the manual labor. I think the value comes from observing with your own eyes
where the UI broke or where the feature broke and what was confusing about it. And so in this example I'm mentioning, we're actually making the AI create a recorded video of the steps that it takes. My hypothesis, and we'll see, is that if we show those videos to developers to explain the areas where we feel the application broke, they will still get that kind of insight into their own application by observing, and not necessarily by being the ones moving the mouse.
Thank you very much, ladies and gentlemen, for listening today. As always, we're going to shout out the winner of a badge. Today we have a Lifeboat badge to shout out. Congrats to Davo Samaria for providing an answer on the difference between pushing a Docker image and installing a Helm chart. If you are curious about that, we have a solid answer for you, and it won a badge.
I am Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog, and if you want to reach out to me, you can find me on LinkedIn. I'm Daniel Loreto, CEO and founder of Jetify. You can check out everything we're working on at our website, jetify.com, and you can find me on LinkedIn as well if you'd like to connect. All right. Thank you very much, and we'll talk to you next time.