Hello and welcome to this Fast Company Podcast, The AI Revolution: Why Data Storage Is the Hidden Hero. I'm your host, Abigail Bassett. The AI revolution hinges on data storage, which is frequently overlooked. In fact, much of the AI out there is currently being built on an old technology: hard drives.
It takes an estimated five hard drives to support just one GPU. And in a recent study by our partner Solidigm, 57% of users said that data storage is one of the biggest business challenges for AI. Solid state drives, or SSDs, could help reduce this load significantly by reducing storage bottlenecks and making AI pipelines faster, more efficient, and more future-proof. Joining me to discuss all this and more are a pair of experts in the space.
Avi Shetty is the Senior Director of AI Ecosystem and Partnerships at Solidigm. And Sachin Gupta is the VP and General Manager of the Infrastructure and Solutions Group at Google Cloud. Gentlemen, welcome. So Avi, I want to start with you. Let's talk about where we are right now when it comes to AI solutions and data storage. What are some of the most significant challenges that we're facing in this space?
Thank you, Abigail, for having me. Yeah, currently data storage, if you look across the industry, is typically tiered into a big, massive pool of hard drives with a small caching layer of SSDs in front of it to increase performance.
This approach typically works to a point. But as we know, with AI workloads getting extremely complex, data sets continue to explode in size, and it's going to take an awful lot of HDDs, which typically come in at 24- or 30-terabyte density points, to store all of this data. Switching to high-density, cost-effective SSDs, with capacities now reaching 122 terabytes, eliminates the need for this tiering, saves a lot of cost, and improves the overall TCO.
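To put rough numbers on that density point, here is a quick back-of-the-envelope sketch in Python. The 2 PB dataset size is a hypothetical figure chosen for illustration; the drive capacities are the density points Avi mentions.

```python
import math

# Back-of-the-envelope drive-count comparison for a hypothetical 2 PB data set.
# The HDD and SSD capacities are the density points mentioned in the discussion;
# the data set size itself is an assumption for illustration only.
DATASET_TB = 2_000        # 2 PB of raw training data (hypothetical)
HDD_CAPACITY_TB = 24      # high-capacity nearline HDD
SSD_CAPACITY_TB = 122     # high-density QLC SSD

hdds_needed = math.ceil(DATASET_TB / HDD_CAPACITY_TB)   # 84 drives
ssds_needed = math.ceil(DATASET_TB / SSD_CAPACITY_TB)   # 17 drives

print(f"HDDs needed: {hdds_needed}")
print(f"SSDs needed: {ssds_needed}")
print(f"HDDs replaced per SSD: {hdds_needed / ssds_needed:.1f}")  # ~4.9
```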
In effect, if the HDDs are short-stroked for performance, you can replace between five and ten HDDs with a single 122-terabyte SSD, achieving a massive performance increase at the same time. Current data solutions are also challenged at different stages of the AI pipeline. At data ingest, where you're handling large volumes of diverse data, structured or unstructured, you need a low-latency, high-throughput device, which an SSD can help with.
Data preparation, where you're looking at data quality, consistency, and relevance while dealing with missing values, outliers, and diverse data formats, again involves very random-access-heavy workloads, where SSDs are far superior to hard drives.
And when it comes to exploration and analysis, where you're examining the prepared data, understanding its characteristics, identifying patterns, and determining the most relevant features for model training, again a use case that requires sequential workloads, you see SSDs helping out. And Sachin, when we talk about AI, a lot of people think that we've sort of
moved the ball very quickly forward. What role do Google Cloud and some of the storage solutions that you offer play in the space? And what are some of the challenges that you see developing as AI continues to move forward? So with AI, the requirements for storage are quite unique, in terms of the massive data sets that you're dealing with, into the multi-petabyte range; in terms of the performance, multiple terabytes per second of reads and writes; and the fast checkpointing that you might need. All of that becomes super important.
But when you think about cloud, the number one thing users come in and think about is, where can I get GPUs and TPUs? Because those tend to be scarce, and they tend to be very costly. But then if your GPUs and TPUs are far away from where your data is, you're not going to get the performance you need. So we're very thoughtful about how we help here. For example, customers can use something we call Anywhere Cache, where instead of you having to move your data manually, we can automatically move that data
to be right next to your GPUs and TPUs, as close as possible, so that you get much lower latency, much higher throughput, and actually much lower cost to access the data and meet your needs. So that's just one example of how we're innovating in cloud to address the newer needs of AI.
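To make the locality idea concrete, here is a minimal sketch that manually stages training shards from a Cloud Storage bucket onto local NVMe before the GPUs read them. Anywhere Cache does this kind of placement automatically as a managed service; this sketch only illustrates the underlying principle, and the bucket name, object prefix, and local path are hypothetical.

```python
"""Manually stage training shards from Cloud Storage onto local NVMe so the
GPUs read from nearby, low-latency storage. (Anywhere Cache automates this
kind of placement; bucket, prefix, and local path here are hypothetical.)"""
from pathlib import Path
from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "my-training-data"             # hypothetical bucket name
PREFIX = "shards/"                      # hypothetical object prefix
LOCAL_NVME = Path("/mnt/local-nvme/shards")

def stage_shards() -> list[Path]:
    """Download any shards not already cached locally and return their paths."""
    LOCAL_NVME.mkdir(parents=True, exist_ok=True)
    client = storage.Client()
    local_paths = []
    for blob in client.list_blobs(BUCKET, prefix=PREFIX):
        if blob.name.endswith("/"):     # skip directory placeholder objects
            continue
        dest = LOCAL_NVME / Path(blob.name).name
        if not dest.exists():           # only fetch shards we don't have yet
            blob.download_to_filename(str(dest))
        local_paths.append(dest)
    return local_paths

if __name__ == "__main__":
    print(f"{len(stage_shards())} shards staged on local NVMe")
```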
Avi, why do you think that data solutions have been so consistently overlooked in this current AI revolution? Yes, development efforts and budgets are usually allocated to areas perceived to really improve model performance and accuracy. There is a general perception that storage is stable, less prone to innovation, and not directly contributing to model accuracy, so it lacks visibility relative to GPUs and high bandwidth memory, or HBM,
which have a visible impact on both cost and performance. With hyperscalers, essentially all of the innovation budget is focused on improving GPU networking at scale, training larger and larger models where memory is the focus, and building their own accelerators to be price competitive. Across the ecosystem, startups and venture capital are also focused on building foundational models and custom accelerators, and very few venture dollars are flowing into storage innovation for AI. Data storage is often overlooked because it tends to be proportionally much lower in terms of cost and energy consumption than GPUs.
But that would actually be a mistake. It's just like saying the 49ers' rushing offense doesn't matter because they usually pass more. The coach would definitely disagree, right? Both are key to the team's success. And we've seen many examples of this, where suboptimal storage can consume as much as a third of an AI cluster's power, which a lot of people don't realize. We've referenced a couple of studies from our partners.
Namely, Meta and Stanford released a joint white paper assessing the performance of Meta's AI recommendation engine and found that HDDs consumed up to 35 percent of available power during these stages, which directly constrained training capacity due to fixed data center power budgets. In a similar study in the same domain, Microsoft Azure and their university partner, Carnegie Mellon, released a white paper which stated that 33 percent of operational power consumption in Azure's general-purpose cloud was related to storage, and the conclusion they called out was that the best way to improve storage efficiency is to increase density per drive so that you have fewer of them.
As we see data storage get smarter and more local, to your earlier point, what kind of implications will that have for the people who train these models and use these models on a regular basis, as a result of some of these new solid-state solutions? I think it's important for customers to decide, based on their use case, what is the goodput they're looking for, and to truly understand the options that are available to them.
I'll give you just a few quick examples. One is that most customers will try to use object storage, and Google Cloud Storage gives you massive-scale object storage at a very low cost. But for their training job, they actually need a file API. The fact that we can support FUSE on top of Cloud Storage gives you a file API. And we also support something called hierarchical namespaces, so that the object storage actually has a more tree-like, more file-like structure.
That's an option that people may not know about, but it can give them a significant benefit when trying to do a training job. Another customer might say, you know what, I actually need extremely low-latency storage. And so we were a founding member of DAOS, which provides a parallel file system, where we can provide, again, a high-scale, extremely low-latency parallel file system for the jobs that specifically require that.
And then one point I just wanted to make is that it is very important to make sure that the GPUs and TPUs that you do have are fully utilized. If that GPU or TPU is waiting on storage and is at 15, 20, 35, 50 percent lower effective throughput, that's a problem, because you're just wasting money at that point. We have a lot of great solutions. For example, through our block storage, we provide something called Hyperdisk ML.
It sounds really simple, but it makes it really quick to load a model into a GPU for inferencing.
So imagine you have to bring up 1,000 GPUs with the latest model to start serving, to start inferencing. Leaving them idle is actually not good for you. So really understanding the landscape of options that are available, whether you're training or inferencing, whether you need extremely low latency, what exactly you need, and then mapping that to the right solution can have significant power, cost, and performance benefits for our customers.
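As a small illustration of the file-API point above: once a bucket has been mounted with Cloud Storage FUSE (for example, `gcsfuse my-training-bucket /mnt/gcs`), a training job can read objects with ordinary file operations. This is a minimal sketch; the mount point, directory layout, and file extension are hypothetical.

```python
"""Minimal sketch: read training data through a Cloud Storage FUSE mount.
Assumes the bucket was already mounted, e.g. `gcsfuse my-training-bucket /mnt/gcs`.
The mount point, subdirectory, and *.tfrecord layout are hypothetical."""
from pathlib import Path

MOUNT = Path("/mnt/gcs")  # FUSE mount point (assumed)

def iter_training_files(subdir: str = "prepared"):
    """Walk the mounted bucket with plain file APIs -- no object-storage SDK calls."""
    yield from sorted((MOUNT / subdir).rglob("*.tfrecord"))

if __name__ == "__main__":
    for path in iter_training_files():
        size_mb = path.stat().st_size / 1e6
        print(f"{path.name}: {size_mb:.1f} MB")
```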
And really, what we're talking about is sort of an infrastructure issue, right? Making sure that our data storage is close to the source, or that we're using the right type of data storage for whatever training or activities the AI models are being used for. Avi, how does infrastructure really need to change or be rethought in this new space? How do we need to scale our infrastructure in order to meet these new data demands?
For the next generation of functionality, AI models are relying on bigger and bigger data sets, and this creates essentially two problems, performance and density, which SSDs are in a position to solve.
High-performance, low-latency storage is needed to ensure, as I think Sachin rightfully said, that you keep your GPUs busy during those long training cycles. And on the other end, high density means that, all else being equal, you're using fewer devices, so you save on power, rack space, cooling requirements, and more. The new shared-everything storage architectures allow users to expand their storage infinitely and adjust their performance dynamically as required.
These architectures require very high-density storage devices, such as the one Solidigm just released, our latest 122-terabyte QLC SSD in a single form factor, which allows users to squeeze far more data into a smaller space while ensuring throughput to the GPU and keeping power low.
But storage innovations alone are not enough when it comes to rethinking infrastructure. Along with storage innovations, the industry needs cross-technology collaborations and partnerships, as AI systems bank on two major vectors: you either scale up or you scale out.
New open interconnect standards are emerging, like UALink for scale-up and Ultra Ethernet for scale-out, and it's key that industry partners across networking, memory, and compute collaborate with storage manufacturers here. Companies are facing inefficiencies within their GPU clusters, often not fully utilizing their potential and leaving significant resources underused. Memory, on the other hand, has kind of plateaued, with limited innovations in DRAM technology.
While HBM stacking is great and provides excellent performance benefits, it comes with the challenge of yield drops that increase with every additional stack, which in turn drives up the cost of HBM. Hence, storage innovation, especially high-bandwidth, high-density SSDs that allow you to scale density while also offering high bandwidth with low latency, helps improve GPU cluster utilization and offers a much better TCO for large-scale AI deployments.
And Sachin, in your opinion, how does infrastructure have to scale up to meet our growing needs for AI and the unique demands of these models that people increasingly rely on for business and research across any number of industries? I think we've already talked about massive scale. That's only increasing. We've talked about massive throughput.
Cost management, I think, is key as part of that, and picking the right options is important. I'll actually touch on two things that sit on top of infrastructure, especially when you think of this at cloud scale, delivered as a service. The first is, how do you think about storage management for AI? If you've got version one of the data that you've used to train a model, and then you've got version two of the data that you want to train another model on, how do you manage that when you have these massive data sets? And how can we use AI to actually help you manage storage, by extracting the metadata, helping you set policies on that storage, and helping you automatically move something to colder storage if you're not accessing it frequently, so you save on cost? So I would not underestimate the value of the storage management capabilities and the advancements there to support these AI use cases. The second thing that I think is also super important is that to do this responsibly, you need to think about security. Think about these data sets that you have.
Who has access to them now? Who owns the encryption keys? Do they sit in the country that you're in? Do you have data residency requirements? And so we take great care to make sure that customers have sovereign controls over their data and over where their storage resides, so that as they're building those AI applications, training the models or inferencing,
they can be assured that all of their policies have been met, they are fully compliant, they are secure, and they have met their sovereignty goals. And so these two things, you know, sometimes you're so focused on performance and scale and cost that management and security get overlooked. But for us, from a cloud perspective, and for our customers, both of those are equally important.
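As one concrete example of the storage-management idea Sachin describes, automatically tiering data you no longer touch to colder storage, here is a minimal sketch using the google-cloud-storage Python client. The bucket name and the 90-day threshold are assumptions for illustration, not details from the conversation.

```python
"""Minimal sketch: add a lifecycle rule that transitions objects more than
90 days old to Coldline storage. Bucket name and age threshold are hypothetical."""
from google.cloud import storage  # pip install google-cloud-storage

def add_coldline_rule(bucket_name: str, age_days: int = 90) -> None:
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    # Append a SetStorageClass lifecycle rule, then push the config to the bucket.
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=age_days)
    bucket.patch()

if __name__ == "__main__":
    add_coldline_rule("my-training-data-v1")  # hypothetical bucket name
```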
Yeah, absolutely. That actually was going to be my next question for Avi, about how this new data and storage revolution can really help enhance the security and safety of the data that we're continuing to share and create as these models are trained and built. Of course, data security for us as storage vendors is of paramount importance. We follow all the best practices for security as defined in industry specifications.
There are multiple levels of security, including encryption and drive locking, that ensure data remains protected. Solidigm works with many industry standards bodies, including TCG, the Trusted Computing Group, and OCP, the Open Compute Project,
to not only write standards, but also to make sure they're implementable and deployable at scale. We currently support AES and TCG Opal, and we're working on new specifications like KPIO, or Key Per I/O, which will eventually replace Opal for self-encrypting drives, or SEDs. We're also working on a new specification in partnership with OCP called Caliptra, for root of trust.
Those are things within the SSD itself. But in terms of the broader ecosystem, we're also seeing more and more customers deploy edge AI solutions for reasons of data security and privacy. And I have an analogy. I used to be part of the personal computing group, laptops and desktops, and we had a survey done; this was a decade back.
We're talking PCs and professional laptops in commercial environments. Even if you had a keyboard issue or a screen issue, the first thing your IT guy would do was remove the data drive from the system and then send the laptop to whichever OEM for fixing. And that essentially meant your data, your private and secured IP, did not leave your premises.
And you're seeing a similar kind of trend happening with AI as well. We're seeing more and more edge AI solutions for the same reasons. And from a broad perspective, we can't talk about data storage without talking about power management and usage. Avi, what role does that play in sort of the future of AI, especially when we talk about the power grid and power delivery?
The most important role; I think it's the elephant in the room. Power management and the power grid absolutely play a critical role. We've seen numerous public announcements from Amazon, Microsoft, and Google about their strategic shift toward nuclear energy as part of their broader commitment to sustainability and energy efficiency, signaling a potential change in how our industry addresses its energy needs.
While we try to solve the power grid, I've been part of many conferences recently, and what you see is an emerging trend of cooling technologies, which are essential for AI deployments, particularly in data centers where high-performance computing generates significant heat. You know, gone are the days when air conditioning, HVAC, just cooling the air in your data center, was enough. We are now talking about liquid cooling.
You know, it involves circulating a coolant directly over hot components like GPUs and CPUs. We are now also looking at immersion cooling, where the entire server is placed in a tank filled with a non-conductive liquid that absorbs heat. My point here is that power is the most important topic in AI infrastructure and is critical to any component and any decision here. Quoting Sam Altman from OpenAI, I distinctly remember his words: he said there are a lot of parts of AI that are hard, and energy is the hardest one. That kind of summarizes this whole problem. Power is one of the most critical pieces, and storage, with us offering SSDs and their efficient power profile, helps in deploying large-scale AI infrastructure with reduced power and better overall TCO compared to hard drives.
And Sachin, I know Avi mentioned that Google has talked about nuclear power and some of the ways to get more power for these models to continue to meet the needs. From your perspective, what role does building out the power grid and improving its reliability, improving its power delivery really have in the future of AI? I think all of the items you mentioned are important.
There are many, many aspects to this. Google has been building data centers for more than two decades now, and we've taken that experience to really optimize for energy efficiency. We, in fact, believe that Google data centers are 1.8 times more energy efficient than a typical data center. And so there's a lot of science that goes into how you build. I mean, we've been doing liquid cooling for many, many years; we're on our sixth generation of TPUs, for example. I think, coupled with that, all of the investments,
whether it's solar, wind, or nuclear, working with grid providers so that there's reliability and growth, and looking at the locations where we should be investing, all of that becomes important. We're committed to our very strong, and I would say leading, sustainability goals. I think it was
2017 when we said we're going to match 100 percent of the electricity consumption from our operations with renewable energy purchases, and we have this ambitious goal that by 2030 we are carbon-free, 24/7. There's a ton of proactive investment that we make, and we work with the community and with governments in order to continue to maintain our focus on sustainability.
And one of the things that we also should talk about, especially as we talk about data centers and storage, is of course the economic impact that these data centers can have on a local community. Sachin, how does Google approach this question of the local economy in which you decide to build these data centers?
The benefits that we're able to drive start from the very bottom: the people you need to bring in to build the data centers, to build the power, and to operate these data centers. But actually, a huge amount of the impact comes from the AI solutions, the data solutions, and the security solutions that you're able to enable for that community, for that country, for that overall region. And that is typically billions or tens of billions in impact. And so with our data center investments, it's about enabling all of those services.
It's about partnering with local government on things like training, on things like optimization for local language support with our AI solutions. And so it goes well beyond just the people you need for the construction and operations. It's about impacting the overall economy and all of the people in that region.
Right, so it's well beyond just the borders of where that data center is located. And Avi, how do you think about that economic component when you think about offering solid-state data solutions for AI?
For us, it's all about efficiency. Data centers require efficient infrastructure, we believe efficient infrastructure comes with efficient storage, and efficient storage comes with Solidigm. We are focused on providing high-density SSDs to our customers and partners, which allow for efficient power consumption in terms of the storage components themselves, as well as contributing to the power efficiency and overall TCO of the infrastructure compared to legacy storage solutions.
You know, AI is such a hot topic of conversation these days, and it's been really revealing to learn how crucial data storage considerations are, especially when it comes to this fast-advancing tech. I want to thank both our partner Solidigm and my guests, Sachin and Avi, for such a lively discussion. On behalf of Fast Company, thank you so much for joining us today. Thank you, Abigail. Thank you.