scikit-learn & data science you own
November 19, 2024
TLDR: Discusses scikit-learn, a widely used data science tool for creating classifiers, time series analyzers, and dimensionality reducers; Probabl is stewarding the project along with other open source initiatives; Yann Lechelle and Guillaume Lemaitre share insights on the company's vision for scikit-learn's future.
In this episode of Practical AI, hosts Daniel Whitenack and Chris Benson delve into the world of data science with Yann Lechelle and Guillaume Lemaitre from Probabl, focusing on the beloved open-source library, scikit-learn. With the rise of generative AI, discussions around traditional data science methods are crucial, as scikit-learn remains a cornerstone for many data scientists. This summary highlights the key takeaways from the conversation, emphasizing the library's relevance and the vision of Probabl.
What is Scikit-Learn?
- Foundation for Data Science: Scikit-learn has emerged as a fundamental tool for data scientists, enabling them to build classifiers, time-series analyzers, and more. Its simple Python interface to machine learning algorithms has made it accessible and widely adopted, with cumulative downloads surpassing 1.5 billion.
- Core Functionality: Scikit-learn focuses on two main functions: fitting data and making predictions. This makes it particularly suitable for regression, classification, and clustering tasks, laying the groundwork for effective data analysis.
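The fit/predict pattern described above can be shown in a few lines. This is a minimal sketch on synthetic data, not code from the episode:

```python
# A minimal sketch of scikit-learn's fit/predict pattern on synthetic
# data (illustrative only; this example is not from the episode).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small synthetic tabular classification dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn estimator exposes the same two-step API:
clf = LogisticRegression()
clf.fit(X_train, y_train)         # learn parameters from training data
preds = clf.predict(X_test)       # predict labels for unseen rows

print(preds.shape)                # (50,) -- one prediction per test row
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```

The same two calls work unchanged whether the estimator is a linear model, a random forest, or a clustering algorithm, which is the uniformity the episode credits for the library's adoption.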
Probabl: The Company Behind Scikit-Learn
- Origins of Probabl: Probabl is a spinoff from the research institute Inria in France, designed to advance the scikit-learn project and other open-source technologies. Yann Lechelle emphasizes the company's commitment to community-driven development and sustainability.
- A New Model of Open Source: Probabl operates with a mission to create open-source technologies that are resilient against commercial pressures. They focus on maintaining the integrity of scikit-learn while making improvements and adding features to benefit users.
The Vision of Probabl
- Stewardship of Open Source: Yann discusses the challenges of maintaining a business model within the open-source realm. Probabl's governance structure supports long-term commitments to its mission, ensuring that scikit-learn and similar projects remain open and community-driven.
- Growth and Innovation: With nearly 10 dedicated team members for Scikit-learn, the focus is on enhancing functionalities, such as better tools for model introspection and integration with modern data tools.
The Future of Scikit-Learn in a Changing Landscape
- Integration with Generative AI: As the discussion highlights, Generative AI brings new paradigms to data analysis, but Scikit-learn remains relevant. Yann argues that many traditional applications, such as fraud detection and predictive maintenance, will continue to rely on Scikit-learn due to its cost-effectiveness and established reliability.
- 80/20 Rule: Estimates cited in the episode suggest that 80% to 95% of machine learning use cases still rely on scikit-learn, thanks to its robust algorithms and ease of use.
Practical Applications and Use Cases
- Real-World Applications: Guillaume shares insights into various Scikit-learn applications - from healthcare for disease detection to financial services for fraud detection. The versatility and established methods make it crucial for real-world scenarios where predictive accuracy and efficiency are paramount.
- Community Contributions: With Scikit-learn and its associated libraries being open-source, community contributions play a vital role in its evolution. The podcast encourages developers to start contributing, whether through coding or improving documentation.
Conclusion: A Promising Future
As discussions turn towards future aspirations, both Yann and Guillaume express optimism about onboarding new contributors, enhancing functionality, and staying relevant amid rapid technological change. With open-source ideals woven into the company's DNA, Probabl aims to be a leading force in providing free, accessible, community-oriented data science tools.
In summary, scikit-learn continues to be an essential tool for data scientists, and with the innovative stewardship of Probabl, it promises to remain relevant, efficient, and influential in the ever-evolving landscape of AI and data science.
This article provides key insights and reflections from the Practical AI podcast episode, showcasing the vital role of Scikit-learn in data science and the dedicated team behind it.
Stay tuned for more discussions on AI and data science topics.
Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts. Thanks to our partners at fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.
Okay, friends, I'm here with a new friend of ours over at Timescale, Avthar Sewrathan. So Avthar, help me understand: what exactly is Timescale? So Timescale is a Postgres company. We build tools in the cloud and in the open source ecosystem that allow developers to do more with Postgres, using it for things like time series analytics and, more recently, AI applications like RAG and agents.
Okay, if our listeners were trying to get started with Postgres, Timescale, AI application development, what would you tell them? What's a good roadmap? If you're a developer out there, you're either getting tasked with building an AI application, or you're interested in just seeing all the innovation going on in the space and want to get involved yourself. And the good news is that any developer today can become an AI engineer.
using tools that they already know and love. And so the work that we've been doing at Timescale with the pgai project is allowing developers to build AI applications with the tools and with the database that they already know, that being Postgres. What this means is that you
can actually level up your career. You can build new, interesting projects. You can add more skills without learning a whole new set of technologies. And the best part is it's all open source. Both pgai and pgvectorscale are open source. You can go and spin it up on your local machine via Docker, follow one of the tutorials on the Timescale blog, and build these cutting-edge applications like RAG and agents without having to learn 10 different new technologies, just using Postgres and the SQL query language that you probably already know and are familiar with.
So, yeah, that's it. Get started today. It's the pgai project: just go to any of the Timescale GitHub repos, either the pgai one or the pgvectorscale one, and follow one of the tutorials to get started with becoming an AI engineer just using Postgres.
Okay, just use Postgres, and just use Postgres to get started with AI development: build RAG, search, AI agents, and it's all open source. Go to timescale.com slash AI, play with pgai, play with pgvectorscale, all locally on your desktop; it's open source. Once again, timescale.com slash AI.
Welcome to another episode of Practical AI. This is Daniel Whitenack. I am the CEO at Prediction Guard, and I'm joined as always by my co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin.
How you doing, Chris? Doing very well today, Daniel. How's it going? It's going great. I was saying that I'm really pumped to be talking about something that's been near and dear to my heart over many, many years, because today we have with us Yann, who's the CEO at Probabl, and Guillaume, who's an open source engineer at Probabl. Welcome. Thanks for having us.
Well, Yann and Guillaume are working on data science that you own, including projects like scikit-learn, which is, of course, very near and dear to me, along with other data scientists all around the world. So Yann, if you could, since you're coming from the CEO perspective, help us understand a little bit, maybe for those that have
heard of scikit-learn or some of the other projects that you're involved with, but haven't heard of Probabl. If you could give us a sense of what Probabl is. As you mentioned in the lead-up to this conversation, it's a slightly different kind of company that came about in different sorts of ways than other types of startups. So yeah, if you could give us a little bit of context, that would be great.
Well, very glad to be on the show with you today. And Probabl is a company that is typically known as a spin-off from a research center in France called Inria. And Inria is the place where this technology, scikit-learn, has been developed over the past 10, 15 years.
Not many people know that, and the project has been somewhat protected and sort of incubated within that research center. And after all that time, as you know, scikit-learn has been adopted by, or even probably participated in creating, the field of data science, because it is applied math and essentially has created a sort of paradigm for
how data scientists approach data science, typically through two functions, fit and predict. And the French government has a national strategy for AI, like many, many countries. And the government decided to double down on scikit-learn. And they came up with a budget. They entrusted the research center with that budget. But then they also asked for the project to be break-even at some point.
And the team said, OK, break-even is fine, but we don't do that in the research center. We don't break even. So why don't we call an entrepreneur to try and help us figure it out? And they called me. So I have a track record as a software engineer and an entrepreneur in tech for the past 25-plus years.
But I'm not a data scientist. So I did my due diligence and I sort of, you know, dug deep to find out what this project was about under the hood. Is it any good? Is the community any good? And of course, scikit-learn is this quite amazing
piece of technology that every data scientist on the planet uses. I discovered that it was downloaded 1.5 billion times cumulatively, 80 million times a month, 22% in the US, only 3% in France. So this is a project that is used all over the world. And Probabl is essentially the spinoff that takes all of the team, including Guillaume here, from the research center,
and turns it into an open source company that inherited the mission that was initially given to the research center. And the mission is to build a suite of open source technologies, including scikit-learn, but above and beyond scikit-learn as well, for data science. So the scope is large, the mission is novel.
And this is what we're building, essentially. So Probabl is a one-year-old company that has already started doing many, many things. And, you know, Guillaume is the representative here for scikit-learn, this technology that is used, again, by every data scientist on the planet.
Well, this has brought up a lot of interesting questions on my end. And I really love the part of your pitch and at least how you framed it on your website and in your materials online about data science that you own and the open source side of this.
which I know from experience, there can be some interesting challenges around finding business models that really work with open source technologies. And we've seen technologies where companies start with the posture towards open source and then gradually become more closed over time. So I'm wondering from the leadership perspective, it sounds even in the way that this company was formed, that there is a posture towards stewarding the
you know, scikit-learn and these types of projects. But from your perspective, what is your posture towards stewarding these projects on the open source side? And how do you view the business element of this, to make it sustainable in the longer term? So that is the hard question, but it is the one that is important here.
Scikit-learn is a technology that is, again, applied math. It's not rocket science, but it's applied math, and it's quite intricate. The thing is, the scientific community uses it day in, day out, and everyone depends on this. So typically, when I discovered the scope of the project and the mission that was entrusted to the research center, I realized that this project is bigger than me.
Number one. Number two, the mission is to actually create more open source. In other words, in 2024, it's even more acute. Typically, big tech keeps on amassing so much power, so much concentration. And we could argue that they do not distribute as much as they should.
So that's not a judgment, but it is a fact, and scikit-learn is precisely the contrary. It actually enables so many companies to do data science. So with that in mind, before creating the company, we decided to craft a sort of architecture for the company that would respect that.
And so, you know, before we created the company, before Guillaume joined as a co-founder, before we even incorporated the company, we had a template that actually created the governance, the shareholding structure, and also leveraged a new law in France that allows us to do a sort of B Corp.
So a company with a mission, where the mission is clearly stated in the bylaws, and that mission is to create open source for data science. So in a way, we've created a sort of constrained environment that is unlike many companies, because it's by design. This company by design has created guardrails so that the governance cannot
take this company too far to the right, let's say, towards proprietary technology, or even changing the license. That's not in the cards. And we've created a sort of mechanism where, you know, if we do not uphold the mission, then we can actually lose some of the assets, such as the brand. We are the official brand operator, but the brand belongs to the research institute, right? Still. So there are many mechanisms, trigger mechanisms, that force us,
including shareholders that we would bring in, to actually bind with the mission long term. Gotcha. You've raised so many questions for me that I want to ask. I actually want to take just a moment and kind of go back, because it occurred to me as we're talking about this, for some folks listening who may have never even used scikit-learn. They might have heard the name and stuff.
And you talked about it being applied math. Could you guys expand on that a little bit for somebody who hasn't had a chance to ever actually utilize it themselves, in terms of what it's doing, and kind of catch them up to us in the conversation a little bit? And then I'm going to pepper in a few more questions, because you've got me really interested there. You hit so many topics on that last answer. Yeah. So maybe I can give a bit of background. So basically the tagline is machine learning in Python.
So let's go back to the statistical world. The simple answer is we try to do predictive modeling. We try to use mathematics to, from data, be able in the future to give an answer to a specific question, a specific paradigm. The big difference with generative AI or deep learning is just that the statistics we have there are
simple steps. They are fundamentals, and deep learning builds on those, but deep learning models are just much more, let's say, costly to train, costly in the inference stage, or not in the same scope as well. And scikit-learn is the de facto choice when you want to work with tabular data, Excel spreadsheets, data structured this way. So that's the de facto way of training, to be able to
take spreadsheets and give back some labels, or some regression, let's say. And whatever is image or NLP, let's say deep learning and transformers are more in that area. So we are more, let's say, back to what machine learning was a few years ago, but that has many, many, many applications.
Well, considering how incredibly popular and foundational it is in the data science world, could you give me a little bit of a landscape view? And I'm not sure which of you would be the right one to answer, so you guys pick between yourselves. But a little bit about how that fits into the data science landscape with AI coming in, just so that people listening can go, ah, I see how it fits into the many organizations and tools that are out there. And then after that, we'll
get back a little bit more to the organizational stuff that Yann was talking about a few minutes ago. Maybe I can answer partly, which is by giving use cases, and to see that with the partners as well that we worked with over the years,
to show where you find machine learning. For instance, machine learning can be found in healthcare, where you want to know if something works or not, or if you want to find diseases in some type of data. It could be as well fraud detection in banks, in insurance, predictive maintenance, and those types of applications that you have had for many years. So let's say the use cases are very, very large.
And what scikit-learn brings is that it is not
for just one of those use cases. I mean, it was thought from the beginning to be a general library that you can apply to any of those use cases. And to come back to, let's say, classification and regression problems, or unsupervised learning as well, you can apply it anywhere in that field. So maybe you have something more to add? Perhaps also, at the macro level, you can say there are a lot of things,
including deep learning. But to be frank, when you want to do deep learning, typically you'd go to PyTorch or TensorFlow. But for everything else, you use scikit-learn. In other words, in the great AI family of algorithms, there is machine learning, and within machine learning, you have deep learning. Within deep learning, you have other categories of algorithms, such as
Transformer-based models that lead to LLMs. So it's basically Russian dolls of sorts. And scikit-learn is the biggest provider of algorithms in the machine learning space. And in fact, if you look at the downloads typically, scikit-learn is downloaded as many times as PyTorch and TensorFlow combined.
which is crazy, because now everyone is talking about LLMs, of course, but also deep learning, because deep learning is currently in a spring state, not quite a winter yet. So of course deep learning and GenAI is a wonderful breakthrough. That being said, I like to simplify sometimes with the 80/20 Pareto distribution. So I had the intuition that 80% of the use cases out there
use scikit-learn when it comes to machine learning. People actually tell me, no, you're wrong, it's more like 95%, in terms of technology that is robust, that is tried and true, that is used to actually turn a profit or return an investment.
Banks and insurance companies: Guillaume was mentioning fraud detection. Fraud detection typically uses scikit-learn. And that actually saves money. Banks would be losing money without that. So it is actually quite essential. But again, it's applied math. So scikit-learn is only a facilitator to this category of problems.
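A fraud-style workload like the one mentioned here is typically a heavily imbalanced binary classification problem. The sketch below is a hedged illustration on synthetic data; no real banking setup from the episode is reproduced:

```python
# Hedged sketch of a fraud-style setup: rare-event binary classification
# on synthetic, imbalanced data (no real banking use case is reproduced).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# ~2% positive ("fraud") class, mimicking rare-event detection.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counteracts the skewed class distribution.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

# For rare events, recall on the positive class matters more than accuracy.
rec = recall_score(y_te, clf.predict(X_te))
print(rec)
```

Note that this whole pipeline runs on a CPU in seconds, which is the cost-effectiveness argument Yann makes for traditional machine learning in these domains.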
What's up, friends? I'm here with a friend of mine, a good friend of mine, Michael Grinich, CEO and founder of WorkOS. WorkOS is the all-in-one enterprise SSO and a whole lot more solution for everyone from a brand new startup to an enterprise, and all the AI apps in between. So Michael, when is it too early or too late to begin to think about being enterprise ready?
It's not just a single point in time where people make this transition. It occurs at many steps of the business. Enterprise single sign-on, like SAML auth, you usually don't need that until you have users. You're not going to need that when you're getting started. And we call it an enterprise feature, but I think what you'll find is that when you sell to, like, a 50-person company, they might want this. Especially if they care about security, they might want that capability. So it's more of an SMB feature for companies that are tech-forward.
At WorkOS, we provide a ton of other stuff that we give away for free for people earlier in their lifecycle. We just don't charge you for it. So that AuthKit stuff I mentioned, that identity service, we give that away for free up to a million users. One million users. And this competes with Auth0 and other platforms that have much, much lower free plans. I'm talking like 10,000, 50,000. We give you a million free.
Because we really want to give developers the best tools and capabilities to build their products faster, you know, and to go to market much, much faster. And where we charge people money for the service is on these enterprise things. If you end up being successful and grow and scale up market, that's where we monetize. And that's also when you're making money as a business. So we really like to align, you know, our incentives across that.
So we have people using AuthKit that are brand new apps, just getting started, companies in Y Combinator, side projects, hackathon things, you know, things that are not necessarily commercial focus, but could be someday they're kind of future proofing their tech stack by using WorkOS.
On the other side, we have companies much, much later that are really big, who typically don't like us talking about them, their logos, you know, because they're big, big customers. But they say, hey, we tried to build this stuff, or we have some existing technology, but we're sort of unhappy with it. The developer that built it maybe has left. I was talking last week with a company that does over a billion in revenue each year. And their SCIM connection, the user provisioning, was written last summer by an intern who's no longer, obviously, at the company. And the thing doesn't really work.
And so they're looking for a solution for that. So there's a really wide spectrum. We'll serve companies that are in a, you know, their offices in a coffee shop or their living room all the way through. They have a, you know, their own building in downtown San Francisco or New York or something. And it's the same platform, same technology, same tools on both sides. The volume is obviously different and sometimes the way we support them from a customer support perspective is a little bit different. Their needs are different, but same technology, same platform.
just like AWS, right? You can use AWS and pay them $10 a month. You can also pay them $10 million a month. Same product, or more for sure, or more. Well, no matter where you're at on your enterprise-ready journey, WorkOS has a solution for you. They're trusted by Perplexity, Copy.ai, Loom, Vercel, Indeed, and so many more.
You can learn more and check them out at workos.com. That's W-O-R-K-O-S.com. Again, workos.com.
So, Yann, you were kind of already going there, and I love the direction that you're going with this, but I think maybe I could tee up a softball for you here, because I'm personally passionate about the answer to this question and you probably have a better view on it.
But there might be people out there maybe listening to this podcast who are thinking, well, now that we have gen AI, we have large language models, I could put in a prompt to one of these models to do fraud detection or to find entities in text.
or to make some prediction of a classification. And sometimes that works. And so maybe there's people thinking, well, there's these general purpose large models out there. How does that change the way that something like scikit learn plays in industry?
And I personally would argue and think that this actually makes scikit-learn more valuable, if anything, rather than less valuable, in terms of the ways that it can be combined, even as a tool that's orchestrated with GenAI models. But I'm curious about your perspective on this from the business side, and maybe Guillaume has some ideas on the technical side.
Yeah, so scikit-learn typically is this one technology that is patrimonial. In other words, it belongs to everybody. In fact, there's another stat when you look at the figures that are public: the number of dependencies. So scikit-learn is actually used by nearly 900,000 projects on GitHub.
So there's nearly a million projects that depend on scikit-learn. And there's a law that I discovered recently, someone mentioned it, the Lindy effect, which says that something that's been used long enough will remain important for long enough.
So I'm not saying that scikit-learn will go the way of COBOL, but scikit-learn is here to stay, and we are, with the community, the guardians of that. So we're going to make sure that scikit-learn remains there forever for companies that actually need it, in a stable version. And of course, Guillaume and the team are building new features as we go. So there's a dedicated effort, and I should say that we have carved out
nearly 10 people in the team who are doing only that, contributing to scikit-learn and the other associated libraries. Now, your question, Daniel, is whether scikit-learn will be obsolete in, say, a number of years, because general purpose technology has made it irrelevant in some ways.
Number one, scikit-learn is extremely frugal. It actually works on CPUs. And it is well controlled, well understood. It's actually quite predictable in some ways, whereas deep learning is usually known as a black box, where it's really, really hard to introspect.
And so scikit-learn does produce, for certain categories of problems, things that are actually working quite well, more so than large language models, for sure, today, and more so than any sort of
deep-learning-based technology that we understand today. Now, it is possible that with additional data, additional training and techniques, and even evolutions of the transformer-based model, we could improve and probably render scikit-learn obsolete.
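The "introspectable" point above can be made concrete with scikit-learn's built-in inspection tools. A hedged sketch on synthetic data, using permutation importance:

```python
# Hedged sketch of model introspection: permutation importance measures
# how much a fitted model's score drops when one feature is shuffled.
# Synthetic data; illustrative only.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Only 2 of the 6 features actually carry signal.
X, y = make_regression(n_samples=300, n_features=6, n_informative=2,
                       random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# A large score drop after shuffling a feature means the model
# genuinely depends on it.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.shape)  # (6,) -- one mean score drop per feature
```

This kind of per-feature accounting is what makes a scikit-learn model auditable in a way a large black-box model usually is not.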
As for us, then, Guillaume and I, we talk about that, and with the team we also experiment with other LLMs, and we are also trying to figure out how we can use these new technologies to actually help our first persona, and that is the data scientist. So we are a technology provider to help data scientists,
increasingly so the data scientists in enterprises, because we will be creating value-adding services and solutions so that we can generate revenue to sustain our mission. So the goal for us is to actually project ourselves while contributing to open source, but also create
a sort of business value proposition not dissimilar to Red Hat, because that is the closest type of company that we identify with in terms of spirit. To that point that you're making right there, I'd like to get back to something that you said earlier that feels like you're kind of tying back to it anyway, and that's that
You talked about the mission to create more open source, and the mission that you're trying to create this environment that you're describing by design, you said, and that with scikit-learn here to stay for the long haul, it's going to be something that is not going away soon. It's solving such a high percentage of the problems. Could you describe a little bit about what you're thinking around that in terms of further developing this particular set of software and the ecosystem around it,
so that we have the benefit for, you know, for many years to come. How are you approaching that? So the company is built with multiple business units, if you wish. That's a big word for a startup, right? But we have multiple revenue lines and multiple activities, even within the open source team,
which is dedicated. So, you know, Guillaume perhaps can elaborate on some of the other libraries that we support that complement scikit-learn. So, you know, that's one way to answer the question. But also we are building a new product, which I call reversible SaaS.
So we are building a product that will provide additional value to data scientists. And the goal is to create a sort of, I don't want to use the term copilot, because that is too close to LLMs, but it is the spirit. We are building a companion to augment the work of data scientists, all the way to teams. So that is an additional product on top of scikit-learn, because scikit-learn just works.
And so we don't want to change that and contrary to a company that would build a SaaS solution with a proprietary approach, we want to say, okay, whatever you guys use is fine. We need to find a way to add new value and some of it will be open source, fairly modular. But for those companies that have more money than time,
that in more service than beyond their own will have a solution for you and will make your life easier and you know data scientists are.
a new breed. It's a new type of job. It's not been around for very long. And in a way, when I talk to people, so, you know, I've been in code forever, and you know this, right? Developers, when they get hired, they are turnkey in some ways, right? They get their environment and they know how to pair code. And that's all pretty standard. But when you talk about data scientists,
It's actually quite artisanal. It's an art and a science at the same time. And you're manipulating two objects, actual code, but data scientists are not coders. And you're manipulating actual data, that's not code, it's patterns. And so data scientists have a difficult task, which is to combine these two things and create value for the enterprise. And then they talk to business units and they're like,
What do I do with this model? How do I put it in production, right? So there is a huge conundrum to solve, and that's what we're going to do, additionally to building open source modules that people can use. Maybe Guillaume can elaborate on some of the other libraries that are key to actually help. Yeah, so here within Probabl, we have the open source team, and we worked for many years on scikit-learn already, but we see the importance,
as a community, of putting models into production, as well as getting closer to the data sources. So we are working on libraries that should make those come together. So, for instance, we have a library on the MLOps side that is called skops,
which we work on a bit to make persisting models more secure in some way. But we also look at how to bring databases, the SQL world, closer to the machine learning models. So how can you transform data with states, with different tables, and how can you stay in your Python world without caring so much about SQL, for instance, and how you can bring this into scikit-learn.
And we want as well to improve whatever is visualization, evaluation, inspection of models, which is on top of just training an algorithm. So we want to augment all those aspects. And either it's in scikit-learn, or it's a library connected to scikit-learn, let's say. So the one before is called skrub, by the way, as in scrubbing data. So skrub and skops are two libraries that we look at.
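The tabular workflow these companion libraries streamline can be sketched with scikit-learn alone: heterogeneous table columns flow through preprocessing into an estimator as one fit/predict object. The toy table and column names below are invented for illustration:

```python
# scikit-learn-only sketch of a tabular workflow: mixed numeric and
# categorical columns are routed through preprocessing into a model.
# The toy "database table" and its column names are invented.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A toy table with heterogeneous columns.
df = pd.DataFrame({
    "amount": [10.0, 250.0, 31.5, 800.0, 12.0, 5.0],
    "country": ["FR", "US", "US", "FR", "DE", "US"],
    "label": [0, 1, 0, 1, 0, 0],
})

# Route each column type to an appropriate transformer.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

# The pipeline bundles preprocessing and model behind fit/predict.
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])
pipe.fit(df[["amount", "country"]], df["label"])
print(pipe.predict(df[["amount", "country"]]).shape)  # (6,)
```

Because the whole pipeline is one object, it is also the unit you would persist and deploy, which is where a secure serialization layer like the one described above comes in.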
So, you know, as we're kind of talking about the libraries now, you have this robust open source contributor community built up around scikit-learn and the various projects within it. How does Probabl work with those? How have you guys set up that relationship? What does the governance look like on that?
Because you have both your core team that you alluded to earlier that's working at Probabl on this, but you also have that larger open source community. How does that all work? Can you kind of tell us how that's evolved? I imagine it's quite mature by now.
And that's the point. The maturity means that, by design, we decided to not affect the license of scikit-learn. We're not branching it out. We're going to care for it. And the governance of scikit-learn being so sane already means you don't touch it — if it ain't broke, don't fix it. So the governance is unchanged. The center of gravity was at Inria, the research center, but it also involves people all over the world.
I don't know, Guillaume, how many contributors — maybe 200? Even more, I think. In a year, you have maybe three or four hundred. And the core team — let's say half of it might be around France, around Paris, around probabl, but then there's another half of people around the world who contribute almost every day, let's say, by communicating with the community. And as Yann mentioned,
we didn't want to change anything in that regard. The only thing that we actually did was to do more to bring transparency, to explain to people. Now that we're in probabl,
we feel that, because we are a private entity, we need to communicate what we are doing, what our roadmap is, and which community items we are going to work on, just to bring more trust, so that we don't go dark and nobody knows what we're doing. So we really try, every six months, to say which of the items defined by the community — these are not defined by probabl —
which of those items we have the capacity to work on with the human resources that we have at hand, let's say. So we really want to show that.
And then, by design, the open source team that is full time on scikit-learn and other open source libraries is a cost center to the company. That cost center is there by design, and we know it's a cost we have to cover. So we will cover it through different types of activities. For instance — and this was something that was done in the past —
brands were sponsors. Either they hired someone who became a core developer, and they were naturally sponsoring someone to build up this technology, or they were giving money as a donation to the research center. But now that the team is with us, we are translating this into a contractual sponsorship framework.
And so brands that want to contribute to scikit-learn and help us compensate for salaries will get something in return — exposure — and if they actually put more money into it, then we'll have a conversation around the roadmap and find a way to make it converge in a win-win kind of way. Because Guillaume, for instance, can say, this brand wants us to do something, but it makes no sense for the community — then we don't want to take that money
for the sponsorship type of business. However, if companies want to pay us to do a certain type of paid-for software, we'll look at it. But that's a different branch of the company. So we've really clearly separated them. And by design, we know there's a cost to it.
And that cost, if we are doing well, is actually compensated by the fact that we have done good by the brand. In other words, hopefully the community will resonate with what we're doing, and they'll
pay us back by appreciating what we're doing, which will carry the message further. So we think there is a self-fulfilling prophecy if we keep adding value to the whole scheme, as opposed to removing value — and I will not name certain projects that have chosen a different way. But on the other hand, going back to the governance of the company,
when a company flips and becomes VC-funded, or only VC-funded, VCs require a sort of return on investment that is too radical. And that sort of forces a change of posture vis-à-vis the community and the licensing scheme.
In our case, we've actually created a structure that is balanced in terms of shareholding groups. And so we will ultimately have — that's the goal of the structure, of the architecture — as much money from public support as from private support. So it's, again, sort of balanced.
You know, when we started podcasting back in 2009, an online store was just the furthest thing from our minds. Now we have merch.changelog.com, and you can go there right now and order some t-shirts, and it's all powered by Shopify. What did we do before Shopify? I'll tell you — we did nothing. We couldn't sell. There were other ways, of course, but they were very hard, very difficult. Shopify let us build an entire
front end, obviously branded like change log is. It's amazing. merch.changelog.com. And our favorite feature is we use their API to generate a new coupon code, a personalized coupon code for every guest that comes on our podcast and they get a free t-shirt from our merch store. And that's so cool. They choose the shirt they want. They use the coupon code. It arrives free of charge to them and life is amazing.
But also, you can go there right now to merch.changelog.com and buy some threads yourself. And that's awesome as well. So upgrade your business and get the same checkout we use with Shopify. Sign up for your $1 per month trial period at Shopify.com slash practical AI, all lowercase. Go to Shopify.com slash practical AI to upgrade your selling today. Again, Shopify.com slash practical AI.
So as we come back out of break here, I want to turn to kind of a fun question for you. And I'd like each of you to take a swing at it because it's not specific to being the CEO or doing the technology itself. If each of you could describe kind of a cool use case, something fun or interesting or that's really captured your imagination,
with scikit-learn, and kind of share that with the listeners — something that really grabbed you as your thing. I'd love to hear it. I'm expecting it to be a bit different coming from each of you in your different roles, but I'd love to hear how you see that and what's the thing that sticks out in your mind. And you start, because I have to think about it now.
So it's a very technical one, let's say. During my PhD, I was doing classification — I was trying to find people that have a specific type of cancer, prostate cancer, versus people that didn't have it. And inside that space, you had one truly specific problem, which is called imbalanced data.
And that is what introduced me to scikit-learn, basically, because I was using scikit-learn for those specific issues and for how to tackle that type of issue. And what is really funny is that this is how I got interested in scikit-learn and started speaking with the developers, for instance, and I developed one library called imbalanced-learn, which works alongside scikit-learn — it's compatible in some ways. And for many years I maintained that package,
even after my PhD, alongside scikit-learn. And year after year, we did everything by the book, basically, in that library. We implemented the algorithms that were in the literature, and everything was fine. Then, as part of Inria and now probabl, we also had time to educate ourselves, and to try, through the documentation of scikit-learn, to explain some concepts to people. And by doing this, we found out that most of the research there
didn't look at the problem properly. And by communicating with the other core devs, we just found out that a huge part of this thing was just wrong, and that you should look at it in another way. And it's pretty funny, because with this, we found some useless stuff that was in imbalanced-learn. But now we have better content, we went to conferences to explain these problems, and people started to tell us, oh, yes, actually, that's right.
And it's funny to come and say that whatever you were doing five or ten years ago is actually obsolete, or not good — that's not what people would expect. And it's something that I find really fun when you do open source, because
you are just here to contribute to something, to bring the best of what you do to everyone, and everybody will be thankful for that. I mean, you are not defending your own scientific paper, or claiming that it's all true. For me, that's one experience that goes from my PhD, eight or nine years ago,
up to where I am now. I see an evolution where I was with really good people, and you could correct errors that you made in the past. And that will actually benefit everyone afterwards, because it lands inside the documentation of scikit-learn, or even inside the library itself. And then the millions of users will be affected and say, oh, actually, that's good. And this is something that, if I had stayed in academia, for instance,
probably wouldn't have happened, because you wouldn't have had the time, or been critical enough, because you would have been defending your own personal work. But that's one example. Exactly — that's good.
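The imbalanced-data problem Guillaume describes — far more healthy patients than cancer cases — is often tackled by resampling, which is what imbalanced-learn provides (for example its RandomUnderSampler). As a standalone illustration of the idea, not the library's own code, here is random undersampling in plain Python:

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Balance a dataset by randomly downsampling every class to the
    size of the smallest class -- the idea behind resamplers such as
    imbalanced-learn's RandomUnderSampler (this is a sketch, not its API)."""
    rng = random.Random(seed)
    minority_size = min(Counter(y).values())
    indices = list(range(len(y)))
    rng.shuffle(indices)
    kept, per_class = [], Counter()
    for i in indices:
        if per_class[y[i]] < minority_size:  # keep at most minority_size per class
            per_class[y[i]] += 1
            kept.append(i)
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy imbalanced dataset: 95 healthy (0) vs 5 cancer (1) cases
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_res, y_res = random_undersample(X, y)
print(sorted(Counter(y_res).items()))  # [(0, 5), (1, 5)]
```

A classifier trained on the resampled data no longer gets 95% accuracy simply by always predicting "healthy", which is the trap imbalanced data sets for naive evaluation.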
I might be the CEO, but I do have the impostor syndrome, because scikit-learn is so impressive, day in, day out. I mean, that team — and Guillaume is very humble and very discreet — but the amount of knowledge and the amount of technicality that is trapped inside this library is mind-blowing. And you haven't met the other members of the team. It's very, very hard to compete in terms of
the amount of CPU cycles that go in there. So scikit-learn is the gift that keeps on giving, in some ways. And the team is just out of this world, and nice, and it's just a pleasure to work with that team all the time. Now, the more I discover scikit-learn, the more I find it amazing because of
what the brand means to people. So last week — and today, actually — we just released, and if you allow us we'll put the link in the notes — Of course, absolutely. — we released the very first official scikit-learn certification program. And what's amazing is that — so this is the first time, so we're doing it step by step — the system works. People can register, they can pass or fail the test.
And without advertising, we had, within a couple of days, 600 registrations from all over the world. A lot of that is from India, actually, because people in India also work remotely for clients across the world, and so they need a stamp of approval to showcase their ability to provide a service. So it's very interesting that this brand can almost instantly
enable and promote a sort of service that is value-adding. So that's one thing. But then, on the more technical level, I fell in love with one new feature that came out with 1.5 of scikit-learn, developed by another co-founder and core developer, Jérémie.
And that is the callback feature. Why? Because scikit-learn, in fact, is a platform. It is a platform, and the callback feature allows us to provide extensions, if you wish, where people can
hook into the inner workings of scikit-learn as they are building new models. And in fact, I find that to be essential, because we are entering an age of liability with regards to AI. Companies need to be able to introspect — they need to find out why the model is producing such and such results. So introspection is critical. And as I said earlier, deep learning is sort of a black-box type of approach —
which I love, by the way. Again, in 1992 I was building deep learning models, in the middle of the winter of AI at the time. But scikit-learn is actually quite introspective, quite transparent, frugal, as I said. And so callbacks are yet another feature that provides actual introspection into how we build models. Because — talk about insurance companies, fraud detection — you've got human beings at the end
of the spectrum being handled by algorithms. And so that is critical, and I think we fulfill a very important need with these features. So again, I keep hearing the gift that keeps on giving, and I'm impressed every day, with a bit of an impostor syndrome, because that team is just so powerful with this tool.
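The callback pattern Yann describes — hooking into a model's inner workings while it trains — can be sketched in a few lines. This is a generic illustration of the idea, not scikit-learn's actual callback API; the class and hook signature here are invented for the example:

```python
# A toy trainer that invokes user-supplied hooks at every iteration, so
# callers can introspect training without modifying the estimator itself.

class GradientDescent1D:
    """Minimize f(w) = (w - target)**2 by gradient descent."""

    def __init__(self, target=3.0, lr=0.1, n_iter=50, callbacks=None):
        self.target = target
        self.lr = lr
        self.n_iter = n_iter
        self.callbacks = callbacks or []

    def fit(self):
        w = 0.0
        for step in range(self.n_iter):
            grad = 2 * (w - self.target)   # derivative of the squared error
            w -= self.lr * grad
            for cb in self.callbacks:      # the hook point: expose internal state
                cb(step=step, weight=w, loss=(w - self.target) ** 2)
        self.weight_ = w
        return self

# Record the loss at every step without touching the trainer's code
history = []
model = GradientDescent1D(callbacks=[lambda **info: history.append(info["loss"])])
model.fit()
print(round(model.weight_, 4), len(history))  # weight converges near 3.0
```

The same hook could log to a monitoring system or stop training early, which is the kind of introspection the age-of-liability argument calls for.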
And speaking of this team, Guillaume, I'm going to throw a question at you. People out there have been listening to this, and they're kind of going, OK, I want to dig into this. So you're going to get some new developers coming in. How should they engage? How should they find the projects and get started contributing? What's a good onboarding path for those developers?
Probably the best onboarding path is: if you have the chance that inside your local community there are some people doing what we call first-time contributions to open source, or coding sprints, go speak to those people, because they will help you get on board. But then, if you are behind your computer and you don't know where to start, we have documentation that describes what we call a contribution — because contribution is not only coding, it could be speaking,
debugging, documenting, organizing sprints, those types of things. We document what we consider a contribution, and how and where you can help, basically. Of course, the natural thing is to come and code, and then we explain to you how to start with that. So this is on
the documentation webpage, and afterwards everything is online and public — there is nothing private. We have different channels of communication; the main one is GitHub, going through the issue tracker or the pull requests, depending on which side you are on. And the core developers will be around,
I would say, 24 hours out of 24, because we are spread around the world. So if I'm sleeping, somebody else in Australia or in the US will pick it up and just answer you, and give you feedback. And this is where your journey starts. You should not be shy, and you should not be scared of making a mistake, because we are not judgmental. We all started at that stage of saying, I don't know what I'm doing, and I need
to ask people, what should I do? And that's a normal step. Afterwards, you just grow with the community, and the community brings you along. The most difficult thing is, yeah, the first step — engaging. It's the impostor syndrome as well: people say, these are very skilled people, they will never want to speak to me. And that's not the case. So just come and try your best, and people will communicate with you, for sure. Great guidance there.
As we wind up, I'd like to get from each of y'all — for both probabl and for scikit-learn — what you think about for the future. And I'll let you define what time span the future is, whether it's a few months or years out.
But I'd really like to wind up with this: paint us a picture of when the duties of the day have finished and you're just relaxing and thinking about what's possible going forward. What do you think about?
I'll go with the mission. The mission is bigger than me, bigger than us, and that's why the governance creates a self-sustaining model. Of course, it's not trivial — there's a lot of work to achieve the mission long-term — but that mission ends up with an IPO. In other words, this company is not meant to be sold or wrapped up. The goal is to do an IPO so that this company can carry on with the mission, allowing people to invest
and be part of that story. And that's why earlier Daniel asked the question about investors and all that. We do have 70 individual investors, including people who were contributors, or are contributors, to scikit-learn,
who don't have the chance yet to be employees full time of the company. So the goal is to create this sort of dynamic vehicle. And if we look at the North Star, there is no such company today that is the provider of open source machine learning technology. That company does not exist.
And we aim to be that because we need that in an age where there's too much concentration within just a handful of players. That's not okay. It's not okay for the global south. It's not okay for Europe, which is lagging behind. But it's not even okay for the US. The US may have big tech, but that's not okay as a single model. We need people to own their data science. That's why that is our tagline. That was good. Guillaume, what are your thoughts?
Yeah, so maybe more on — so on the probabl level, I really think that we have a mission, let's say, to help more data scientists, but I will speak more about scikit-learn and the ecosystem. For me, the mission is: we should stay focused on what's happening out there and make sure that scikit-learn is still relevant. So we have the predictive modeling part, that's fine,
but we need as well to understand where it is deployed and how it is used, because we can make real progress there — for instance, making it easier to bring databases to scikit-learn, or to bring scikit-learn models into production, to reduce friction everywhere, and as well to bring value in understanding the models. I mean, we are speaking about the AI Act as well in Europe now. So I'm sure there are plenty of areas where we can really have an impact.
And then there's the technology that moves very fast. For instance, before we knew pandas; now there is Polars. So we need to move with that — how do we deliver value to the user that just makes the switch and still uses scikit-learn? Can we accommodate those things? So we have to make this audit of what's happening. And it is difficult to say where we will be in five years,
because in five years we'll have all those things — let's say the full chain of machine learning will be there — but we should be aware of whatever moves very fast around us, to stay relevant with it. That was well said too. Gentlemen, you guys have done a fantastic job of teaching the rest of us about this, and thank you very much for coming on the show today. You're welcome. It was our pleasure.
All right, that is our show for this week. If you haven't checked out our changelog newsletter, head to changelog.com slash news. There you'll find 29 reasons. Yes, 29 reasons why you should subscribe. I'll tell you reason number 17, you might actually start looking forward to Mondays. Sounds like somebody's got a case of the Mondays.
28 more reasons are waiting for you at changelog.com slash news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.