Podcast Summary
dbt LLMs: Use LLMs within your dbt environment to perform natural language processing tasks directly in your data transformation workflows, improving data analysis efficiency and effectiveness.
You can enhance your dbt project by using large language models (LLMs) to process unstructured text data. This is particularly useful for data such as customer reviews, titles, descriptions, and Google Analytics sources, which may require categorization, sentiment analysis, or other natural language processing tasks. There are several ways to approach this, including training a machine learning model or calling an LLM outside of the dbt flow; however, keeping these tasks inside your dbt environment as one of your dbt models is also an option, especially as Python dbt models continue to evolve. To get started, set up your dbt project by cloning the example project from GitHub and configuring your profiles.yml file, then set up your database, such as Snowflake, and prepare your source data. Keep in mind that dbt Python models currently work only with Snowflake, Databricks, and BigQuery. If you already have a dbt project and data, you can skip the setup section and jump straight to the guide on using the OpenAI API in your dbt project. By using LLMs within your dbt environment, you can perform natural language processing tasks directly in your data transformation workflows, making your data analysis more efficient and effective. For more information, check out the links provided in the text.
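As a sketch of the setup step, a minimal profiles.yml for a Snowflake target might look like the following; the profile name, account, and object names are placeholders for illustration, not the example project's actual values:

```yaml
# Hypothetical profiles.yml sketch for a Snowflake target.
# All identifiers below are placeholders -- replace with your own.
dbt_llm_example:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account
      user: your_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: transformer
      database: analytics
      warehouse: transforming
      schema: dbt_llm
      threads: 4
```

Reading the password from an environment variable keeps credentials out of version control, which matters here since the same project will also hold an OpenAI API key.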
Preparing a dataset for LLM text classification: obtain an OpenAI API key, set up external access integration in Snowflake, create or obtain categories, download and prepare the dataset, update the dbt project, set the OpenAI API key, and set a spending limit.
To use large language models (LLMs) like OpenAI's API for text classification tasks, you need to prepare your dataset and set up the necessary integrations. First, obtain an OpenAI API key and, if you use Snowflake, set up external access integration so your dbt Python models can reach the API. Then, create or obtain a list of categories for your text classification task. You can write the list manually or let the LLM suggest categories based on a sample of your data: a manual list gives predictable, stable categories, while LLM-suggested categories are less predictable but may still suit your use case. To download and prepare the dataset, use the package metadata from the TidyTuesday repository or the lightweight version from the author's repository. Load the dataset into your database, update the dbt project to match your database and schema names, and set the OpenAI API key in the Snowflake integration. Remember that the OpenAI API is pay-as-you-go, so set a spending limit to avoid unexpected charges. Overall, preparing your dataset and setting up these integrations are crucial first steps to effectively use LLMs like OpenAI's API for text classification tasks.
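If you let the LLM suggest categories from a sample of your data, the request could be sketched roughly as below; the function name, prompt wording, and category limit are illustrative assumptions, not the article's actual code:

```python
# Hypothetical sketch: build a chat-completion request asking an LLM
# to propose categories from a sample of titles. Prompt wording and
# the max_categories default are illustrative assumptions.
def build_category_prompt(sample_titles, max_categories=30):
    """Return a chat message list asking the model to suggest
    up to `max_categories` category names for the given titles."""
    titles_block = "\n".join(f"- {t}" for t in sample_titles)
    system = "You are a data analyst labelling R package titles."
    user = (
        f"Suggest at most {max_categories} short category names that "
        f"cover the following package titles:\n{titles_block}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# These messages would then be sent via the OpenAI chat completions API.
messages = build_category_prompt(
    ["Interactive web apps", "Grammar of graphics"]
)
```

Separating prompt construction from the API call also makes the prompt easy to unit-test without spending any tokens.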
OpenAI API in dbt: Set the temperature to 0, make the model incremental, and turn off full refreshes to efficiently use OpenAI's API in a dbt Python model for text categorization and minimize costs.
To ensure consistent categories when using OpenAI's API in a dbt Python model, set the temperature parameter to 0, and make the model incremental to prevent unnecessary full refreshes and reduce API costs. First, prepare the base of the dbt model by setting up the config and connecting to the OpenAI API; we'll use the R packages dataset and extract package titles for categorization. In the dbt model, you can pass the model configuration via the dbt.config() method and include the package requirements there. Next, make the model incremental and turn off full refreshes to save on API costs: only new or changed data will be categorized, preventing unnecessary repetition. Adding incrementality logic to the incremental run further optimizes the process. In summary, setting the temperature to 0, making the model incremental, and turning off full refreshes are crucial steps to effectively use OpenAI's API in a dbt Python model for text categorization while minimizing costs.
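The config and incrementality ideas above can be sketched as follows; the upstream model name `package_titles` and the helper function are assumptions for illustration, not the article's actual model:

```python
# Hypothetical sketch of an incremental dbt Python model.
# "package_titles" is an assumed upstream model name; the real model
# runs inside dbt with a Snowpark/pandas session.

def filter_new_titles(all_titles, already_categorized):
    """Keep only titles that were not categorized in a previous run,
    so each incremental run sends only new rows to the API."""
    done = set(already_categorized)
    return [t for t in all_titles if t not in done]

def model(dbt, session):
    dbt.config(
        materialized="incremental",  # only process new rows
        full_refresh=False,          # guard against accidental full re-runs
        packages=["openai"],         # Python packages the model needs
    )
    df = dbt.ref("package_titles")   # upstream model with raw titles
    if dbt.is_incremental:
        # In a real model you would anti-join against `this` in the
        # warehouse; filter_new_titles shows the same idea on lists.
        pass
    return df
```

Setting `full_refresh=False` in the model config means even a `dbt run --full-refresh` won't wipe and re-categorize everything, which is the cost protection the summary describes.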
Text classification optimization: to minimize costs and ensure stable results when processing large numbers of titles with the OpenAI API, remove previously categorized titles, send data in batches, use clear prompts, and be aware of OpenAI API pricing.
To process a large number of titles from a dataset with the OpenAI API for text classification, it's essential to optimize the process to minimize costs and ensure stable results. Here are some key strategies: 1. Remove previously categorized titles from each run, except the first one, to avoid redundant processing. 2. Send data to the OpenAI API in batches to reduce costs; a batch size of 5 titles works well. 3. Use clear and concise prompts to avoid repetition and SQL injection. 4. Be aware of OpenAI API pricing, which is based on the number of tokens sent and returned. 5. Use tools like tiktoken to estimate token usage, or the official OpenAI tokenizer to evaluate the cost of specific texts. For instance, with a dataset of approximately 18,000 titles and a batch size of 5, the total input would be around 320,000 tokens and the output around 140,000 tokens. A full scan would cost around $1.4 with the smaller model and $3.6 with the larger model. To summarize, optimizing the text classification process involves removing previously categorized titles, sending data in batches, using clear prompts, and keeping an eye on OpenAI API pricing to minimize costs and ensure stable results.
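The batching and cost arithmetic can be sketched with two small helpers; the per-1K-token prices passed in below are placeholder numbers for illustration, not current OpenAI rates:

```python
# Illustrative helpers for batching and cost estimation.
# Prices used in calls to estimate_cost are placeholders,
# not actual OpenAI pricing.

def batch(items, size=5):
    """Split items into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def estimate_cost(input_tokens, output_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Cost = tokens / 1000 * per-1K price, summed for input + output."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k
```

With 18,000 titles and a batch size of 5, `batch` yields 3,600 API requests; plugging the summary's 320,000 input and 140,000 output tokens into `estimate_cost` with a model's real per-1K prices gives the kind of full-scan figures quoted above.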
dbt model efficiency and growing R usage: the dbt model efficiently categorized 18,000 packages, with R most popular as a data visualization tool and increasingly used for data processing. The most significant growth occurred in 2019. Alternative approaches could save costs but require more engineering effort; the model should focus on one job and use JSON output for stability.
The dbt model effectively categorized all 18,000 packages without any gaps, proving cost-efficient and protected against repeated dbt runs. The top category, accounting for 6% of packages, highlights R's popularity as a data visualization tool, particularly with packages like Shiny and Plotly. The two fastest-growing categories in 2023 suggest R is increasingly used as a data processing tool. The most significant year-over-year growth among the top 30 categories occurred in 2019, following the release of influential papers like "Attention Is All You Need" and the first GPT paper. Going forward, alternative approaches such as GPT embeddings could be explored for cost savings, although they require more engineering effort. It's also worth considering moving this part out of dbt and into cloud functions or other infrastructure: a model should focus on one job, and adding extra logic risks rerunning it, which should be avoided, especially in environments with multiple developers. Additionally, asking for a delimiter-separated response can be unstable, since the response might not contain the expected number of elements, making it hard to map the initial titles to the returned categories. To address this, requiring JSON output is recommended: it gives a more stable and predictable response format, even though the larger response size makes it more expensive.
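A minimal sketch of the JSON-output idea, assuming the model is asked to return a JSON object mapping each title to a category; the function name and error handling are illustrative, not the article's code:

```python
import json

# Hypothetical sketch: validate a JSON response instead of splitting a
# delimiter-separated string, so missing items fail loudly rather than
# silently shifting titles against categories.
def parse_categories(response_text, expected_titles):
    """Return {title: category}; raise if any expected title is missing."""
    data = json.loads(response_text)
    missing = [t for t in expected_titles if t not in data]
    if missing:
        raise ValueError(f"missing categories for: {missing}")
    return {t: data[t] for t in expected_titles}
```

Raising on a missing title lets the dbt run fail (and retry) instead of writing misaligned title-to-category rows into the warehouse.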
JSON switch impact on classification quality: switching to JSON output for GPT-3 led to decreased classification quality, but an alternative solution using Cortex in Snowflake was effective. Joel Labes's dbt blog post provides the full code and resources for further exploration.
The switch to JSON output for GPT-3 resulted in a significant decrease in classification quality. However, an alternative solution exists using Cortex in Snowflake, and for those interested, Joel Labes wrote a comprehensive post about this issue on the dbt blog. The post includes the full code on GitHub for reference, along with links to a Tableau Public dashboard and the TidyTuesday R dataset for further exploration. Overall, this HackerNoon story highlights the importance of considering alternative solutions when encountering unexpected issues with data processing tools.