Transcript

This transcript was generated automatically and may contain errors.

So welcome to the Data Science Lab, everybody. My name is Libby. I'm a data community manager here at Posit, and I am joined by my co-host, Isabella Velazquez. Isabella, please say hello.

Hi, everyone. Thanks for joining us.

We are so excited to be joined today by Edgar Ruiz. Edgar is the maintainer of the mall package, which we will be talking about today. Edgar, would you like to say hello?

Hi, everyone. Thank you for joining. Also, thank you for being here in such cold weather, and hopefully we can show some cool stuff that y'all can use.

If you have not used the mall package, we are hopefully sticking links to everything in the Discord chat. We are talking about a package that allows you to use ellmer to connect to a variety of LLMs and then apply them to your data in a programmatic way. That means: hey, I have this column of text, and I want to transform it somehow with an LLM, classify it, translate it, whatever, and then have a column that represents the output from the LLM. This is amazing for me, because when I first started using LLMs, I was like, okay, but how do I use this? To me, it was just a chat thing. How do I use this programmatically if I want to apply it to my data? The mall package abstracted away all the difficulty for me, and I think that's great.

Edgar's intro to mall

Yeah, so this is an introduction to an idea that occurred to me when I was looking at some output from LLMs a while back. I saw that asking it, is this text positive or not, was actually pretty straightforward. Even a locally installed LLM would do a pretty decent job at it, and that's kind of what got it started. At the time I started writing mall, a year, year and a half ago, because of how fast everything has evolved, companies were hesitant about sending data into the cloud if the LLM came from a cloud provider. Now that's not as big a concern; more companies are doing it.

So at the time, most companies wanted to keep it local, and it looked like a locally installed LLM was actually going to work well. So one thing I can ask folks here today: if you start seeing things we could do better or add, let us know. I'll talk through several different functions, like sentiment, classify, and extract. You can also write your own custom one, and Libby will explain more about what that does. If you see other ones we could possibly add, that would be great. Also, the code is open, and it really boils down to a simple prompt that I'm running recursively over your dataset, so improvements to the prompts and things like that are always welcome.

The link to the GitHub repo is on the website, so please feel free to reach out.

All right, perfect. So, what Edgar was mentioning: sentiment analysis, text summarization, classification of text, extraction of text, translation, and binary verification (true/false). These are all built into mall, and then there is also a custom option. What I'm going to do is run through some code where I show you what it looks like to get this connected through ellmer to an actual LLM, and then what it looks like to use some of these on your data.

NLP versus LLMs for text tasks

Can I say something real quick about the NLP thing? That's the other thing I noticed: the local LLM was actually doing really well, even though it may take a bit longer to recursively go through a dataset. It's kind of like you're paying for the time that it would take you to develop your own NLP. Yeah, it's a tax, right? It takes longer to run, yes, but you didn't have to spend the time doing the NLP yourself, tokenizing your words and then doing all the other analysis you would have to do.


NLP, I feel like, is much more predictable. In the chat, someone asked: what's the difference between using mall for these NLP tasks versus traditional or classical NLP? The difference is really just that you are using an LLM instead of traditional NLP methods. Those might be named entity recognition, or sentiment analysis using a defined lexicon, or something I love to do in NLP, which is a kind of power analysis. That means you create a lexicon or database of words or phrases that indicate certain things, and then you use code and math to compare the text you are working on against your database of what counts as a positive word, a negative word, a neutral word, or a powerful word.

Live coding: setting up ellmer and Ollama

So what I will say at the top of the script is: if you have not already installed mall and ellmer, those are the two you are really, really going to need. I have a little install.packages() for you if you would like, and I also have this repo for you.

So I'm going to go ahead and run this chunk, which is just going to load ellmer, mall, and the tidyverse. If you have access to a commercial LLM, you need to make sure your API key is in your .Renviron file or your .env file so that ellmer can recognize it and use it. I am going to be using an Anthropic model today, but not through an API key in the usual way; I'm using it through AWS Bedrock. I'm also going to show you Ollama, which is so much more accessible. It runs LLMs locally on your machine, and you can go to ollama.com/download.
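As a rough sketch, that setup chunk might look like the following. The package names are real; the ANTHROPIC_API_KEY variable is just an example, since the exact environment variable depends on your provider:

```r
# One-time install of the two packages you really need (plus tidyverse):
# install.packages(c("mall", "ellmer", "tidyverse"))

library(ellmer)
library(mall)
library(tidyverse)

# For a commercial provider, ellmer picks the key up from an
# environment variable set in .Renviron (or a .env file), e.g.:
# ANTHROPIC_API_KEY=sk-...
```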

And Corey had just asked, can mall be used with a local LLM? Yep, absolutely. You can use it with Ollama, and what I'm going to do is show you that I already have Ollama installed and running. On my Mac menu bar, there is a little symbol that looks like Jar Jar Binks to me, but it's definitely, definitely Ollama. And here it is. I just asked it a question. I said, tell me a joke, and it said: here's one. What do you call a fake noodle? An impasta.

So Ollama is up and running on my system. I'm going to let Edgar hop in here and just mention the size caveat on some of these, because with local LLMs, you're pulling them down onto your machine to run them.

Yes, exactly. That's exactly the point. The issue is that these have to run in memory. So you may have a hard drive that's big enough, but you still need enough RAM to run these things. Llama 3.2 is wonderful because it's very generalized. Going back to the NLP comparison: you're limited in the amount of training data you'd have for your own NLP, but with Llama you have so many more billions of tokens of text. The trade-off is that it takes a lot of space. Llama 3.2 is about two gigabytes, while the latest Llama, version 4, is actually 67 gigabytes on the smaller side.

Yeah, 3.2 is what I recommend. Also, as time goes by, more folks publish models that are more specialized. You may want to use this in a very specific field, and there may be models like that. So definitely follow the ollama.com website; you can search for the type of model you may want to use. But at this point, Llama, to me, is the better one.

So I'm going to close this and talk about the ollamar package really quickly. The ollamar package allows you to pull down Ollama models and have them recognized on your machine very quickly. For example, I was able to run ollamar's pull() and just give it llama3.2 as my model. Once that's done, and I'm not going to run it because it already is, I can run this line of code right here that tests the connection to Ollama. I've got my status 200, which means all good.
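The ollamar steps described here look roughly like this (a sketch; it assumes Ollama is already installed and running locally):

```r
library(ollamar)

pull("llama3.2")    # download the model so Ollama can serve it
test_connection()   # inspect the response; status 200 means all good
```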

Okay, so let's head over to do something very important here, which is something I actually forgot to do the first time I did this, but it is now in the repo: set a seed so that these random samples are reproducible. What I'm doing here is collecting a few data sets that are full of tweets. This is from one of the Kaggle tweet data sets, and it's linked in the readme of the repo. So if you would like, go download that tweets.csv; it is also in the repo as data right here.

And what I've done is take a few samples. I have a tweet sample that's just 25 random English-language tweets. Then I have a random sampling of Jimmy Fallon tweets. I like Jimmy Fallon tweets because they're a mix of really neutral announcement tweets, like announcing a guest on a show, and also personal tweets from him. And then I have a random sampling of Katy Perry tweets, because I'm going to talk about safety and curse words, and Katy Perry sure does like to curse on Twitter.

Connecting ellmer to mall

What I'm going to do next is set up ellmer. This is not mall yet; this is still ellmer. It's very important to realize that ellmer and mall go together: you don't use mall without ellmer's support. Ellmer is going to give you the ability to access the models that you are then going to use programmatically.

I'm going to create two different chat objects. I'm going to use ellmer's chat_aws_bedrock() to set up a chat with Anthropic so I can run this, and then I'm going to create a chat with chat_ollama() and tell it the model is llama3.2. When I run both of those, I have two chat objects, and I can then use a mall function to switch between those different ellmer chats.
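A sketch of those two chat objects, plus the mall call that points at one of them. The Bedrock model ID and the object names here are illustrative assumptions, not the demo's exact code:

```r
library(ellmer)
library(mall)

# Chat with Anthropic via AWS Bedrock (model ID is an example):
chat_claude <- chat_aws_bedrock(model = "anthropic.claude-3-5-sonnet-20240620-v1:0")

# Chat with a local Llama 3.2 served by Ollama:
chat_llama <- chat_ollama(model = "llama3.2")

# Tell mall which ellmer chat the llm_* verbs should use:
llm_use(chat_llama)
```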

I just want to mention real quickly, since we're talking a lot about ellmer and not mall: I went ahead and created the ellmer backend inside mall because initially mall didn't have it; it was Ollama only. That makes it easier for you to connect to different kinds of providers, because mall didn't have to carry its own code to connect to each one. Ellmer is my gateway, so I can just focus on the main stuff that mall does. So that's why we're talking about two packages, and ellmer does a great job at this kind of integration, as Libby's going to show.

No, yeah. And I want to hop in and answer David Onder's question here, which was: what kind of specs should I realistically have to make good use of these models? He said he has 64 gigs of RAM. That's more than enough; you would be fine with 16 gigs of RAM, I think, for 3.2. Also, when you go to the Ollama downloads and look at the different models, each model usually has two or three different size versions you can install.

Running LLM sentiment analysis

So I have created my chat objects for both Anthropic and Ollama. Yours might look more like chat_openai() with a model argument. Okay. Now I need to tell mall that I have this ellmer chat object and I want to use it, and I want to use Ollama, so I'm going to run this use-Ollama chunk.

This next bit is me trying to stop a cache problem from building up, because I'm trying to break the ellmer package here. So I will do this and tell mall I want to use ellmer on the back end, and then my LLM session shows my model is Llama 3.2. We're all good to go. Now I'm going to apply the llm_sentiment() function to my content variable.

So let's go look at my tweets sample. This is the data set I'm going to use right here. If we take a look at it in the console, I only have two columns: author and content. Author is a Twitter handle, and content is just the text of the tweet as a string.

So I'm passing the content variable right here to llm_sentiment(), and then I'm giving it some extra optional arguments. Let's go over to the mall docs really quickly and look at the sentiment function. We've already given it the .data argument because I'm piping it in. Remember, when you pipe something, the pipe takes everything on its left-hand side and feeds it to the next function as the very first argument, so that one is satisfied. Then I'm giving it my column, that content column. And then you can give it options, which is like: hey, these are the options I want to accept as output from the LLM. Mall is very smart, and if it gets output that doesn't fit, it will coerce it to NA for you.

And then pred_name, the most important one to me. I don't really love the default .sentiment or .pred column names; I like naming the output something specific, especially because I like to compare different models. And then you also have the option of an additional prompt. I wanted to ask you, Edgar: is this like a system prompt, or is it an additional individual prompt?

This gets attached to the main prompt that's being sent. So you can, for example, say: if the person uses these specific words, then consider it neutral, or whatever. You can add some extra instructions to make it more fine-tuned and better.

But I am not going to use that. What I am going to do is just say: hey, here are your options, positive, negative, and neutral, and name my output column sentiment_llama. So I can run this, and if everyone has lit their candles correctly, my code will run. A little progress bar at the bottom here. Love that. Aha, there we go. It was created.

And here we are. I now have an extra column over here of sentiment. This is only 25 rows, y'all; if you have an enormous data set, please be prepared to wait. I had a question in Discord: are you limited to those three options, positive, negative, neutral? No, I could give it whatever I wanted. You could do something like very positive, positive, neutral, negative. I would suggest giving it instructions about what each of those things means, right?
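Put together, the call being described looks roughly like this (a sketch; the column and output names follow this demo's conventions):

```r
tweets_sample_sentiment <- tweets_sample |>
  llm_sentiment(
    col       = content,                               # the text column
    options   = c("positive", "negative", "neutral"),  # accepted outputs; anything else becomes NA
    pred_name = "sentiment_llama"                      # name of the new prediction column
  )
```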

Libby, yeah. I think Javier is also asking if it can return the prediction probabilities. It doesn't, because that's not part of the prompt. With ellmer, if you go into the articles, there's a way you can build a prompt like that, where you say, give me a percentage of how confident you are, so you can kind of make it do that. But I don't think you'll be able to get probabilities properly, as if it were an actual NLP model.

Also, while I have the floor, I just want to mention real quickly, because of how long these runs take, that we have the cache option. If you're running the same QMD or the same script, just refining it, and you're going to get the exact same result, then rerunning, let's say, llm_sentiment works much faster the second time, because mall is automatically caching results into a temporary folder. So you don't have to rerun everything.
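If I recall the mall docs correctly, the cache location is controlled through llm_use(); treat the argument name here as something to double-check in ?llm_use:

```r
# Point mall at a chat and at a cache folder; cached results are
# reused on reruns so only new rows hit the LLM.
llm_use(chat_ollama(model = "llama3.2"), .cache = "_my_mall_cache")
```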

Comparing Ollama and Anthropic outputs

So now, with all of these many things I have open, we will go open tweets_sample_sentiment one more time. This lets us look at a comparison between the two. The one on the left is Ollama; the one on the right is Anthropic. And we can see that they don't agree on all of these, right? So this is Ollama, and Ollama is coding, for example, this one as negative when all it says is Mondays and fur babies. Mondays could be negative for sure, so Ollama, you might be right, but Anthropic coded that as positive. We have another disagreement on the second one, which just says tomorrow, Toronto, and a URL. I've noticed that Ollama frequently codes something with a URL in it as negative, and I don't know why, but that seems to happen a lot. I think Anthropic is more correct here that this is a neutral tweet.

What I'm doing here is looking at this sample of sentiments, which I just showed you, so that I can render the document and you can look at it as a PDF if you want to. Then I calculate the agreement: 0.68, so 68% of the time they agreed; otherwise they did not match. This is something where I really recommend you go through and review. We all know that LLMs aren't perfect. Neither is NLP, by the way.
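The agreement number can be computed with a one-liner like this (a sketch; the two column names are assumptions based on how the demo names its outputs):

```r
library(dplyr)

tweets_sample_sentiment |>
  summarise(agreement = mean(sentiment_llama == sentiment_anthropic, na.rm = TRUE))
# e.g. 0.68 means the two models agreed on 68% of the tweets
```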

Summarization and custom prompts

Let's head over to summarization, because we are about halfway through. I'm switching over to Llama 3.2 again, and I'm going to use the Jimmy Fallon tweets this time. What I'm going to do is create an extra column, called summary_llama, in a new tweets_jimmy_fallon_summary data set. The extra thing I've given it this time is max_words = 10. Now, LLMs never listen to us; it's frequently not going to stop at 10 or give me fewer than 10, but a lot of times this can help stop it from being too wordy.
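That summarization call, sketched out (names follow the demo's narration):

```r
tweets_jimmy_fallon_summary <- tweets_jimmy_fallon |>
  llm_summarize(
    col       = content,
    max_words = 10,               # a target, not a guarantee
    pred_name = "summary_llama"   # name of the new summary column
  )
```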

My friend slept in our walk-in pantry. When he laid down, his feet would stick out the door. Hashtag my first apartment. Okay, Llama's on the left: crazy and quite concerning living situation for a first apartment. Not wrong, technically. Let's see what the Anthropic summary is: friend slept in tiny pantry, feet stuck out door. That is much more accurate to me, that Anthropic summary, right?

Met's bucket hat guy spotted at event is probably more correct. So again, Ollama is not technically wrong, but is maybe not super helpful, and Anthropic is slightly more helpful. You're going to have to go through and review whether or not this is good enough for you, right? If you're doing a summary task with Ollama locally, is this good enough? If not, you might want to lean on a more commercial LLM, like Anthropic's models.

Now I'm going to go through this really quickly, because we have 20 minutes left and I want to stop talking and hand this over to show Python code. What if you don't like those standard functions? What if you want to do your own thing? Let me show you how to make a multi-step prompt, using the Anthropic model specifically. I'm going to look at Katy Perry's tweets and have it classify for me whether or not they are safe or dangerous. I'm going to define safe and dangerous really simply here, and I'm also going to tell it to ignore URLs, because I've noticed that LLMs just get confused by URLs sometimes.

I'm also going to tell it that safe text contains no slurs or curse words, to assess whether the text is safe or dangerous, and then to return one word, either safe or dangerous. I am saving this in a prompt object. It's just a string; it's not going anywhere yet. It's just a string object saved as prompt.

Now I'll go through my sampling of Katy Perry tweets. For the prompt option in mall's llm_custom() function, I'm going to give it my prompt, and then I'm also going to give it my valid responses: only safe and dangerous. I want it to coerce everything else to NA. Then I'm going to name the new column it's creating safety_anthropic. So let's run this and see how long it takes to go through with that custom prompt.

There is one question about why I used paste() if it's all text. Ah, that was me creating my prompt up here. I used paste() because that's exactly what's in the mall docs, and that's all I did. You could just put it all together in one string. What paste() does is join the pieces with a space between each one, but having it like this is nice because it lets you look at the prompt line by line and understand exactly what you are telling your LLM to do.
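Reconstructed from the description above, the prompt and the llm_custom() call might look like this. The exact wording lives in the demo repo, so treat this as a sketch:

```r
# Build the multi-line prompt; paste() joins the pieces with spaces.
prompt <- paste(
  "Ignore any URLs in the text.",
  "Safe text contains no slurs or curse words.",
  "Assess whether the text is safe or dangerous,",
  "then return one word: either safe or dangerous."
)

katy_perry_safety <- tweets_katy_perry |>
  llm_custom(
    col         = content,
    prompt      = prompt,
    valid_resps = c("safe", "dangerous"),  # anything else is coerced to NA
    pred_name   = "safety_anthropic"
  )
```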

Isabella lit a candle for us and it finished coincidence you decide. Okay, let's go look at our Katy Perry safety. Do we dare look at this live on the internet? We're gonna do it.

Okay, Katy Perry's tweets have been classified as safe or dangerous. This first one is safe: a visual explanation of what people in Florida and surrounding areas are experiencing, send them prayers. Looks pretty safe. Row three, it has decided, is dangerous, and it does have a curse word in it. I think that's great. Good job, Anthropic. You did a good job.

Okay, so before I move on, I want to say that this prompt took iteration. It did not work the first, like, eight times I did it. It didn't do what I wanted it to do. I had to iterate and iterate and iterate until my prompt would do what I wanted reliably. Also, LLMs are not deterministic: I could run it one time and run it again, and it could give me a different answer. So use LLMs for things like this at your own risk.


Before I hand this over to Edgar for the last 15 minutes, I also wanted to talk about an ellmer function, not a mall one, which is pretty nice: the token_usage() function. I think this is pretty new, right, Edgar? It tells you how many tokens you've used: here are the models you've used, and here are the number of input and output tokens. Notice that I don't have anything cached, because I told it not to cache anything. As for the price, I have seen it show an actual price for a commercial model. I'm using it through Bedrock, so it's not the same here, but I have seen the price show when it's hooked up directly to a commercial model.
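The ellmer call in question is just the following (the exact columns in the output may vary by ellmer version):

```r
library(ellmer)

token_usage()
# One row per model used this session: input and output token counts,
# cached tokens, and a price column when ellmer knows the model's pricing.
```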

Python demo: mall with polars

All right, here's the code. There we go. So, I'm assuming that most of us here are R users, so I'm going to add some additional explanation of how the mall workflow actually translates into Python, so that, as an R user, you can also see how it differs in Python.

Unlike R, where mall can be a standalone package that works on data frames, in Python, and specifically in the PyData community, we have pandas and the pandas data frame. But there's a newer one called polars. Polars is great: it's very fast, it was written in Rust, and a lot of folks are starting to use it now. In fact, it's becoming the recommended way of working with data frames going forward. So that's what mall builds on, on the Python side.

So what mall is, is an extension of polars, and I'll clarify what I mean by extension in a second. I have this critics data frame loaded. These are reviews of the Nintendo game Animal Crossing: New Horizons, so they can be pretty favorable. It's one of the Tidy Tuesday data sets; we'll have a link here. You can see, essentially, the critic's name and the actual text of the review.

The next thing I'm going to do is import mall. Notice that if I go to critics and type dot, which is kind of like the dollar sign in R, and start typing llm, you see that it's not available. As I mentioned earlier, llm.sentiment only becomes available when you actually load the mall package. Now you should be able to see it.

That was something really interesting to me, because I'm kind of new to Python. That's basically what extension means: it becomes part of that object. Your data frame is an object that is now extended with this functionality. So now that it's there, we have to assign, per data frame, which backend we're going to use as a provider, and I'm going to use Ollama here.

In this case I'm using the out-of-the-box integration with Ollama, which is the ollama library from PyPI. When you install mall, it actually installs this package too, so I can just use it. And it's very similar to how it works in R: instead of piping into a function, you basically call the extension and run it. Now it's going to run.

Yeah, so his little call there on line 34 is the data frame name, then .llm.sentiment, so we're working on that data frame class, and then we're passing the text column name as a string. Correct. For anybody who's not used to looking at Python, that's something that took me a while to get used to: I always have to quote the names of the actual columns I'm using. So it's going to be quoted. So it's running right now.

In the meantime, we have a question from Dan: the documentation says that in Python, mall is a library extension to polars. Does that mean it does not work with pandas? Correct. If it's an actual pandas data frame and you don't have polars loaded, it won't recognize it; you won't be able to use it. You have to convert it.

I want to show here, as Libby mentioned, that you can select R or Python in the docs, and they can walk you through how to set that up on your machine. All the same examples are available, as well as the reference. Once you select Python on any of the pages in the site, it switches everything to Python, so the documentation is also available for it.

The package itself, on both the R and Python sides, comes with a small data set that has three reviews, and you can use it to test things the first time.

Yeah, I put it on my other screen while it was running. So you can see it ran, and if I were to run this again, you'd notice that it runs almost immediately because of the cache. That's the big advantage of having the cache, especially if you're trying to re-render things and rerun them: it's not going to take as long the second time.

And then I'm going to run this to see if there's anything that is not positive, because everybody liked it. This is the only one that kind of didn't like it. So we can see there's some variance there; not everything was classified as positive.

For this one I'm going to use the user reviews. I'm going to read those in, and instead of 100 I'm just going to do the top 10 for right now. What I'm going to do here is extract the language the review was written in. So we're using llm.extract, or llm_extract from R. We are passing it the text column, and the prompt we're giving it basically says: I want you to extract the language, be it English, Spanish, etc. That little part at the end, the second argument, is what the LLM gets as its prompt.

And hopefully it does okay, because we do have reviews in all different languages. Yeah, good question: Edgar, where is this data from? This is from Tidy Tuesday, 2020, May 5th.

One thing I wanted to show is the translations. There are a few reviews here that are not in English, which can showcase translate; it does refer to human language, so you can translate from one language to another. One thing I found with LLMs that is so cool is that you don't need to specify the origin language, just the target language, and it adapts to whatever language the text is in. You can have Italian, Spanish, and Russian all in the original text and want them all translated to English; you don't need to specify each origin, you just say you want English. That's very different from other translation algorithms out there. The LLM just picks it up and does it automatically. And this is true on both the R and Python sides.

Using chatlas as the Python provider

Before we go, I just want to show you how mall works with an external LLM provider. Instead of ellmer, we're using chatlas, which is essentially the same package but for Python. This is not my package, or ellmer for that matter; ellmer is a Posit package.

I'm basically doing the same thing. Let me do it on the other side. We're setting up a chat object, and under user reviews I'm going to use that chat object, so now it's set up for me to use directly here. Now it is calling AWS, going to Anthropic, and running just like it ran in R. The only thing we don't have is the nice progress bar, but it is working. Oh wow, on this one it said it's all negatives. This is from the user reviews, but it worked, it looks like. Yeah, some of them are right.

All right, everybody, we have a minute left. I wanted to re-post in the chat the repo that contains both of these files. This is just my personal repo, so if it's a mess, don't judge me. This was really, really fun. I'm so glad that you hung out with us at the Data Science Lab. I wanted to let you know that we have Sarah Altman joining us next week, and we just happen to have another LLM session. The lab is not always about LLMs, I promise, but we are going to be doing data analysis with the assistance of AI, which might mean Databot, might mean Claude Code. What does that look like in February of 2026? That's the question for Sarah. I hope you'll come and join us. And on Thursday at the Data Science Lab we have Alexander Schacht from Sitel, who is also the Effective Statistician podcaster. Come with your data science career questions, especially if you are a stats-flavored data science person; it's going to be transformative and wonderful. I hope you have a fantastic rest of your week. Please hang out on the Discord server with us if you have more questions; we are going to try to answer them. We love you, and we'll see you next time. Bye, everybody.