The mall package: using LLMs with data frames in R & Python | Edgar Ruiz | Data Science Lab
The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We'd love to have you!
The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.
On this call, Libby Heeren is joined by Edgar Ruiz as they walk through how mall works (with ellmer) in R, and then in Python. The mall package lets you use LLMs to process tabular data or vectors. For example, you can feed it a column of reviews and ask mall to use an Anthropic model via ellmer to add a column of summaries or sentiments. Follow along with the code here: https://github.com/LibbyHeeren/mall-package-r
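If the "column of reviews in, column of sentiments out" idea is new to you, here is a minimal sketch of the pattern in plain Python. It is not mall's API: `toy_sentiment` is a hypothetical keyword-based stand-in for a real model call, and `add_llm_column` is an illustrative helper name. The point is only the shape of the workflow, where one function runs row by row over a text column and its answers become a new column.

```python
from typing import Callable

Row = dict[str, str]


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for an LLM call.

    A real setup would send a prompt plus the text to a model;
    this toy version just counts a few keywords.
    """
    lowered = text.lower()
    positive = sum(word in lowered for word in ("great", "love", "excellent"))
    negative = sum(word in lowered for word in ("bad", "broke", "terrible"))
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


def add_llm_column(
    rows: list[Row],
    source: str,
    target: str,
    model: Callable[[str], str],
) -> list[Row]:
    """Return new rows where `target` holds the model's answer for `source`."""
    # Build new dicts so the original rows are left untouched.
    return [{**row, target: model(row[source])} for row in rows]


reviews = [
    {"review": "I love this washing machine, it works great"},
    {"review": "The remote broke after two days, terrible"},
    {"review": "It arrived on Tuesday"},
]

scored = add_llm_column(reviews, "review", "sentiment", toy_sentiment)

for row in scored:
    print(row["sentiment"], "|", row["review"])
```

Swapping `toy_sentiment` for a function that calls a local or hosted model keeps the rest of the pattern the same, which is roughly the convenience the call demonstrates with mall.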
Hosting crew from Posit: Libby Heeren, Isabella Velasquez, Edgar Ruiz
Edgar's Bluesky: https://bsky.app/profile/theotheredgar.bsky.social
Edgar's LinkedIn: https://www.linkedin.com/in/edgararuiz/
Edgar's GitHub: https://github.com/edgararuiz
Resources from the hosts and chat:
Ollama → https://ollama.com/download
Posit Data Science Lab → https://posit.co/dslab
mall package → https://mlverse.github.io/mall/
ellmer package → https://ellmer.tidyverse.org/
Libby's Positron theme (Catppuccin) → https://marketplace.visualstudio.com/items?itemName=Catppuccin.catppuccin-vsc
GitHub repo with Libby and Edgar's code → https://github.com/LibbyHeeren/mall-package-r
LLM providers supported by ellmer → https://ellmer.tidyverse.org/index.html#providers
vitals package → https://vitals.tidyverse.org/
chatlas package → https://posit-dev.github.io/chatlas/
polars package → https://pola.rs/
narwhals package → https://narwhals-dev.github.io/narwhals/
pandas package → https://pandas.pydata.org/
LM Studio → https://lmstudio.ai/
Simon Couch's blog → https://www.simonpcouch.com/
Edgar's dataset: TidyTuesday Animal Crossing Dataset (May 5, 2020) → https://github.com/rfordatascience/tidytuesday
Libby's dataset: Kaggle Tweets Dataset → https://www.kaggle.com/datasets/mmmarchetti/tweets-dataset
Blog from Sara and Simon on evaluating LLMs → https://posit.co/blog/r-llm-evaluation-03/
Data Science Lab YouTube playlist → https://www.youtube.com/watch?v=LDHGENv1NP4&list=PL9HYL-VRX0oSeWeMEGQt0id7adYQXebhT&index=2
AWS Bedrock → https://aws.amazon.com/bedrock/
Anthropic → https://www.anthropic.com/
Google Gemini → https://gemini.google.com/
What is rubber duck debugging anyway?? → https://en.wikipedia.org/wiki/Rubber_duck_debugging
► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for learning with us!
Timestamps
00:00 Introduction to Libby, Isabella, Edgar, and the mall package + ellmer package
07:14 "What's the difference between using mall for these NLP tasks versus traditional or classical NLP?"
09:37 "Can mall be used with a local LLM?"
17:32 "What kind of laptop specs should I realistically have to make good use of these models?"
22:12 "Are you limited to three output options?"
22:55 "Can mall return the prediction probabilities?"
24:14 "What is a rule-of-thumb set of specs for a machine so local LLMs are practically feasible?"
24:47 "Would that be in the additional prompt area where you're defining things?"
25:04 "You could use the vitals package to compare models, right?"
25:24 "Can we use LM Studio instead of Ollama?"
28:35 "How do you iterate and validate the model?"
36:39 "Why use paste if it is all text?"
37:31 "Are these recent tweets (from X) or older ones from actual Twitter?"
40:23 "Is there a playlist for the Data Science Labs on YouTube?"
46:11 "Does that mean that the python version does not work with pandas?"
50:14 "Where is this data set from?"