DataGuy Documentation¶
Welcome to DataGuy, a Python package that simplifies data science workflows using Large Language Models (LLMs). It helps with data wrangling, analysis, and visualization for small-to-medium datasets.
GitHub: View the source code on GitHub
PyPI: Install from PyPI
Documentation: Read the full documentation
Demo: Try the demo
Features¶
Automated Data Wrangling: Clean and preprocess your data using LLM-generated code.
AI-Powered Data Visualization: Describe a plot in words, and let DataGuy build it.
Intelligent Data Analysis: Use natural language prompts to guide statistical summaries or comparisons.
Customizable Workflows: Integrate with pandas, matplotlib, and more.
Safe Code Execution: Built-in sandboxing restricts what LLM-generated code can do when it runs.
How It Works¶
DataGuy is an LLM-powered assistant for data exploration and analysis. This section explains how it interprets your input, generates code, handles errors, and delivers results.
Overview¶
The workflow consists of the following steps:
1. Model Selection: DataGuy decides whether your input is a request for a description, a plot, or a code transformation, and selects the appropriate LLM mode.
2. Context Building: DataGuy creates a conversational context that tracks previous prompts, results, and errors. This keeps interactions coherent and allows iterative improvement.
3. Prompt Generation: Based on your task, DataGuy builds one of three prompt types:
   Text mode: for dataset summaries or explanations
   Image mode: for understanding uploaded visualizations
   Code mode: for generating and executing data operations
4. LLM Interaction: The selected model writes Python code (e.g., with pandas or matplotlib) or produces a natural language response. If execution fails, DataGuy resubmits the failed code together with the error message for refinement.
5. Safe Code Execution: Generated code runs in a restricted, sandboxed environment to prevent dangerous operations.
6. Caching and Retry Logic: Past results are cached to avoid duplicate computation. Failed executions are retried automatically by feeding the error context back into the model.
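The last three steps (LLM interaction, restricted execution, and caching with retries) can be illustrated with a small, self-contained sketch. This is not DataGuy's actual implementation: `run_restricted`, `solve`, `fake_model`, and `SAFE_BUILTINS` are hypothetical names invented for this example, and the "model" is a stub that fails once and then corrects itself. Limiting builtins as shown here is only a simplified stand-in for sandboxing and should not be treated as a real security boundary.

```python
import hashlib

# Builtins the generated code may use; everything else (open, __import__, ...)
# is hidden. ASSUMPTION: illustrative only, not DataGuy's real sandbox.
SAFE_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max,
                 "sorted": sorted, "range": range}


def run_restricted(code, data):
    """Execute `code` with reduced builtins and return its `result` variable."""
    namespace = {"__builtins__": SAFE_BUILTINS, "data": data}
    exec(code, namespace)
    return namespace["result"]


def solve(task, data, ask_model, cache, max_retries=3):
    """Generate code for `task`, run it, and retry with the error on failure."""
    key = hashlib.sha256(f"{task}|{data!r}".encode()).hexdigest()
    if key in cache:  # reuse a past result instead of recomputing
        return cache[key]

    context = [f"Task: {task}"]
    for _ in range(max_retries):
        code = ask_model("\n".join(context))
        try:
            result = run_restricted(code, data)
        except Exception as exc:
            # Feed the failed code and its error back to the model.
            context.append(f"Previous code:\n{code}\nError: {exc!r}")
            continue
        cache[key] = result
        return result
    raise RuntimeError(f"no working code after {max_retries} attempts")


def fake_model(prompt):
    """Stand-in for an LLM: wrong at first, fixed once it sees an error."""
    if "Error:" in prompt:
        return "result = sum(data) / len(data)"
    return "result = mean(data)"  # `mean` is undefined, so this attempt fails


cache = {}
average = solve("average the values", [1, 2, 3, 4], fake_model, cache)
print(average)
```

The first attempt raises a `NameError`, the error text is appended to the context, and the second attempt succeeds. A repeated call with the same task and data is served from `cache` without calling the model again.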
Visual Workflow¶
Package Structure¶
Example Usage¶
from dataguy import DataGuy
import seaborn as sns
# Create the assistant (expects ANTHROPIC_API_KEY in the environment; see Quickstart)
dg = DataGuy()
# Load the Iris dataset
iris = sns.load_dataset("iris")
dg.set_data(iris)
# Wrangle the dataset
cleaned_data = dg.wrangle_data()
print("Cleaned Data:", cleaned_data)
# Describe the dataset
description = dg.describe_data()
print("Dataset Description:", description)
Installation¶
Install DataGuy via pip:
pip install dataguy
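To confirm that the install succeeded, one option is to query the installed distribution's version with the standard library. The helper name `installed_version` is made up for this sketch; the only assumption about DataGuy is that the distribution name matches the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of `dist_name`, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("dataguy"))
```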
Quickstart¶
import os
# Set the Anthropic API key as an environment variable
os.environ["ANTHROPIC_API_KEY"] = "your_anthropic_api_key_here"
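from dataguy import DataGuy
import seaborn as sns
# Create the assistant and load the Iris dataset
dg = DataGuy()
iris = sns.load_dataset("iris")
dg.set_data(iris)
# Clean the dataset with LLM-generated code and keep the result
cleaned_data = dg.wrangle_data()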
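Writing the key directly into a script makes it easy to leak by accident. A common alternative is to export `ANTHROPIC_API_KEY` in the shell and have the script only check that it is present. The helper below is a minimal sketch of that check; `require_api_key` is a hypothetical function written for this example and is not part of DataGuy.

```python
import os


def require_api_key(name="ANTHROPIC_API_KEY", env=None):
    """Return the key stored under `name`, or raise a clear error if unset."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating DataGuy()")
    return key
```

Calling `require_api_key()` before `DataGuy()` turns a missing key into an immediate, readable error instead of a failure later in the workflow.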