Vision (notebook)
In this notebook, we'll demonstrate agex integration with vision-capable models. We will give an agent access to the Pillow (PIL) library and ask it to perform an image manipulation task. This example highlights two core agex principles:
- Direct Library Integration: The agent will use the real Pillow library, not a simplified wrapper tool.
- Runtime Interoperability: The agent will receive and return actual PIL.Image objects, which we can use immediately in our code.
We'll use GPT-5 for this task, as it has strong vision capabilities.
import io
import PIL
from IPython.display import display, Image as IPImage
from PIL.Image import Image, open as pil_open
from agex import Agent
from agex.llm import connect_llm
# create an agent configured with a vision-capable llm
vizzy = Agent(
    name="vizzy",
    primer="You are an expert at image manipulation.",
    llm_client=connect_llm(provider="openai", model="gpt-5"),
)
# give the agent access to the Pillow library
vizzy.module(PIL, recursive=True, visibility="low")
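If you'd rather expose only a slice of Pillow, the same registration call should work on a submodule. A minimal sketch, assuming module() accepts a submodule with the same signature used above:
import PIL.ImageOps
# a hedged sketch: register only PIL.ImageOps instead of the whole package,
# reusing the registration signature shown above
vizzy.module(PIL.ImageOps, recursive=True, visibility="low")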
First, let's load the image we'll be working with. It's a photo of a homemade Halloween decoration.
We'll display a thumbnail to see what the agent will be starting with.
spider = pil_open("examples/spider.png")
# let's look at a thumbnail of the image
buffer = io.BytesIO()
spider.save(buffer, format='PNG')
display(IPImage(data=buffer.getvalue(), width=480))
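Strictly speaking, width=480 only scales the rendered display; the buffer still holds the full-resolution pixels. A small sketch of a true thumbnail, using Pillow's in-place thumbnail method:
# downsample the pixel data itself rather than just the display size
preview = spider.copy()        # thumbnail() modifies in place, so work on a copy
preview.thumbnail((480, 480))  # fits within 480x480, preserving aspect ratio
buffer = io.BytesIO()
preview.save(buffer, format="PNG")
display(IPImage(data=buffer.getvalue()))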
Next, we'll define the task for our agent. The signature is straightforward: it takes a text prompt and a PIL.Image object, and it returns a new PIL.Image object.
We'll also introduce the @task(setup=...) parameter. This is a handy feature for efficiency. The setup string provides code that runs in the agent's sandbox before the main reasoning loop begins. By using setup="view_image(inputs.image)", we instruct the agent to immediately view the image it receives. This gives it the visual context it needs from the very start, saving an entire LLM turn that would otherwise be spent just deciding to inspect the input.
# define a task and use setup to automatically have the agent view the image
@vizzy.task(setup="view_image(inputs.image)")
def process_image(prompt: str, image: Image) -> Image:  # type: ignore[return-value]
    """Process an image based on a prompt and return the new image object."""
    pass
Now, let's kick off our task. We'll ask the agent to crop the image to focus on the spider and its web. The agent will receive our prompt and the spider image object.
cropped = process_image("Crop to show just the spider & web", spider)
display(cropped)
Success! The agent correctly interpreted the prompt, analyzed the image it "viewed" in the setup step, and used its knowledge of the Pillow library to perform the crop.
Agents often don't get the crop right on the first try, but they notice and self-correct after reviewing their crop with view_image. The ability to inspect images is a primitive we provide to agents by default whenever they work with image-oriented types. This is handy not just for photos but also for plots and data visualizations.
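For instance, a chart rendered with matplotlib can be passed through the very same task. A minimal sketch, assuming matplotlib is installed and that the agent treats a rasterized chart no differently than a photo (the prompt here is purely illustrative):
import matplotlib.pyplot as plt
# render a toy line chart
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [3, 1, 4, 1])
ax.set_title("toy data")
# rasterize the figure into a PIL image via an in-memory PNG
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
buffer.seek(0)
chart = pil_open(buffer)
# process_image accepts any PIL Image, so the agent can view_image the
# chart and transform it exactly as it did the photo
annotated = process_image("Circle the highest point on the line", chart)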
Bonus!
Interested in seeing the agent think and self-correct?
By adding on_event=display to a task call, we can see all of the agent's events rendered here in the notebook as they happen. This includes OutputEvents, which are generated whenever the agent uses functions like view_image or print to inspect its own work; it's the agent's equivalent of stdout.
cropped = process_image(
    "Crop to show just the fountain",
    spider,
    on_event=display,
)
display(cropped)
Crop to show just the fountain
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1050x1396 at 0x120EF7EC0>
view_image(inputs.image)
from PIL import Image
# Access the provided inputs
img = inputs.image
w, h = img.size
# Estimated crop box to isolate the fountain on the right side near the bottom
left = int(0.60 * w)
top = int(0.42 * h)
right = int(0.97 * w)
bottom = int(0.95 * h)
cropped = img.crop((left, top, right, bottom))
# Preview the crop to verify
view_image(cropped)
# End this iteration to verify the crop visually before finalizing
task_continue("Previewing the cropped image focusing on the fountain.")
Previewing the cropped image focusing on the fountain.
from PIL import Image
img = inputs.image
w, h = img.size
# Refined bounding box to focus more tightly on the fountain
left = int(0.74 * w)
top = int(0.36 * h)
right = int(0.99 * w)
bottom = int(0.88 * h)
# Clamp just in case
left = max(0, min(left, w-1))
right = max(left+1, min(right, w))
top = max(0, min(top, h-1))
bottom = max(top+1, min(bottom, h))
cropped2 = img.crop((left, top, right, bottom))
view_image(cropped2)
task_continue("Refined crop focusing on just the fountain. If this looks good, I'll return it as the final image.")
Refined crop focusing on just the fountain. If this looks good, I'll return it as the final image.
# Return the verified cropped image
task_success(cropped2)
<PIL.Image.Image image mode=RGBA size=262x726 at 0x1213D2EA0>
Perhaps not the best cropping job. But I appreciate the effort!