Vision (notebook)
In this notebook, we'll demonstrate agex integration with vision-capable models. We will give an agent access to the Pillow (PIL) library and ask it to perform an image manipulation task. This example highlights two core agex principles:
- Direct Library Integration: The agent will use the real Pillow library, not a simplified wrapper tool.
- Runtime Interoperability: The agent will receive and return actual PIL.Image objects, which we can use immediately in our code.
We'll use GPT-5 for this task, as it has strong vision capabilities.
import io
import PIL
from IPython.display import display, Image as IPImage
from PIL.Image import Image, open as pil_open
from agex import Agent
from agex.llm import connect_llm
# create an agent configured with a vision-capable llm
vizzy = Agent(
    name="vizzy",
    primer="You are an expert at image manipulation.",
    llm_client=connect_llm(provider="openai", model="gpt-5"),
)
# give the agent access to the Pillow library
vizzy.module(PIL, recursive=True, visibility="low")
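If you'd rather expose only a slice of Pillow, the same registration call should work on a submodule. A minimal sketch, assuming module() accepts a submodule with the same signature used above:
import PIL.ImageOps
# a hedged sketch: register only PIL.ImageOps instead of the whole package,
# reusing the registration signature shown above
vizzy.module(PIL.ImageOps, recursive=True, visibility="low")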
First, let's load the image we'll be working with. It's a photo of a homemade Halloween decoration.
We'll display a thumbnail to see what the agent will be starting with.
spider = pil_open("examples/spider.png")
# let's look at a thumbnail of the image
buffer = io.BytesIO()
spider.save(buffer, format='PNG')
display(IPImage(data=buffer.getvalue(), width=480))
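Strictly speaking, width=480 only scales the rendered display; the buffer still holds the full-resolution pixels. A small sketch of a true thumbnail, using Pillow's in-place thumbnail method:
# downsample the pixel data itself rather than just the display size
preview = spider.copy()        # thumbnail() modifies in place, so work on a copy
preview.thumbnail((480, 480))  # fits within 480x480, preserving aspect ratio
buffer = io.BytesIO()
preview.save(buffer, format="PNG")
display(IPImage(data=buffer.getvalue()))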
Next, we'll define the task for our agent. The signature is straightforward: it takes a text prompt and a PIL.Image object, and it returns a new PIL.Image object.
We'll also introduce the @task(setup=...) parameter. This is a handy feature for efficiency. The setup string provides code that runs in the agent's sandbox before the main reasoning loop begins. By using setup="view_image(inputs.image)", we instruct the agent to immediately view the image it receives. This gives it the visual context it needs from the very start, saving an entire LLM turn that would otherwise be spent just deciding to inspect the input.
# define a task and use setup to automatically have the agent view the image
@vizzy.task(setup="view_image(inputs.image)")
def process_image(prompt: str, image: Image) -> Image:  # type: ignore[return-value]
    """Process an image based on a prompt and return the new image object."""
    pass
Now, let's kick off our task. We'll ask the agent to crop the image to focus on the spider and its web. The agent will receive our prompt and the spider image object.
cropped = process_image("Crop to show just the spider & web", spider)
display(cropped)
Success! The agent correctly interpreted the prompt, analyzed the image it "viewed" in the setup step, and used its knowledge of the Pillow library to perform the crop.
Agents often don't get the crop right on the first try, but they notice and self-correct after reviewing their crop with view_image. The ability to inspect images is a primitive we provide to agents by default whenever they work with image-oriented types. This is handy not just for photos but also for plots and data visualizations.
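For instance, a chart rendered with matplotlib can be passed through the very same task. A minimal sketch, assuming matplotlib is installed and that the agent treats a rasterized chart no differently than a photo (the prompt here is purely illustrative):
import matplotlib.pyplot as plt
# render a toy line chart
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [3, 1, 4, 1])
ax.set_title("toy data")
# rasterize the figure into a PIL image via an in-memory PNG
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
buffer.seek(0)
chart = pil_open(buffer)
# process_image accepts any PIL Image, so the agent can view_image the
# chart and transform it exactly as it did the photo
annotated = process_image("Circle the highest point on the line", chart)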
Bonus!
Interested in seeing the agent think and self-correct?
By adding on_event=display to a task call, we can see all of the agent's events rendered here in the notebook as they happen. This includes OutputEvents, which are generated whenever the agent uses functions like view_image or print to inspect its own work; it's the agent's equivalent of stdout.
cropped = process_image(
    "Crop to show just the fountain",
    spider,
    on_event=display,
)
display(cropped)
Crop to show just the fountain
<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1050x1396 at 0x120EF7EC0>
view_image(inputs.image)
from PIL import Image
# Access the provided inputs
img = inputs.image
w, h = img.size
# Estimated crop box to isolate the fountain on the right side near the bottom
left = int(0.60 * w)
top = int(0.42 * h)
right = int(0.97 * w)
bottom = int(0.95 * h)
cropped = img.crop((left, top, right, bottom))
# Preview the crop to verify
view_image(cropped)
# End this iteration to verify the crop visually before finalizing
task_continue("Previewing the cropped image focusing on the fountain.")
Previewing the cropped image focusing on the fountain.
from PIL import Image
img = inputs.image
w, h = img.size
# Refined bounding box to focus more tightly on the fountain
left = int(0.74 * w)
top = int(0.36 * h)
right = int(0.99 * w)
bottom = int(0.88 * h)
# Clamp just in case
left = max(0, min(left, w-1))
right = max(left+1, min(right, w))
top = max(0, min(top, h-1))
bottom = max(top+1, min(bottom, h))
cropped2 = img.crop((left, top, right, bottom))
view_image(cropped2)
task_continue("Refined crop focusing on just the fountain. If this looks good, I'll return it as the final image.")
Refined crop focusing on just the fountain. If this looks good, I'll return it as the final image.
# Return the verified cropped image
task_success(cropped2)
<PIL.Image.Image image mode=RGBA size=262x726 at 0x1213D2EA0>
Perhaps not the best cropping job. But I appreciate the effort!