Telegram + AIMLAPI: photo text replies made easy

People send screenshots, receipts, labels, and forms in Telegram. Then you squint, zoom, retype the text, and still miss a line.

This hits support teams hardest, but agency operators and founders running community chats feel it too. With this Telegram OCR automation, you reply with clean extracted text and a quick image caption in about a minute, without manual typing.

Below, you’ll see exactly what the workflow does in n8n, what results to expect, and the few setup details that actually matter.

How This Automation Works

The full n8n workflow, from trigger to final output:

n8n Workflow Template: Telegram + AIMLAPI: photo text replies made easy

Workflow diagram (nodes run in this order):

Step 1 · 📩 Telegram Trigger (In)
Step 1.5 · 💬 Typing…
Step 2 · 📷 Get Photo
Step 3 · 🧩 Extract → base64
Step 3.5 · 🧑‍💻 Build Data URI
Step 4 · 🧠 AIMLAPI Vision (H..
Step 5 · 📤 Reply to Telegram

The Problem: Photos in chat create slow, error-prone replies

Telegram is fast until the conversation turns into images. A customer sends a blurry shipping label. A teammate drops a screenshot of an error message. Someone shares a receipt and asks, “Can you log this?” Now you are stuck doing the worst kind of work: zooming, copying by hand, double-checking line breaks, and apologizing when you misread a character. It’s not just time. It’s momentum. The chat stalls, the user waits, and your “quick support” channel starts feeling… not quick.

It adds up fast. Here’s where it breaks down.

You end up retyping text from photos multiple times a day, and small mistakes turn into longer back-and-forth.
Different people describe images differently, so replies feel inconsistent and harder to trust.
Copying text from screenshots on mobile is frustrating, which means slower responses when you’re away from your desk.
Even when you “get it right,” you still lose context because the useful parts of the image aren’t summarized.

The Solution: Telegram photo-to-text replies using AIMLAPI

This n8n workflow turns your Telegram bot into a practical vision assistant. When someone sends a photo, the workflow grabs the highest-resolution version available, converts it into a format an AI vision model can read, and asks the model for two things: a concise description of what’s in the image, plus any readable text (OCR). Then it posts that result back into the same chat, so the user can copy, verify, or continue the conversation with clean text instead of guesswork. No custom server. No separate OCR tool to manage. It’s just Telegram, n8n, and an AIMLAPI key using an OpenAI-compatible request format.

The workflow starts the moment a Telegram photo hits your bot. n8n retrieves the file, converts it to base64, and assembles a data URI so the vision model can “see” it. AIMLAPI returns the caption and extracted text, and the bot replies in-thread while the typing indicator keeps the chat feeling responsive.

What You Get: Automation vs. Results

What This Workflow Automates

Detects new Telegram messages that include a photo and routes them into the workflow.
Fetches the best-quality image file from Telegram automatically.
Converts the image to base64 and formats it as a vision-ready data URI.
Sends the image to AIMLAPI (OpenAI-compatible) and posts the model’s response back to the chat.

Results You’ll Get

Turn a “can you read this?” photo into usable text in about a minute.
Fewer typos and misread characters, which means fewer correction loops.
More consistent answers across teammates, even when the shift changes.
Faster triage because a short caption explains what the image is.
Less mental load in chat, so you keep conversations moving.

Example: What This Looks Like

Say your team gets 20 “text in an image” requests a day in Telegram. Manually, you might spend about 3 minutes per photo to zoom, retype, and sanity-check, which is roughly an hour of busywork daily. With this workflow, the human part is basically just receiving the message (a few seconds) while n8n processes the image and replies, usually within about a minute. You still review the output when it matters, but you’re no longer doing the copying by hand.

What You’ll Need

n8n instance (try n8n Cloud free)
Self-hosting option if you prefer (Hostinger works well)
Telegram for receiving photos and sending replies.
AIMLAPI to run the vision model request.
Telegram bot token (get it from @BotFather in Telegram).
AIMLAPI API key (get it from your AIMLAPI dashboard; base URL https://api.aimlapi.com/v1).

Skill level: Beginner. You’ll connect credentials in n8n and paste an API key, then test by sending a photo.

Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).

How It Works

A photo arrives in Telegram. The Telegram Trigger listens for new messages, and it kicks off as soon as your bot receives an image.

The workflow pulls the best version of the image. Telegram stores multiple sizes; n8n retrieves the highest-resolution file so OCR quality is better and the caption is more accurate.

The image is prepared for a vision model request. n8n converts the file to base64 and assembles a data URI (basically embedding the image data into the request in a standard way).

AIMLAPI generates the caption and OCR text. The HTTP Request node sends an OpenAI-compatible “messages” payload to a vision-capable model and receives back a concise description plus extracted text.

The reply goes back to the same chat. The final Telegram node posts the result where the user already is, so the text is immediately copyable and searchable.

You can easily modify the vision prompt to match your tone, language, or formatting rules. See the full implementation guide below for customization options.

Step-by-Step Implementation Guide

Step 1: Configure the Telegram Trigger

Set up the workflow to listen for incoming Telegram messages with images.

Add the Telegram Trigger Intake node and set Updates to message.
Credential Required: Connect your telegramApi credentials in Telegram Trigger Intake.

If your bot isn’t receiving updates, verify the Telegram webhook URL generated by Telegram Trigger Intake.

Step 2: Connect Telegram Actions

Configure the Telegram actions that indicate activity and retrieve the photo file for processing.

Add Send Typing Indicator and set Operation to sendChatAction.
Set Chat ID in Send Typing Indicator to {{ $json.message.chat.id }}.
Credential Required: Connect your telegramApi credentials in Send Typing Indicator.
Add Retrieve Photo File with Resource set to file and File ID set to {{ $('Telegram Trigger Intake').item.json.message.photo[$('Telegram Trigger Intake').item.json.message.photo.length - 1].file_id }}.
In Retrieve Photo File, set Additional Fields → Mime Type to image/jpeg.
Credential Required: Connect your telegramApi credentials in Retrieve Photo File.

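⚠️ Common Pitfall: If users send non-photo messages (text, stickers, or images attached as files), message.photo is missing and the File ID expression in Retrieve Photo File will fail. If your bot receives mixed content, add a check that message.photo exists before this node.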
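The File ID expression in Retrieve Photo File takes the last element of message.photo. Here is a minimal sketch of the same selection with a guard for non-photo messages, written as plain JavaScript that could be adapted to a Code node. The function name and the sample payload are hypothetical; the file_id, width, and height fields follow Telegram's PhotoSize object.

```javascript
// Sketch: choose the largest entry from a Telegram message's `photo` array.
// Returns null when the message has no photo (text, sticker, document, ...),
// so the caller can skip the file download instead of failing.
function pickLargestPhotoFileId(message) {
  const sizes = message?.photo;
  if (!Array.isArray(sizes) || sizes.length === 0) {
    return null;
  }
  // Compare by pixel area rather than relying on array order.
  const largest = sizes.reduce((best, size) =>
    size.width * size.height > best.width * best.height ? size : best
  );
  return largest.file_id;
}

// Example with a made-up message payload.
const exampleMessage = {
  chat: { id: 1 },
  photo: [
    { file_id: "small", width: 90, height: 60 },
    { file_id: "large", width: 1280, height: 853 },
    { file_id: "medium", width: 320, height: 213 },
  ],
};
console.log(pickLargestPhotoFileId(exampleMessage)); // large
console.log(pickLargestPhotoFileId({ text: "hi" })); // null
```

A null result is the signal to stop early or send a short "please send a photo" reply, which avoids a failed execution on text-only messages.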

Step 3: Set Up Image Processing

Convert the Telegram image into a Data URI format suitable for the vision model.

Add Convert Image to Base64 and set Operation to binaryToPropery.
Add Assemble Data URI and paste the provided JavaScript into JS Code to build the Data URI.

Execution Flow: Telegram Trigger Intake → Send Typing Indicator → Retrieve Photo File → Convert Image to Base64 → Assemble Data URI
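The template's own JavaScript is not reproduced here, so the following is only a sketch of what building a Data URI involves. The function name is hypothetical, and the default MIME type assumes the image/jpeg value set on Retrieve Photo File.

```javascript
// Sketch: wrap a base64 string in a data URI for an image_url field.
// `mimeType` defaults to image/jpeg, the type set on Retrieve Photo File.
function buildDataUri(base64, mimeType = "image/jpeg") {
  if (typeof base64 !== "string" || base64.length === 0) {
    throw new Error("Expected a non-empty base64 string");
  }
  // Remove whitespace or line breaks that some encoders insert.
  const clean = base64.replace(/\s+/g, "");
  return `data:${mimeType};base64,${clean}`;
}

// Usage with bytes encoded via Node's Buffer (sample bytes, not a real image).
const sampleBase64 = Buffer.from([0xff, 0xd8, 0xff]).toString("base64");
console.log(buildDataUri(sampleBase64)); // data:image/jpeg;base64,/9j/
```

In an n8n Code node the result would likely be returned as a dataUri field on the item's json, since the request body in Step 4 reads {{ $json.dataUri }}. Which input field holds the base64 string depends on how Convert Image to Base64 is configured.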

Step 4: Set Up the Vision API Call

Send the image Data URI to the vision model and retrieve a descriptive response.

Add Vision API Request and set Method to POST and URL to https://api.aimlapi.com/v1/chat/completions.
Set Specify Body to json and JSON Body to { "model": "openai/gpt-4o", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image. Then extract any visible text (OCR). Keep it concise." }, { "type": "image_url", "image_url": { "url": "{{ $json.dataUri }}" } } ] } ], "max_tokens": 300 }.
Credential Required: Connect your aimlApi credentials in Vision API Request.
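For testing the same request outside n8n, the JSON Body can be built as an object and serialized. This is a sketch under assumptions: the helper names are hypothetical, and the Bearer authorization header is assumed from the OpenAI-compatible format rather than taken from the workflow. The model, prompt, and max_tokens values are the ones in the JSON Body.

```javascript
// Sketch: build the chat-completions body used by Vision API Request.
// Model name and max_tokens are the values from the workflow's JSON Body.
function buildVisionPayload(dataUri, prompt) {
  return {
    model: "openai/gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: dataUri } },
        ],
      },
    ],
    max_tokens: 300,
  };
}

// Hypothetical helper: send the payload with fetch (Node 18+).
// The Bearer header is an assumption based on the OpenAI-compatible format.
async function requestVision(apiKey, payload) {
  const response = await fetch("https://api.aimlapi.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`Vision request failed with status ${response.status}`);
  }
  return response.json();
}

// Build and print a payload; no network call is made here.
const visionPayload = buildVisionPayload(
  "data:image/jpeg;base64,/9j/",
  "Describe this image. Then extract any visible text (OCR). Keep it concise."
);
console.log(JSON.stringify(visionPayload, null, 2));
```

Serializing an object with JSON.stringify keeps the body valid even if the prompt later contains quotes or line breaks.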

Step 5: Configure the Telegram Reply

Post the AI-generated description back to the user in Telegram.

Add Post Telegram Reply and set Text to {{ $json?.choices?.[0]?.message?.content || "Sorry, the model returned an empty response." }}.
Set Chat ID to {{ $('Telegram Trigger Intake').item.json.message.chat.id }}.
In Additional Fields, set Reply to Message ID to {{ $('Telegram Trigger Intake').item.json.message.message_id }} and Disable Web Page Preview to true.
Credential Required: Connect your telegramApi credentials in Post Telegram Reply.

Execution Flow: Assemble Data URI → Vision API Request → Post Telegram Reply
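The Text expression in Post Telegram Reply uses optional chaining with a fallback string. The same logic as a standalone function looks roughly like this; the function and constant names are hypothetical, and unlike the expression, this sketch also treats whitespace-only content as empty.

```javascript
// Sketch: pull the reply text out of an OpenAI-compatible response,
// falling back to a fixed message when the content is missing or blank.
const FALLBACK_REPLY = "Sorry, the model returned an empty response.";

function extractReplyText(response) {
  const content = response?.choices?.[0]?.message?.content;
  if (typeof content !== "string" || content.trim() === "") {
    return FALLBACK_REPLY;
  }
  return content.trim();
}

// Made-up responses to show each path.
const okResponse = { choices: [{ message: { content: "A receipt.\nTotal: 12.50" } }] };
console.log(extractReplyText(okResponse));
console.log(extractReplyText({ choices: [] })); // fallback message
console.log(extractReplyText({ error: { message: "bad request" } })); // fallback message
```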

Step 6: Test and Activate Your Workflow

Validate the end-to-end flow and then enable it for production.

Click Execute Workflow and send a photo to your Telegram bot to trigger Telegram Trigger Intake.
Confirm that the bot shows typing via Send Typing Indicator, and that Post Telegram Reply returns a concise image description plus OCR text.
If the response is empty, verify the Vision API Request response payload and the Text expression in Post Telegram Reply.
Once validated, switch the workflow to Active for continuous production use.

Common Gotchas

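Telegram bot tokens do not expire on their own, but they can be revoked or regenerated in @BotFather, and the bot can lose permissions in a group chat. If messages stop triggering, check that the bot is still present and allowed to read messages (in groups, privacy mode limits which messages a bot receives unless it is disabled in @BotFather or the bot is an admin), and that the token in n8n credentials is correct.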
If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
AIMLAPI requests can fail on timeouts for large images or slow responses. Increase the HTTP Request timeout and add a retry, then review the execution log in n8n for the exact status code.
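n8n nodes have a built-in Retry On Fail setting, which is usually the simplest option. If you script the call yourself, a retry policy might look like the sketch below. The status list, delays, and function names are illustrative assumptions, not AIMLAPI guidance.

```javascript
// Sketch: decide whether a failed request is worth retrying and how long to wait.
// The status list and delays are illustrative defaults, not AIMLAPI guidance.
function isRetryableStatus(status) {
  // 408 timeout, 429 rate limit, and 5xx server errors are usually transient.
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

function backoffDelayMs(attempt, baseMs = 1000, maxMs = 8000) {
  // attempt is 1-based; the delay doubles each time up to maxMs.
  return Math.min(baseMs * 2 ** (attempt - 1), maxMs);
}

// Hypothetical wrapper: `send` is any async function returning { status, body }.
// Errors thrown by `send` (for example a network timeout) are not caught here.
async function sendWithRetry(send, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await send();
    if (!isRetryableStatus(result.status) || attempt === maxAttempts) {
      return result;
    }
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}

console.log([1, 2, 3, 4, 5].map((n) => backoffDelayMs(n))); // [ 1000, 2000, 4000, 8000, 8000 ]
console.log(isRetryableStatus(429), isRetryableStatus(400)); // true false
```

A 400-class error other than 408 or 429 usually means the request itself is wrong (bad key, malformed body), so retrying it only delays the real fix.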
Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.

Frequently Asked Questions

How long does it take to set up this Telegram OCR automation?

About 20–30 minutes if you already have your Telegram bot token and AIMLAPI key.

Do I need coding skills to automate Telegram OCR replies?

No. You’ll import the workflow, connect credentials, and adjust a prompt if you want different formatting.

Is n8n free to use for this Telegram OCR automation workflow?

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in AIMLAPI usage costs, which depend on the vision model and how many images you process.

Where can I host n8n to run this Telegram OCR automation?

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

Can I customize this Telegram OCR automation workflow for language or formatting?

Yes, and it’s mostly prompt work. Update the instruction text sent in the Vision API Request so the model returns your preferred language, a tighter caption, or a clean “Caption:” and “Text:” layout. Common tweaks include adding brand tone, forcing bullet points for long text, and returning “No readable text found” when OCR is empty.
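As a sketch of the “Caption:” and “Text:” layout, the prompt can spell out the format and a small parser can split the reply afterwards. The prompt wording and both function names are illustrative, and a model may not always follow the layout exactly, so the parser returns empty strings when a label is missing.

```javascript
// Sketch: a prompt that requests a fixed "Caption:" / "Text:" layout,
// plus a parser for replies that follow it. Wording is illustrative.
function buildPrompt(language = "English") {
  return [
    `Reply in ${language}.`,
    "Line 1 must start with 'Caption:' followed by a one-sentence description.",
    "Line 2 must start with 'Text:' followed by all readable text in the image.",
    "If there is no readable text, write 'Text: No readable text found'.",
  ].join("\n");
}

function parseCaptionAndText(reply) {
  // Caption is a single line; Text runs from its label to the end of the reply.
  const captionMatch = reply.match(/^Caption:\s*(.*)$/m);
  const textMatch = reply.match(/^Text:\s*([\s\S]*)$/m);
  return {
    caption: captionMatch ? captionMatch[1].trim() : "",
    text: textMatch ? textMatch[1].trim() : "",
  };
}

// Made-up model reply in the requested layout.
const parsedReply = parseCaptionAndText(
  "Caption: A shipping label.\nText: ACME Corp\n42 Main St"
);
console.log(parsedReply.caption); // A shipping label.
console.log(parsedReply.text.split("\n").length); // 2
```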

Why is my Telegram connection failing in this workflow?

Usually it’s the bot token in n8n credentials, or the bot isn’t allowed to read messages in the chat you’re testing. Double-check the bot is added to the group (if applicable) and that you’re sending a photo rather than an image attached as a file, which Telegram delivers as a document and leaves message.photo empty. If triggers fire but replies don’t send, inspect the execution details in n8n and confirm the Post Telegram Reply node is using the right chat ID from the trigger.

How many images can this Telegram OCR automation handle?

On n8n Cloud, it depends on your monthly execution limit, and on self-hosting it depends on your server and AIMLAPI rate limits.

Is this Telegram OCR automation better than using Zapier or Make?

Often, yes. This workflow needs file retrieval, base64 conversion, and a custom OpenAI-compatible HTTP request, which n8n handles without awkward workarounds. You also get more control over retries, timeouts, and how the payload is built, which matters for vision requests. Zapier or Make can still work if you only want a basic “send image to AI, return text” flow, but costs and flexibility can get tricky as volume grows. Talk to an automation expert if you want a quick recommendation based on your chat volume.

Once this is running, photos stop being a bottleneck in Telegram. The workflow handles the repetitive copy work, and you focus on the actual conversation.
