Telegram + AIMLAPI: photo text replies made easy
People send screenshots, receipts, labels, and forms in Telegram. Then you squint, zoom, retype the text, and still miss a line.
This hits support teams hardest, but agency operators and founders running community chats feel it too. With this Telegram OCR automation, you reply with clean extracted text and a quick image caption in under a minute, without manual typing.
Below, you’ll see exactly what the workflow does in n8n, what results to expect, and the few setup details that actually matter.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: Telegram + AIMLAPI: photo text replies made easy
Workflow diagram, in execution order: Step 1 · Telegram Trigger (In); Step 1.5 · Typing…; Step 2 · Get Photo; Step 3 · Extract → base64; Step 3.5 · Build Data URI; Step 4 · AIMLAPI Vision; Step 5 · Reply to Telegram.
The Problem: Photos in chat create slow, error-prone replies
Telegram is fast until the conversation turns into images. A customer sends a blurry shipping label. A teammate drops a screenshot of an error message. Someone shares a receipt and asks, “Can you log this?” Now you are stuck doing the worst kind of work: zooming, copying by hand, double-checking line breaks, and apologizing when you misread a character. It’s not just time. It’s momentum. The chat stalls, the user waits, and your “quick support” channel starts feeling… not quick.
It adds up fast. Here’s where it breaks down.
- You end up retyping text from photos multiple times a day, and small mistakes turn into longer back-and-forth.
- Different people describe images differently, so replies feel inconsistent and harder to trust.
- Copying text from screenshots on mobile is frustrating, which means slower responses when you’re away from your desk.
- Even when you “get it right,” you still lose context because the useful parts of the image aren’t summarized.
The Solution: Telegram photo-to-text replies using AIMLAPI
This n8n workflow turns your Telegram bot into a practical vision assistant. When someone sends a photo, the workflow grabs the highest-resolution version available, converts it into a format an AI vision model can read, and asks the model for two things: a concise description of what’s in the image, plus any readable text (OCR). Then it posts that result back into the same chat, so the user can copy, verify, or continue the conversation with clean text instead of guesswork. No custom server. No separate OCR tool to manage. It’s just Telegram, n8n, and an AIMLAPI key using an OpenAI-compatible request format.
The workflow starts the moment a Telegram photo hits your bot. n8n retrieves the file, converts it to base64, and assembles a data URI so the vision model can “see” it. AIMLAPI returns the caption and extracted text, and the bot replies in-thread while the typing indicator keeps the chat feeling responsive.
Example: What This Looks Like
Say your team gets 20 “text in an image” requests a day in Telegram. Manually, you might spend about 3 minutes per photo to zoom, retype, and sanity-check, which is roughly an hour of busywork daily. With this workflow, the human part is basically just receiving the message (a few seconds) while n8n processes the image and replies, usually within about a minute. You still review the output when it matters, but you’re no longer doing the copying by hand.
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Telegram for receiving photos and sending replies.
- AIMLAPI to run the vision model request.
- Telegram bot token (get it from @BotFather in Telegram).
- AIMLAPI API key (get it from your AIMLAPI dashboard; base URL https://api.aimlapi.com/v1).
Skill level: Beginner. You’ll connect credentials in n8n and paste an API key, then test by sending a photo.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
A photo arrives in Telegram. The Telegram Trigger listens for new messages, and it kicks off as soon as your bot receives an image.
The workflow pulls the best version of the image. Telegram stores multiple sizes; n8n retrieves the highest-resolution file so OCR quality is better and the caption is more accurate.
The image is prepared for a vision model request. n8n converts the file to base64 and assembles a data URI (basically embedding the image data into the request in a standard way).
AIMLAPI generates the caption and OCR text. The HTTP Request node sends an OpenAI-compatible “messages” payload to a vision-capable model and receives back a concise description plus extracted text.
The reply goes back to the same chat. The final Telegram node posts the result where the user already is, so the text is immediately copyable and searchable.
You can easily modify the vision prompt to match your tone, language, or formatting rules. See the full implementation guide below for customization options.
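As one illustration of that kind of prompt change, the sketch below composes an instruction string that asks for a fixed "Caption:" and "Text:" layout plus a fallback line when nothing is readable. The function name `buildVisionPrompt` and its options are hypothetical and not part of the workflow template; the resulting string would stand in for the instruction text sent to the vision model.

```javascript
// Hypothetical helper: composes the instruction text for the vision request.
// The option names and defaults are illustrative assumptions.
function buildVisionPrompt({ language = "English", maxCaptionWords = 20 } = {}) {
  return [
    `Reply in ${language}.`,
    `First line: "Caption:" followed by a description of at most ${maxCaptionWords} words.`,
    `Then "Text:" followed by all readable text in the image, preserving line breaks.`,
    `If there is no readable text, write "Text: No readable text found".`,
  ].join("\n");
}

// Example: a German-language prompt with a tighter caption.
console.log(buildVisionPrompt({ language: "German", maxCaptionWords: 12 }));
```

Keeping the layout rules in one place makes it easier to adjust tone or language later without touching the rest of the request.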
Step-by-Step Implementation Guide
Step 1: Configure the Telegram Trigger
Set up the workflow to listen for incoming Telegram messages with images.
- Add the Telegram Trigger Intake node and set Updates to `message`.
- Credential Required: Connect your telegramApi credentials in Telegram Trigger Intake.
Step 2: Connect Telegram Actions
Configure the Telegram actions that indicate activity and retrieve the photo file for processing.
- Add Send Typing Indicator and set Operation to `sendChatAction`.
- Set Chat ID in Send Typing Indicator to `{{ $json.message.chat.id }}`.
- Credential Required: Connect your telegramApi credentials in Send Typing Indicator.
- Add Retrieve Photo File with Resource set to `file` and File ID set to `{{ $('Telegram Trigger Intake').item.json.message.photo[$('Telegram Trigger Intake').item.json.message.photo.length - 1].file_id }}`.
- In Retrieve Photo File, set Additional Fields → Mime Type to `image/jpeg`.
- Credential Required: Connect your telegramApi credentials in Retrieve Photo File.
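The File ID expression above takes the last entry of `message.photo`, which assumes Telegram lists the available sizes from smallest to largest. A minimal sketch of the same selection as a plain function follows, assuming the usual `PhotoSize` fields (`file_id`, `width`, `height`). It compares pixel area instead of array position, which is a swapped-in variation rather than what the workflow does, and `pickLargestPhoto` is a hypothetical name.

```javascript
// Hypothetical helper: returns the file_id of the largest photo size, or null
// when the message carries no photo (for example, an image sent as a file).
function pickLargestPhoto(message) {
  const sizes = message?.photo;
  if (!Array.isArray(sizes) || sizes.length === 0) return null;
  // Compare by pixel area rather than trusting array order.
  const largest = sizes.reduce((best, size) =>
    size.width * size.height > best.width * best.height ? size : best
  );
  return largest.file_id;
}

// Illustrative message shape with made-up file IDs.
const sample = {
  photo: [
    { file_id: "small", width: 90, height: 67 },
    { file_id: "medium", width: 320, height: 240 },
    { file_id: "large", width: 1280, height: 960 },
  ],
};
console.log(pickLargestPhoto(sample)); // large
console.log(pickLargestPhoto({ document: { file_id: "doc" } })); // null
```

The `null` branch is worth keeping in mind when testing: a message without a `photo` array gives the workflow's expression nothing to index into.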
Step 3: Set Up Image Processing
Convert the Telegram image into a Data URI format suitable for the vision model.
- Add Convert Image to Base64 and set Operation to `binaryToPropery`.
- Add Assemble Data URI and paste the provided JavaScript into JS Code to build the Data URI.
Execution Flow: Telegram Trigger Intake → Send Typing Indicator → Retrieve Photo File → Convert Image to Base64 → Assemble Data URI
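The template's own JavaScript is not reproduced here, so the following is only a sketch of what a Data URI step could look like. It assumes the previous node wrote the base64 string to a field named `data` (an assumption; check the output of Convert Image to Base64 in your instance) and that the result should land in `dataUri`, the field the vision request reads. `toDataUri` and `mapItems` are hypothetical names.

```javascript
// Hypothetical helper: wraps a base64 string in a data URI.
function toDataUri(base64, mimeType = "image/jpeg") {
  // Strip whitespace and newlines that some encoders insert into base64 output.
  const clean = String(base64).replace(/\s+/g, "");
  if (clean.length === 0) throw new Error("Empty base64 payload");
  return `data:${mimeType};base64,${clean}`;
}

// Shape of the mapping a Code node would perform over its input items.
// Assumption: the base64 string is available at item.json.data.
function mapItems(items) {
  return items.map((item) => ({
    json: { dataUri: toDataUri(item.json.data) },
  }));
}

// Local demo with a tiny base64 payload in place of real image data.
const out = mapItems([{ json: { data: "aGVsbG8=" } }]);
console.log(out[0].json.dataUri); // data:image/jpeg;base64,aGVsbG8=
```

Inside an n8n Code node set to run once for all items, the demo lines would likely be replaced by `return mapItems($input.all());`.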
Step 4: Set Up the Vision API Call
Send the image Data URI to the vision model and retrieve a descriptive response.
- Add Vision API Request and set Method to `POST` and URL to `https://api.aimlapi.com/v1/chat/completions`.
- Set Specify Body to `json` and JSON Body to `{ "model": "openai/gpt-4o", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image. Then extract any visible text (OCR). Keep it concise." }, { "type": "image_url", "image_url": { "url": "{{ $json.dataUri }}" } } ] } ], "max_tokens": 300 }`.
- Credential Required: Connect your aimlApi credentials in Vision API Request.
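The JSON Body above can also be read as an ordinary HTTP request. The sketch below builds the same payload in JavaScript and returns the arguments for a `fetch` call without sending anything. `buildVisionRequest` is a hypothetical name, and the `Authorization: Bearer` header is an assumption based on the OpenAI-compatible format; in n8n the aimlApi credential handles authentication for you.

```javascript
const AIMLAPI_URL = "https://api.aimlapi.com/v1/chat/completions";

// Hypothetical helper: returns { url, options } suitable for fetch(url, options).
function buildVisionRequest(dataUri, apiKey, prompt) {
  const body = {
    model: "openai/gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          { type: "image_url", image_url: { url: dataUri } },
        ],
      },
    ],
    max_tokens: 300,
  };
  return {
    url: AIMLAPI_URL,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: bearer-token auth, as in other OpenAI-compatible APIs.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// Demo with placeholder values; no network call is made here.
const req = buildVisionRequest(
  "data:image/jpeg;base64,AAAA",
  "YOUR_AIMLAPI_KEY",
  "Describe this image. Then extract any visible text (OCR). Keep it concise."
);
console.log(JSON.parse(req.options.body).messages[0].content.length); // 2
```

Because `max_tokens` caps the length of the reply, long documents may come back cut off at 300; raising it trades cost for completeness.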
Step 5: Configure the Telegram Reply
Post the AI-generated description back to the user in Telegram.
- Add Post Telegram Reply and set Text to `{{ $json?.choices?.[0]?.message?.content || "Sorry, the model returned an empty response." }}`.
- Set Chat ID to `{{ $('Telegram Trigger Intake').item.json.message.chat.id }}`.
- In Additional Fields, set Reply to Message ID to `{{ $('Telegram Trigger Intake').item.json.message.message_id }}` and Disable Web Page Preview to `true`.
- Credential Required: Connect your telegramApi credentials in Post Telegram Reply.
Execution Flow: Assemble Data URI → Vision API Request → Post Telegram Reply
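The Text expression in Post Telegram Reply reads the first choice from the model response and falls back to a fixed apology when nothing is there. The sketch below mirrors that logic as a standalone function so the fallback behavior is easy to check. `extractReply` is a hypothetical name, and treating whitespace-only content as empty is a small addition that the original expression does not make.

```javascript
const FALLBACK = "Sorry, the model returned an empty response.";

// Hypothetical helper: pulls the reply text out of a chat-completions response.
function extractReply(response) {
  const content = response?.choices?.[0]?.message?.content;
  // Whitespace-only content is treated as empty (an addition to the expression).
  if (typeof content === "string" && content.trim() !== "") {
    return content.trim();
  }
  return FALLBACK;
}

// A well-formed response and an error-shaped payload with no choices.
console.log(extractReply({ choices: [{ message: { content: "Caption: a receipt" } }] }));
console.log(extractReply({ error: { message: "rate limited" } }));
```

The second call prints the fallback text, which is also what a user would see if the API returned an error body instead of choices.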
Step 6: Test and Activate Your Workflow
Validate the end-to-end flow and then enable it for production.
- Click Execute Workflow and send a photo to your Telegram bot to trigger Telegram Trigger Intake.
- Confirm that the bot shows typing via Send Typing Indicator, and that Post Telegram Reply returns a concise image description plus OCR text.
- If the response is empty, verify the Vision API Request response payload and the Text expression in Post Telegram Reply.
- Once validated, switch the workflow to Active for continuous production use.
Common Gotchas
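- A Telegram bot token can be revoked or regenerated in @BotFather, and a bot can lose permissions in a group chat. If messages stop triggering, check that the bot is still present, that it is allowed to read messages (in groups, the bot's privacy mode limits which messages it receives), and that the token in your n8n credentials is current.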
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- AIMLAPI requests can fail on timeouts for large images or slow responses. Increase the HTTP Request timeout and add a retry, then review the execution log in n8n for the exact status code.
- Default prompts in AI nodes are generic. Add your brand voice early or you’ll be editing outputs forever.
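For the timeout and retry advice above, n8n's own node settings are usually enough, so no code is required. The sketch below only illustrates the underlying idea, retrying a failing async call with exponential backoff, using a fake operation in place of a real request. `withRetry` and its option names are hypothetical.

```javascript
// Hypothetical helper: retries an async operation with exponential backoff.
async function withRetry(operation, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      // Delay doubles each time: baseDelayMs, then 2x, then 4x, and so on.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Demo with a fake operation that fails twice, then succeeds.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error(`timeout on call ${calls}`);
  return "ok";
};

withRetry(flaky, { retries: 3, baseDelayMs: 10 }).then((result) => {
  console.log(result, "after", calls, "calls"); // ok after 3 calls
});
```

Backoff matters most when failures come from rate limits or a slow model, since retrying immediately tends to hit the same condition again.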
Frequently Asked Questions
How long does it take to set up this Telegram OCR automation?
About 20–30 minutes if you already have your Telegram bot token and AIMLAPI key.
Do I need coding skills to set this up?
No. You’ll import the workflow, connect credentials, and adjust a prompt if you want different formatting.
Is n8n free to use for this workflow?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in AIMLAPI usage costs, which depend on the vision model and how many images you process.
Where can I host n8n to run this automation?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I customize the reply format or language?
Yes, and it’s mostly prompt work. Update the instruction text sent in the Vision API Request so the model returns your preferred language, a tighter caption, or a clean “Caption:” and “Text:” layout. Common tweaks include adding brand tone, forcing bullet points for long text, and returning “No readable text found” when OCR is empty.
Why is my Telegram bot not triggering or replying?
Usually it’s the bot token in n8n credentials, or the bot isn’t allowed to read messages in the chat you’re testing. Double-check that the bot is added to the group (if applicable) and that you’re sending a photo, not a file attachment type your trigger isn’t listening for. If triggers fire but replies don’t send, inspect the execution details in n8n and confirm the Post Telegram Reply node is using the right chat ID from the trigger.
How many photos can this workflow handle?
On n8n Cloud, it depends on your monthly execution limit, and on self-hosting it depends on your server and AIMLAPI rate limits.
Is n8n better than Zapier or Make for this workflow?
Often, yes. This workflow needs file retrieval, base64 conversion, and a custom OpenAI-compatible HTTP request, which n8n handles without awkward workarounds. You also get more control over retries, timeouts, and how the payload is built, which matters for vision requests. Zapier or Make can still work if you only want a basic “send image to AI, return text” flow, but costs and flexibility can get tricky as volume grows. Talk to an automation expert if you want a quick recommendation based on your chat volume.
Once this is running, photos stop being a bottleneck in Telegram. The workflow handles the repetitive copy work, and you focus on the actual conversation.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.