Telegram + Gemini Vision: extract text from images
Retyping text from screenshots is the kind of “small task” that quietly ruins your day. You lose a few minutes here, make a typo there, then waste more time fixing it in Slack, a doc, or a CRM.
This Telegram OCR automation helps marketers pulling ad copy, ops teams capturing receipts or labels, and consultants collecting notes from calls and workshops. You send an image to a Telegram bot, and you get clean, copyable text back in seconds.
Below, you’ll see exactly what the workflow does, how it’s wired, what you need to run it, and where teams usually trip up when they set it live.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: Telegram + Gemini Vision: extract text from images
[Workflow diagram, "Telegram Flow": Telegram Trigger → Clean Input Data → get file → Extract from File → Gemini OCR → Telegram]
The Problem: Screenshots Trap Your Information
Screenshots are convenient until you need the text inside them. A teammate shares a pricing table as an image. A client sends “just a quick photo” of a form. You grab a snippet of ad copy from a swipe file on your phone. Now the information is stuck. You either retype it (slow), run it through a random OCR website (sketchy), or postpone it (and forget). The annoying part is the mental context switching: zooming in, copying line by line, fixing weird spacing, and hoping you didn’t swap a 0 for an O.
The friction compounds. It’s rarely one screenshot. It’s ten.
- You end up spending about 10 minutes per image just to get usable text, especially when it’s small font or messy lighting.
- Typos sneak into quotes, addresses, and product specs, which means follow-up messages and avoidable back-and-forth.
- OCR websites often require uploads, and frankly that can be a compliance headache for client or internal documents.
- Even when you extract the text, it’s not where you need it, so you still bounce between apps to share it.
The Solution: Telegram Bot OCR Powered by Gemini Vision
This workflow turns Telegram into your “send it here, get text back” inbox. When you message your bot with a screenshot or photo, n8n grabs the image file, converts it into a format the AI can read, and sends it to the Google Gemini Vision API for analysis. Gemini extracts the text it sees (including multi-line blocks like paragraphs, menus, or snippets of UI). Then n8n posts the results right back into the same Telegram chat, so the text is immediately copyable. No downloading, no retyping, no hunting for a tool you used three months ago.
The workflow starts the moment a new image hits your Telegram bot. It then retrieves the actual photo file, converts the binary content into the payload Gemini expects, and sends an HTTP request to Gemini Vision. Finally, it replies in Telegram with the extracted text so you can paste it anywhere.
Example: What This Looks Like
Say you capture 15 screenshots a week: competitor ads, analytics callouts, and random “don’t forget this” notes. If you spend about 10 minutes per screenshot retyping and cleaning it up, that’s roughly 2.5 hours weekly. With this workflow, you forward each image to Telegram (maybe 30 seconds), then wait for the reply (often under a minute). Realistically, you get back about 2 hours a week, and the text is already ready to paste into a doc or message.
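If you want to plug in your own numbers, the arithmetic above fits in a few lines. This is only a sketch: the constants come from the example, and the one-minute wait is treated as an upper bound, so your real savings may differ.

```python
# Figures taken from the example above; adjust them to your own volume.
SCREENSHOTS_PER_WEEK = 15
MANUAL_MINUTES_EACH = 10.0    # retyping and cleaning up one screenshot
FORWARD_MINUTES_EACH = 0.5    # forwarding the image to the bot
WAIT_MINUTES_EACH = 1.0       # assumed upper bound on waiting for the reply


def weekly_hours(minutes_per_item: float, items: int = SCREENSHOTS_PER_WEEK) -> float:
    """Convert a per-screenshot cost in minutes into hours per week."""
    return minutes_per_item * items / 60


manual = weekly_hours(MANUAL_MINUTES_EACH)
automated = weekly_hours(FORWARD_MINUTES_EACH + WAIT_MINUTES_EACH)
saved = manual - automated

print(f"manual: {manual:.2f} h, automated: {automated:.2f} h, saved: {saved:.2f} h")
```

With the example's figures, manual work comes to 2.5 hours and the saving lands a little above 2 hours, which matches the estimate in the text.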
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Telegram to receive images and return text.
- Google Gemini Vision API for OCR and text extraction.
- Gemini API key (get it from Google AI Studio).
Skill level: Beginner. You’ll paste in an API key and connect a Telegram bot, then test with a sample screenshot.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
A Telegram image kicks things off. When someone sends your bot a photo or screenshot, the Telegram Trigger picks it up instantly and passes along the message metadata.
The workflow normalizes what came in. n8n cleans up the incoming fields so the next steps always know which file ID to request, even if Telegram’s payload shape varies a bit.
n8n retrieves and converts the image. It calls Telegram again to download the actual file, then converts the binary so it can be safely included in the request that Gemini Vision expects.
Gemini extracts text and Telegram receives the reply. An HTTP request sends the image to Gemini Vision, n8n pulls the extracted text from the response, and the final Telegram node sends that text back to your chat.
You can easily modify the outgoing message format to include line breaks, headings, or “copy blocks” based on your needs. See the full implementation guide below for customization options.
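To make the four stages easier to picture outside the n8n canvas, here is a rough Python sketch of the same flow. It is not how n8n runs the workflow internally. The three callables (`download_file`, `ocr_image`, `send_message`) are hypothetical stand-ins for the Telegram and Gemini nodes, injected so the logic can run without any network access.

```python
from typing import Callable


def run_ocr_pipeline(
    update: dict,
    download_file: Callable[[str], bytes],
    ocr_image: Callable[[bytes], str],
    send_message: Callable[[int, str], None],
) -> str:
    """Mirror the four stages: normalize, retrieve, extract, reply."""
    message = update["message"]
    chat_id = message["chat"]["id"]
    # The workflow reads the last entry of the photo array.
    file_id = message["photo"][-1]["file_id"]
    image_bytes = download_file(file_id)   # stands in for the file retrieval node
    text = ocr_image(image_bytes)          # stands in for the Gemini HTTP request
    send_message(chat_id, text)            # stands in for the Telegram reply node
    return text


if __name__ == "__main__":
    # Fake collaborators so the sketch runs offline.
    outbox = []
    demo_update = {
        "message": {
            "chat": {"id": 42},
            "photo": [{"file_id": "thumb"}, {"file_id": "full"}],
        }
    }
    run_ocr_pipeline(
        demo_update,
        download_file=lambda file_id: file_id.encode(),
        ocr_image=lambda data: f"text from {data.decode()}",
        send_message=lambda chat_id, text: outbox.append((chat_id, text)),
    )
    print(outbox)
```

Keeping the stages as separate functions is also a convenient way to see where you would slot in formatting or retry logic.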
Step-by-Step Implementation Guide
Step 1: Configure the Telegram Trigger
This workflow starts when a user sends a message with a photo to your Telegram bot.
- Add and open Telegram Intake Trigger.
- Set Updates to `message`.
- In Additional Fields, set Download to `true`.
- Credential Required: Connect your telegramApi credentials.
- Save the node to generate the Telegram webhook.
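Because the trigger listens to every `message` update, plain text messages will reach it too. The sketch below shows the rough shape of a photo update and a simple guard for it. The sample payload is trimmed and its values are made up; `has_photo` is a hypothetical helper, not part of the template.

```python
# Trimmed, made-up example of a Telegram update carrying a photo.
SAMPLE_UPDATE = {
    "update_id": 1,
    "message": {
        "message_id": 10,
        "chat": {"id": 123456789, "type": "private"},
        "photo": [
            {"file_id": "small-id", "width": 90, "height": 67},
            {"file_id": "large-id", "width": 1280, "height": 960},
        ],
    },
}


def has_photo(update: dict) -> bool:
    """Return True when the update is a message with at least one photo size."""
    message = update.get("message") or {}
    return bool(message.get("photo"))


print(has_photo(SAMPLE_UPDATE))
print(has_photo({"message": {"text": "hello"}}))
```

In n8n, a check like this could live in an IF node placed right after the trigger, so text-only messages never reach the OCR steps.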
Step 2: Connect Telegram and Normalize the Incoming Data
Normalize the incoming update to extract the chat ID and photo file ID for downstream nodes.
- Open Normalize Incoming Data and add two assignments.
- Set chatID to `={{ $json.message.chat.id }}`.
- Set Image to `={{ $json["message"]["photo"][$json["message"]["photo"].length - 1]["file_id"] }}`.
- Open Retrieve Photo File and set Resource to `file`.
- Set File ID to `={{ $json.Image.replace(/\n/g, '') }}`.
- Credential Required: Connect your telegramApi credentials in Retrieve Photo File.
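The expressions in this step read more easily as ordinary code. Here is a Python equivalent, offered as a sketch: it keeps the workflow's field names (`chatID`, `Image`), while `normalize_update` is a hypothetical name. Telegram sends several sizes of the same photo, and the workflow takes the last entry in that array.

```python
def normalize_update(update: dict) -> dict:
    """Extract the chat ID and the file ID of the last photo size."""
    message = update["message"]
    sizes = message["photo"]
    # Same index arithmetic as the n8n expression: photo[photo.length - 1].
    file_id = sizes[len(sizes) - 1]["file_id"]
    return {
        "chatID": message["chat"]["id"],
        # Mirrors .replace(/\n/g, '') from the File ID field.
        "Image": file_id.replace("\n", ""),
    }


example = {
    "message": {
        "chat": {"id": 555},
        "photo": [{"file_id": "low-res"}, {"file_id": "high-res\n"}],
    }
}
print(normalize_update(example))
```

If you later accept images sent as documents instead of photos, this is the step to extend, since those messages do not carry a `photo` array.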
Step 3: Set Up Image Conversion and OCR Request
Convert the downloaded photo to base64 data and send it to Gemini for OCR.
- Open Convert Binary to Data and set Operation to `binaryToPropery`.
- Open Gemini Text Extract and set Method to `POST`.
- Set URL to `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent`.
- Set Specify Body to `json` and Send Body to `true`.
- Set JSON Body to `={ "contents": [ { "role": "user", "parts": [ { "inlineData": { "mimeType": "image/jpeg", "data": "{{ $json.data }}" } }, { "text": "Extract text" } ] } ] }`.
- Credential Required: Connect your httpQueryAuth credentials in Gemini Text Extract.
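For reference, this is roughly what the conversion and request body amount to outside n8n. The sketch builds the same JSON structure as the step above and makes no network call. The PNG check via file signature is an added assumption rather than something the template does, and the helper names are hypothetical. The query-auth credential is assumed to carry your API key as a `key` parameter.

```python
import base64
import json

GEMINI_URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-2.0-flash:generateContent"
)

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def guess_mime_type(image_bytes: bytes) -> str:
    """Assumption: treat anything that is not a PNG as JPEG."""
    if image_bytes.startswith(PNG_SIGNATURE):
        return "image/png"
    return "image/jpeg"


def build_gemini_body(image_bytes: bytes, prompt: str = "Extract text") -> dict:
    """Build the same JSON body the Gemini Text Extract node sends."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {
                        "inlineData": {
                            "mimeType": guess_mime_type(image_bytes),
                            # Base64 text, the equivalent of {{ $json.data }}.
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                    {"text": prompt},
                ],
            }
        ]
    }


fake_jpeg = b"\xff\xd8\xff" + b"\x00" * 8  # placeholder bytes, not a real image
print(json.dumps(build_gemini_body(fake_jpeg), indent=2))
```

Printing the body like this is a quick way to compare it with what the HTTP Request node shows in its input panel.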
Note: if your images are not JPEG, set mimeType to match, for example image/png in the JSON body.
Step 4: Configure the Telegram Reply Output
Send the extracted text back to the original Telegram chat.
- Open Send Telegram Reply.
- Set Text to `={{ $json.output }}`.
- Set Chat ID to `={{ $('Normalize Incoming Data').item.json.chatID }}`.
- Credential Required: Connect your telegramApi credentials in Send Telegram Reply.
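Two details are worth knowing when you wire up the reply. A raw `generateContent` response normally nests the text under `candidates[0].content.parts[...].text`, so if your execution shows no `output` field, that path is where to look. Telegram also caps a single message at 4096 characters. The sketch below handles both; the function names and the fallback string are hypothetical.

```python
TELEGRAM_TEXT_LIMIT = 4096  # maximum characters in one Telegram message


def extract_text(response: dict) -> str:
    """Join the text parts of the first candidate in a generateContent response."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)


def split_for_telegram(text: str, limit: int = TELEGRAM_TEXT_LIMIT) -> list:
    """Split long text into chunks that each fit in one message."""
    if not text.strip():
        # Telegram rejects empty messages, so send a placeholder instead.
        return ["(no text detected)"]
    return [text[i:i + limit] for i in range(0, len(text), limit)]


sample_response = {
    "candidates": [
        {"content": {"role": "model", "parts": [{"text": "Total: 42.00 EUR"}]}}
    ]
}
for chunk in split_for_telegram(extract_text(sample_response)):
    print(chunk)
```

Splitting on a fixed character count is the bluntest option. Breaking on line boundaries would read better for long documents, at the cost of a little more logic.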
Step 5: Test and Activate Your Workflow
Validate that each node passes data correctly and the OCR response returns to Telegram.
- Click Execute Workflow and send a photo to your Telegram bot.
- Confirm the execution order: Telegram Intake Trigger → Normalize Incoming Data → Retrieve Photo File → Convert Binary to Data → Gemini Text Extract → Send Telegram Reply.
- Verify that Gemini Text Extract returns an output field and that Send Telegram Reply posts the text back to Telegram.
- When successful, toggle the workflow to Active to enable production use.
Common Gotchas
- Telegram bot credentials can expire or get rotated. If replies suddenly stop, check the Telegram credentials inside n8n first, then confirm the bot still has permission to read incoming photos.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Gemini prompts and parsing matter more than people expect. If you want clean paragraphs (not a wall of text), adjust the request payload and add a little formatting logic before the Telegram reply.
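As an example of the "little formatting logic" mentioned in the last point, here is a small cleanup function. It is a sketch of one reasonable approach, not something the template includes, and `tidy_ocr_text` is a hypothetical name. In n8n, the same idea could go into a Code node placed before the Telegram reply.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and cap blank-line runs."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines()]
    text = "\n".join(lines)
    # Allow at most one empty line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


messy = "  Invoice   #1001 \n\n\n\nTotal:\t42.00  \n"
print(tidy_ocr_text(messy))
```

A more specific prompt than "Extract text" can do part of this work upstream, but a deterministic cleanup step keeps the output predictable even when the model's spacing varies.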
Frequently Asked Questions
How long does it take to set up this workflow?
Less than 5 minutes if your bot and API key are ready.
Do I need coding skills to use it?
No. You’ll connect Telegram, paste your Gemini API key, and run a test message to confirm it replies with text.
Is n8n free to use for this workflow?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Gemini API usage costs (often pennies for light OCR use).
Where can I host n8n?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I use a different image source, such as Google Drive?
Yes, but you’ll swap the intake step. This workflow currently listens to Telegram images, then uses “Retrieve Photo File” and “Convert Binary to Data” before calling Gemini. You can replace the Telegram intake with a Google Drive trigger (new file) and keep the Gemini Text Extract step, as long as you still convert the file into the request format Gemini expects.
Why is my Telegram connection failing?
Usually it’s a bad or rotated bot token saved in n8n. Update the Telegram credentials, then verify the bot is receiving messages (it won’t if privacy settings or chat context are wrong). Also check that the workflow is reading the right file ID after the “Normalize Incoming Data” step, because Telegram payloads can differ between photos and documents.
How many images can this workflow handle?
On n8n Cloud, it depends on your monthly execution limit; self-hosting has no fixed cap beyond your server. In practice, teams run this for dozens or hundreds of images a day without thinking about it, as long as they’re not hitting Telegram or Gemini rate limits. If you expect big spikes (like batch processing event photos), add simple throttling and error retries so messages don’t fail when multiple images arrive at once. The nice part is you can scale gradually: start with personal use, then roll it out to a shared team bot.
Is n8n better than Zapier or Make for this?
Often, yes, especially if you want control over formatting and retries. n8n is flexible when you need to fetch files, transform binary data, and make custom HTTP calls to Gemini, and it doesn’t punish you for adding logic. Zapier or Make can be quicker for simple prototypes, but OCR flows tend to get fiddly once you care about output quality. Talk to an automation expert if you want help choosing.
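The "error retries" suggested for traffic spikes usually mean retrying a failed call after a growing pause. Here is a minimal sketch of that pattern, assuming a generic callable. `with_retries` and its defaults are illustrative and not taken from the template.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_ocr() -> str:
        """Fails once, then succeeds, to simulate a temporary rate limit."""
        state["calls"] += 1
        if state["calls"] < 2:
            raise RuntimeError("temporary failure")
        return "extracted text"

    print(with_retries(flaky_ocr, base_delay=0.1))
```

Passing `sleep` in as a parameter keeps the function easy to test, since a test can record the delays instead of actually waiting.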
This is one of those automations you set up once and then wonder how you lived without it. Your screenshots stop being dead ends, and you get hours back over a normal month.