Telegram + Gemini Vision: extract text from images

Retyping text from screenshots is the kind of “small task” that quietly ruins your day. You lose a few minutes here, make a typo there, then waste more time fixing it in Slack, a doc, or a CRM.

This Telegram OCR automation helps marketers pulling ad copy, ops teams capturing receipts or labels, and consultants collecting notes from calls and workshops. You send an image to a Telegram bot, and you get clean, copyable text back, typically in under a minute.

Below, you’ll see exactly what the workflow does, how it’s wired, what you need to run it, and where teams usually trip up when they set it live.

How This Automation Works

The full n8n workflow, from trigger to final output:

n8n Workflow Template: Telegram + Gemini Vision: extract text from images


Workflow diagram (Telegram Flow): Telegram Trigger → Clean Input Data → get file → Extract from File → Gemini OCR → Telegram

The Problem: Screenshots Trap Your Information

Screenshots are convenient until you need the text inside them. A teammate shares a pricing table as an image. A client sends “just a quick photo” of a form. You grab a snippet of ad copy from a swipe file on your phone. Now the information is stuck. You either retype it (slow), run it through a random OCR website (sketchy), or postpone it (and forget). The annoying part is the mental context switching: zooming in, copying line by line, fixing weird spacing, and hoping you didn’t swap a 0 for an O.

The friction compounds. It’s rarely one screenshot. It’s ten.

You end up spending about 10 minutes per image just to get usable text, especially when it’s small font or messy lighting.
Typos sneak into quotes, addresses, and product specs, which means follow-up messages and avoidable back-and-forth.
OCR websites often require uploads, and frankly that can be a compliance headache for client or internal documents.
Even when you extract the text, it’s not where you need it, so you still bounce between apps to share it.

The Solution: Telegram Bot OCR Powered by Gemini Vision

This workflow turns Telegram into your “send it here, get text back” inbox. When you message your bot with a screenshot or photo, n8n grabs the image file, converts it into a format the AI can read, and sends it to the Google Gemini Vision API for analysis. Gemini extracts the text it sees (including multi-line blocks like paragraphs, menus, or snippets of UI). Then n8n posts the results right back into the same Telegram chat, so the text is immediately copyable. No downloading, no retyping, no hunting for a tool you used three months ago.

The workflow starts the moment a new image hits your Telegram bot. It then retrieves the actual photo file, converts the binary content into the payload Gemini expects, and sends an HTTP request to Gemini Vision. Finally, it replies in Telegram with the extracted text so you can paste it anywhere.

What You Get: Automation vs. Results

What This Workflow Automates

Detects incoming Telegram images automatically via a bot trigger.
Fetches the full-resolution file from Telegram without you downloading anything.
Converts the image binary into clean data ready for OCR analysis.
Sends the image to Gemini Vision through an HTTP request and parses the response.

Results You’ll Get

Turn a screenshot into copyable text in about a minute, not ten.
Fewer typos in names, SKUs, addresses, and quotes because you’re not rekeying it by hand.
A simpler “single place” workflow since Telegram becomes your intake and delivery channel.
Faster handoffs to teammates because the output is already in chat.
Less tool sprawl, which makes this easy to adopt across a small team.

Example: What This Looks Like

Say you capture 15 screenshots a week: competitor ads, analytics callouts, and random “don’t forget this” notes. If you spend about 10 minutes per screenshot retyping and cleaning it up, that’s roughly 2.5 hours weekly. With this workflow, you forward each image to Telegram (maybe 30 seconds), then wait for the reply (often under a minute). Realistically, you get back about 2 hours a week, and the text is already ready to paste into a doc or message.
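The arithmetic behind that estimate can be checked in a few lines. This is only a sketch: the 1.5 minutes per image is an assumption built from the figures above (about 30 seconds to forward plus up to a minute of waiting), and weeklyMinutes is an illustrative helper name.

```javascript
// Weekly time estimate using the figures from the example above.
// weeklyMinutes is a hypothetical helper, not part of the workflow.
function weeklyMinutes(imagesPerWeek, minutesPerImage) {
  return imagesPerWeek * minutesPerImage;
}

const manual = weeklyMinutes(15, 10);     // retyping by hand
const automated = weeklyMinutes(15, 1.5); // assumed: ~30 s to forward + up to ~1 min wait
const savedHours = (manual - automated) / 60;

console.log(manual / 60); // 2.5 (hours of manual work per week)
console.log(savedHours);  // 2.125 (hours saved, i.e. "about 2 hours")
```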

What You’ll Need

n8n instance (try n8n Cloud free)
Self-hosting option if you prefer (Hostinger works well)
Telegram to receive images and return text.
Google Gemini Vision API for OCR and text extraction.
Gemini API key (get it from Google AI Studio).

Skill level: Beginner. You’ll paste in an API key and connect a Telegram bot, then test with a sample screenshot.


How It Works

A Telegram image kicks things off. When someone sends your bot a photo or screenshot, the Telegram Trigger picks it up instantly and passes along the message metadata.

The workflow normalizes what came in. n8n cleans up the incoming fields so the next steps always know which file ID to request, even if Telegram’s payload shape varies a bit.

n8n retrieves and converts the image. It calls Telegram again to download the actual file, then converts the binary so it can be safely included in the request that Gemini Vision expects.

Gemini extracts text and Telegram receives the reply. An HTTP request sends the image to Gemini Vision, n8n pulls the extracted text from the response, and the final Telegram node sends that text back to your chat.

You can easily modify the outgoing message format to include line breaks, headings, or “copy blocks” based on your needs. See the full implementation guide below for customization options.
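To make the "pulls the extracted text from the response" step concrete, here is a minimal sketch of reading text out of a Gemini generateContent response. The raw response nests text under candidates[].content.parts[].text. The helper name extractGeminiText and the sample payload are illustrative assumptions, not taken from the template.

```javascript
// Pull the extracted text out of a Gemini generateContent response.
// extractGeminiText is an illustrative helper, not a node in the workflow.
function extractGeminiText(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  // Join every text part; parts without text contribute nothing.
  return parts
    .map((part) => part.text ?? "")
    .join("")
    .trim();
}

// Sample response, trimmed to the fields used here (made-up content).
const sample = {
  candidates: [
    { content: { role: "model", parts: [{ text: "SALE 20% OFF\nEnds Friday\n" }] } },
  ],
};

console.log(extractGeminiText(sample));
console.log(extractGeminiText({}) === ""); // true: no candidates yields an empty string
```

Inside n8n, an expression along the lines of {{ $json.candidates[0].content.parts[0].text }} on the HTTP node's output would likely do the same job. If the reply arrives empty, checking where the text actually sits in that output is a reasonable first step.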

Step-by-Step Implementation Guide

Step 1: Configure the Telegram Trigger

This workflow starts when a user sends a message with a photo to your Telegram bot.

Add and open Telegram Intake Trigger.
Set Updates to message.
In Additional Fields, set Download to true.
Credential Required: Connect your telegramApi credentials.
Save the node to generate the Telegram webhook.

Tip: Make sure your Telegram bot is started in Telegram; otherwise no updates will arrive.

Step 2: Connect Telegram and Normalize the Incoming Data

Normalize the incoming update to extract the chat ID and photo file ID for downstream nodes.

Open Normalize Incoming Data and add two assignments.
Set chatID to ={{ $json.message.chat.id }}.
Set Image to ={{ $json["message"]["photo"][$json["message"]["photo"].length - 1]["file_id"] }}.
Open Retrieve Photo File and set Resource to file.
Set File ID to ={{ $json.Image.replace(/\n/g, '') }}.
Credential Required: Connect your telegramApi credentials in Retrieve Photo File.

⚠️ Common Pitfall: If the incoming message has no photo (for example, a plain text message), Normalize Incoming Data cannot resolve a file ID and Retrieve Photo File will fail.
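One way to guard against that pitfall is to pick the file ID defensively before the download step. The sketch below mirrors the expression above (last entry of message.photo, which is typically the largest size) and also accepts images sent as files, which Telegram delivers under message.document. pickFileId is a hypothetical helper; in n8n this logic could live in a Code node or be approximated with an IF node.

```javascript
// Choose the file_id to download from a Telegram message.
// pickFileId is a hypothetical helper, not part of the template.
function pickFileId(message) {
  const photos = message?.photo;
  if (Array.isArray(photos) && photos.length > 0) {
    // Same choice as the workflow's expression: the last entry in the photo array.
    return photos[photos.length - 1].file_id;
  }
  // Images sent "as a file" arrive as a document rather than a photo.
  const doc = message?.document;
  if (doc && typeof doc.mime_type === "string" && doc.mime_type.startsWith("image/")) {
    return doc.file_id;
  }
  return null; // nothing to OCR; skip the download step
}

// Made-up messages for illustration.
const photoMessage = { photo: [{ file_id: "small" }, { file_id: "large" }] };
const textMessage = { text: "hello" };

console.log(pickFileId(photoMessage)); // large
console.log(pickFileId(textMessage));  // null
```

Returning null instead of an empty string makes it easy to branch: route null to a short "please send an image" reply rather than letting the file retrieval step fail.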

Step 3: Set Up Image Conversion and OCR Request

Convert the downloaded photo to base64 data and send it to Gemini for OCR.

Open Convert Binary to Data and set Operation to binaryToPropery.
Open Gemini Text Extract and set Method to POST.
Set URL to https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent.
Set Specify Body to json and Send Body to true.
Set JSON Body to ={ "contents": [ { "role": "user", "parts": [ { "inlineData": { "mimeType": "image/jpeg", "data": "{{ $json.data }}" } }, { "text": "Extract text" } ] } ] }.
Credential Required: Connect your httpQueryAuth credentials in Gemini Text Extract.

Tip: If you send PNG images, change mimeType to image/png in the JSON body.
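Rather than editing mimeType by hand, the request body could be built with the type inferred from the data itself. The sketch below relies on the fact that base64-encoded JPEG data begins with "/9j/" and PNG data with "iVBORw0KGgo". The function names, and the fallback to image/jpeg, are assumptions for illustration. The body shape matches the JSON Body shown in this step.

```javascript
// Guess the MIME type from the first characters of a base64 string.
// JPEG data starts with "/9j/" and PNG data with "iVBORw0KGgo" once base64-encoded.
function sniffMimeType(base64) {
  if (base64.startsWith("/9j/")) return "image/jpeg";
  if (base64.startsWith("iVBORw0KGgo")) return "image/png";
  return "image/jpeg"; // assumption: fall back to the workflow's default
}

// Build the same body the HTTP node sends, with the MIME type filled in.
function buildGeminiBody(base64, prompt = "Extract text") {
  return {
    contents: [
      {
        role: "user",
        parts: [
          { inlineData: { mimeType: sniffMimeType(base64), data: base64 } },
          { text: prompt },
        ],
      },
    ],
  };
}

// The 8-byte PNG signature, base64-encoded, is enough to demonstrate the sniffing.
const pngHeader = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]).toString("base64");
console.log(buildGeminiBody(pngHeader).contents[0].parts[0].inlineData.mimeType); // image/png
```

The prompt parameter is also the natural place to ask for more structure, for example requesting that line breaks be preserved.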

Step 4: Configure the Telegram Reply Output

Send the extracted text back to the original Telegram chat.

Open Send Telegram Reply.
Set Text to ={{ $json.output }}.
Set Chat ID to ={{ $('Normalize Incoming Data').item.json.chatID }}.
Credential Required: Connect your telegramApi credentials in Send Telegram Reply.
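One thing to plan for in the reply step: Telegram limits a single text message to 4096 characters, so OCR output from a dense page may need to be split. The sketch below is one possible approach, not something the template does; splitForTelegram is a hypothetical helper, and it measures length in JavaScript string units, which is an approximation of how Telegram counts characters.

```javascript
const TELEGRAM_MAX_CHARS = 4096;

// Split long text into chunks that fit in one Telegram message each.
function splitForTelegram(text, limit = TELEGRAM_MAX_CHARS) {
  const chunks = [];
  let rest = text;
  while (rest.length > limit) {
    // Prefer to break at the last newline at or before the limit.
    let cut = rest.lastIndexOf("\n", limit);
    if (cut <= 0) cut = limit; // no usable newline: hard cut
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut).replace(/^\n/, ""); // drop the newline we broke on
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

const longText = "x".repeat(4000) + "\n" + "y".repeat(200);
console.log(splitForTelegram(longText).map((c) => c.length)); // [ 4000, 200 ]
```

Each chunk would then be sent as its own message, in order. An empty result (no text found) is worth handling separately, since Telegram rejects empty messages.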

Step 5: Test and Activate Your Workflow

Validate that each node passes data correctly and the OCR response returns to Telegram.

Click Execute Workflow and send a photo to your Telegram bot.
Confirm the execution order: Telegram Intake Trigger → Normalize Incoming Data → Retrieve Photo File → Convert Binary to Data → Gemini Text Extract → Send Telegram Reply.
Verify that Gemini Text Extract returns an output field and that Send Telegram Reply posts the text back to Telegram.
When successful, toggle the workflow to Active to enable production use.


Common Gotchas

Telegram bot tokens can be revoked or rotated. If replies suddenly stop, check the Telegram credentials inside n8n first, then confirm the bot still has permission to read incoming photos.
If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
Gemini prompts and parsing matter more than people expect. If you want clean paragraphs (not a wall of text), adjust the request payload and add a little formatting logic before the Telegram reply.
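As an example of the "little formatting logic" mentioned in the last point, here is a minimal cleanup sketch that could run before the Telegram reply. tidyOcrText is a hypothetical helper; the specific rules (collapse repeated spaces, allow at most one blank line) are assumptions about what "clean" means and are easy to adjust.

```javascript
// Light cleanup for OCR output before sending it to chat.
function tidyOcrText(raw) {
  return raw
    .replace(/\r\n?/g, "\n") // normalize line endings
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim()) // collapse runs of spaces and tabs
    .join("\n")
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line between paragraphs
    .trim();
}

const messy = "  Hello   world \r\n\r\n\r\n\r\nNext  line  ";
console.log(JSON.stringify(tidyOcrText(messy))); // "Hello world\n\nNext line"
```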

Frequently Asked Questions

How long does it take to set up this Telegram OCR automation?

Less than 5 minutes if your bot and API key are ready.

Do I need coding skills to automate Telegram OCR?

No. You’ll connect Telegram, paste your Gemini API key, and run a test message to confirm it replies with text.

Is n8n free to use for this Telegram OCR automation workflow?

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Gemini API usage costs (often pennies for light OCR use).

Where can I host n8n to run this Telegram OCR automation?

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

Can I customize this Telegram OCR automation workflow for PDFs or Google Drive uploads?

Yes, but you’ll swap the intake step. This workflow currently listens to Telegram images, then uses “Retrieve Photo File” and “Convert Binary to Data” before calling Gemini. You can replace the Telegram intake with a Google Drive trigger (new file) and keep the Gemini Text Extract step, as long as you still convert the file into the request format Gemini expects.

Why is my Telegram connection failing in this workflow?

Usually it’s a bad or rotated bot token saved in n8n. Update the Telegram credentials, then verify the bot is receiving messages (it won’t if privacy settings or chat context are wrong). Also check that the workflow is reading the right file ID after the “Normalize Incoming Data” step, because Telegram payloads can differ between photos and documents.

How many images can this Telegram OCR automation handle?

On n8n Cloud, it depends on your monthly execution limit; self-hosting has no fixed cap beyond your server. In practice, teams run this for dozens or hundreds of images a day without thinking about it, as long as you’re not hitting Telegram or Gemini rate limits. If you expect big spikes (like batch processing event photos), add simple throttling and error retries so messages don’t fail when multiple images arrive at once. The nice part is you can scale gradually: start with personal use, then roll it out to a shared team bot.
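The "error retries" idea can be sketched as an exponential backoff wrapper. This is an illustration under assumptions: the function names, the default delays, and the retry count are made up, and n8n nodes also offer a built-in Retry On Fail setting that may be enough without custom code.

```javascript
// Exponential backoff schedule: base, 2x, 4x, ... capped at maxMs.
function backoffDelays(retries, baseMs = 500, maxMs = 8000) {
  return Array.from({ length: retries }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}

// Run an async task, retrying on failure with the schedule above.
// The sleep parameter is injectable so the wrapper can be exercised without real waiting.
async function withRetry(task, retries = 3, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  const delays = backoffDelays(retries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= delays.length) throw err; // out of retries
      await sleep(delays[attempt]);
    }
  }
}

console.log(backoffDelays(5)); // [ 500, 1000, 2000, 4000, 8000 ]
```

In use, the call to Gemini or Telegram would be passed in as the task, for example withRetry(() => sendToGemini(body)), where sendToGemini stands in for whatever function makes the request.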

Is this Telegram OCR automation better than using Zapier or Make?

Often, yes, especially if you want control over formatting and retries. n8n is flexible when you need to fetch files, transform binary data, and make custom HTTP calls to Gemini, and it doesn’t punish you for adding logic. Zapier or Make can be quicker for simple prototypes, but OCR flows tend to get fiddly once you care about output quality.

This is one of those automations you set up once and then wonder how you lived without it. Your screenshots stop being dead ends, and you get hours back over a normal month.
